Datamining standard libraries for Gramps?

When I open one of my place databases, I would really have liked multi-processing and multi-thread utilization; it would have reduced the load time significantly…
Same with the deep connections and some of the graphs…

So I think adding multi-core, multi-threaded functionality to Gramps would have been of great help for a lot of features… And a limit could have been added in the settings… “Limit to 12 threads / 6 cores”
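
For what it's worth, the cap itself really is just a parameter. Here is a minimal sketch (my own, not Gramps code) of how a user-set thread limit could be applied to any batch job; `max_threads` stands in for a hypothetical preference like the "Limit to 12 threads / 6 cores" setting above:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def run_with_thread_limit(task, items, max_threads=None):
    """Run task(item) for every item, using at most max_threads worker threads."""
    # Fall back to the machine's logical CPU count if no limit is configured.
    limit = max_threads or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=limit) as pool:
        return list(pool.map(task, items))

# e.g. run_with_thread_limit(render_one_place, place_handles, max_threads=12)
# (render_one_place and place_handles are made-up names for illustration)
```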

I also wish there was a way to preload a database subset into memory when you have a large amount of memory… and a background “save committed changes” task.

Why do you want to limit this? The better solution would be to ask the system how many cores it has and whether it uses hyper-threading. If you have a CPU with 256 cores (512 threads), such a limit becomes meaningless.
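
Asking the system is indeed cheap; in Python it is essentially one call. The sketch below also uses `psutil` (a third-party library, purely for illustration, not something Gramps ships) to tell physical cores from logical threads:

```python
import os

logical = os.cpu_count()                 # logical processors (threads)
print(f"{logical} logical processors")

try:
    import psutil                        # optional, third-party
    print(f"{psutil.cpu_count(logical=False)} physical cores")
except ImportError:
    pass
```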

Because I have other services and processes running, and I know that if something is not limited, it can suddenly cause a system conflict and loss of data if another process starts…
I have experienced, multiple times, software without a manual thread limit hanging the system until the processing was finished.
Yes, it is bad programming, but still… an option to limit it is just a single simple setting…

When fantasizing about multi-core, I was unrealistically hoping for an alternate approach. More of an intelligent throttling & load balancing concept.

If the system has more than a dual core, the ideal would be for the main Gramps application's performance to remain independent of add-on load. So half of the multi-core allocation might be dedicated to the built-in Gramps processes. Conversely, add-ons might be allowed to touch only the other cores. Thus, even badly written add-ons wouldn't slow basic navigation and GUI responsiveness.

And, in monitoring load balance with OS performance monitors, I note that system & legacy application processes lean more heavily on the first 2 cores. So it seems like it would be better to have our GUI favor the more idle, higher-numbered cores.
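
If one wanted to steer a worker process away from the busy first cores, CPU affinity is the usual mechanism. A rough sketch, Linux-only via `os.sched_setaffinity` (Windows would need something like `psutil.Process().cpu_affinity()`), and bearing in mind the OS scheduler already balances load on its own — affinity only restricts which cores a process is eligible to use:

```python
import os

def pin_to_upper_cores(pid=0, reserve_first=2):
    """Make a process (default: the current one) eligible only for the
    cores above the first 'reserve_first' ones."""
    total = os.cpu_count() or 1
    upper = set(range(reserve_first, total)) or {0}
    if hasattr(os, "sched_setaffinity"):   # Linux only
        os.sched_setaffinity(pid, upper)

pin_to_upper_cores()   # e.g. keep cores 0-1 free for the system
```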

The one doesn't exclude the other…

If you have 2 or 4 cores without hyper-threading and you use e.g. Chrome or Firefox in addition to Gramps to do research while Gramps runs a “deep connections” rendering on a database with 300-400k relations, you may want to be able to run your web browser and maybe watch a YouTube video about searching the ancient library of Alexandria at the same time…?

I have 3 different RAW converters/DAM programs; two of them have a thread-limit setting for rendering edited images for export, the last one does not…
The two with the setting work in parallel with any other software without lag or hangs; the last one just goes through the roof, and even though I can do basic OS work, I cannot run any other software… it even takes resources from my databases, so they become unresponsive until it's finished.
Same with a few video rendering programs…

And I run a Ryzen 9 with 12 cores, 64 GB RAM, and 2 fast NVMe drives, one for the system and one as a working area for data…
No software should take all the resources in a system, so it's a good thing to be able to limit the resources for any process if and when needed…

And yes, it should be possible to allocate the least-utilized cores to a process; at least that's my understanding of what I have read about multi-threading…

It would have been great if Gramps could shift to being a multi-threaded application, and I'd urge a manual setting for the number of cores/threads, since it's just a parameter… It could default to a quarter or half of the available resources, but people who run on high-end platforms might want to run a process in Gramps on 8 of 12 cores because they want something to go faster, and then turn it back to 4 cores for normal usage…
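
The "default to a quarter or half" idea really is tiny, assuming such a preference existed:

```python
import os

def default_worker_count(fraction=0.25):
    # A quarter of the logical processors by default, never fewer than one;
    # a user on a high-end box could raise the fraction or set an absolute cap.
    return max(1, int((os.cpu_count() or 1) * fraction))
```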

This is not a problem: even if you only have 4 cores without hyper-threading, you can create 30 threads. It will work, but only 4 will be running in parallel. It will be less performant, that's all.

The whole point was performance, and limiting the resources so that it is possible to run multiple other services in parallel, not in series.

Another thing: when I load my 32k-place database, it takes 28 to 30 seconds to start Gramps. That is way too long on a new Ryzen system with NVMe SSDs for both system and data storage.
Multi-threading would have reduced that load time.

I'm not looking forward to the day I load my 800k+ place-name database. (And that is only Places; how will it be when I add a few thousand People, Events and Relations?)

I am not sure. The loading is not multi-threaded, either for the libraries or for the database.

SQLite can work in multi-threaded mode, as can any other database server, and MongoDB also supports multi-threaded clients.

Yes, but the problem is that we have only one file for the database, and this file is on a single disk. You can put 50 threads on reading the database; you will not run any faster.

Still, it is multi-threaded, and when the SQLite database is on an SSD, a single thread doesn't utilize the full read speed of the SSD, but multiple threads will.
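
SQLite does allow many concurrent readers on one file, so parallel reads would look roughly like the sketch below: one read-only connection per worker thread. The path, table and column names are made up for illustration and are not necessarily the real Gramps schema, and whether this actually beats a single reader depends on how much time is spent in SQLite and the disk versus in Python itself.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

DB_PATH = "tree.db"   # hypothetical path to the .db file

def fetch_chunk(offset, limit=10_000):
    # One connection per thread; "mode=ro" opens the file read-only.
    con = sqlite3.connect(f"file:{DB_PATH}?mode=ro", uri=True)
    try:
        cur = con.execute(
            "SELECT handle, blob_data FROM place LIMIT ? OFFSET ?",
            (limit, offset),
        )
        return cur.fetchall()
    finally:
        con.close()

# Four readers pulling 10k rows each.
with ThreadPoolExecutor(max_workers=4) as pool:
    chunks = list(pool.map(fetch_chunk, range(0, 40_000, 10_000)))
```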

This is something I really wonder about: why isn't the loading of the Python libraries done in a multi-threaded way? Why can't the libraries, after initialization, be loaded into memory in parallel?
I know the disk speed of older hard disks is slow, but even on a SATA disk the loading of Gramps utilizes only a few percent of the HDD's speed, and on my new system the loading of Gramps takes 15-25 seconds, plus an additional 5-10 seconds for a small database. Still, the utilization of the SSD's read speed is less than 1%, both when loading libraries at Gramps startup and when loading the small SQLite database. When I load one of my larger databases, which is still small in the SQLite world, it takes 10 to 30 seconds to load the database while Gramps is running…

There have to be ways to do those things more efficiently?
Just so it's clear, I don't mind, and it's not a complaint against Gramps, but I don't look forward to the day I do the first import of my 800k-place database.
– I run a Ryzen 9 3900X overclocked on all cores to 4300 MHz, and I have 2 NVMe SSDs, one for system and software, one for data.

With my tree, which is roughly 300 people, so a small one, loading straight into it takes maybe 6-8 seconds; no single core gets much load or anything, so that time is most likely software-limited.

Most likely not representative of bigger ones, though.

I have tested this on 2 systems and one virtual machine, mostly because I use the MongoDB connector, which is much slower, so I wanted to see how much.
I tested 3 different databases on all three systems, the exact same databases without media links: one with 891 people, a few hundred places, approx. 1,800 events, ±4,000 citations and a few hundred sources; a second with 4,700 people, approx. 6,000 events, 13,000 citations, a little over 1,200 sources, and a few hundred places; and a last one with 32,000 places.
The virtual machine was Debian 10…

The result was approximately the same on all systems; the database load times only differed by a few seconds from system to system, and that is alarming, since the “old” system was my 7-year-old Intel i7-4820K with SATA SSDs vs. a new Ryzen (see spec in my earlier comment).
I thought there would be larger differences between the systems, given that the storage on the new one is more than 6 times as fast…
But as long as everything is loaded in one single thread, every file must be loaded one at a time, and each object in the database must be queried and loaded one at a time; and the same is true when you create any graphical or textual report… or apply any filter…

As I write, it doesn't matter much for me now, but I really wonder how it is for someone with 600-700k people and maybe the same number of citations and events…?

If it scales linearly, my 800k-place database will take approx. 1 hour to load, and that will happen every time I change something so that the views must update.
I really hope I’m wrong…

This is related to all operating systems. When you load a program into memory, all libraries are loaded sequentially. These are things that are done only once, at the beginning. So why would you want them to use multi-threading? This is complicated, because there are relationships between the different libraries.

It is very difficult to do for an operating system, as each program has its own specificities.

If you have a processor with 8 threads and you run Firefox and Gramps, each program will launch 8 threads, because you don't know how many threads are used by other programs. In the end, all threads are in a queue and scheduled sequentially.

My photo and video software does not work that way; there I set a limit on the cores that shall be used, and the software utilizes those threads to the full extent when I start work. That also includes all the GPUs I have installed, so when I ran with 4 GPUs I could utilize all of them in addition to the number of cores I had set…
The cores are also used when loading the software.

Totally false. The only thing you can set is the CPU on which a thread is eligible to run.
If two programs say “I need to use CPU 2 to run my thread”, they will share that CPU and you'll have 50% of the CPU for each thread.

You have the right to believe that. I have never seen that in an operating system. Loading a program into memory is too complex to use multi-threading.

Multi-threading is only useful if you have long tasks which can be executed separately, and this is not the case for program loading.

It would be nice if someone just had a massive tree you could download to test things with.

I don't need to “believe” that, I know it; I see it every time I load Cytoscape. From the moment I click the shortcut until all “apps” (Cytoscape add-ons) are loaded, all my 24 threads (or, as they are called in Windows, logical processors) go to 45% utilization and stay there for the 5-10 seconds it takes to load the application and all its modules.

Constellation, another network graph application, seems to use the first 4 logical processors, and in addition puts approx. 30% load on the first logical processor of each of the remaining cores.

Anaconda loads using the first 4 logical processors (0-3) when starting…

And this is how it looks while Anaconda Navigator is running and installing Glueviz into a virtual env on an NVMe SSD.

And this is how it looks when I install RStudio into the same virtual env.

So it looks to me like both Python and Java applications can be started and loaded in a multi-threaded process?
The DAM and video software I have is written in C/C++/C#, I think, so it's not “directly comparable”.