Performance Issues with Gramps

One of my improvement wishes, on the subject of saving time:

A Place Tree item that is not already Enclosed By another place requires 11 steps to set its Enclosed By. Could we have drag & drop, or better still a bypass: a right-click option on a Place in the list, "Send to Enclosure", where you enter the enclosing Place ID?
phil

As promised, I re-ran the Narrative Web Site Report, first checking that I had removed the 3 instances of gc.collect.
Overall time was approx. 1 hour, compared with 10 hours previously on the same i3 NUC with 16 GB RAM and an SSD.
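For context, a minimal sketch (not the actual report code) of why gc.collect calls inside a report loop get expensive as the number of live objects grows: every call runs a full collection that has to scan all tracked objects, so the cost is paid again and again.

import gc
import time

# Build a lot of live objects, roughly like a large tree held in memory.
data = [{"id": i, "refs": list(range(10))} for i in range(200_000)]

def process(items, collect_each_chunk):
    start = time.perf_counter()
    for pos in range(0, len(items), 1000):
        _ = [item["id"] for item in items[pos:pos + 1000]]
        if collect_each_chunk:
            gc.collect()  # a full collection scans all live objects every time
    return time.perf_counter() - start

print("without gc.collect:", process(data, False))
print("with gc.collect per chunk:", process(data, True))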
phil

1 Like

A similar one. Mine is the Karel GEDCOM that I sent, merged with my own tree, after cleaning duplicates, totalling a little over 635k persons.

Results on my Fujitsu Lifebook E Series laptop, with an Intel i5, 8 GB RAM, and a Hitachi HDD, running Gramps 5.1.6 with SQLite are:

  • 45 seconds on Linux Mint 21.3
  • 65 seconds on Windows 10 Pro, 64 bit, with Windows Defender as the standard virus scanner.

And on that same machine, RootsMagic shows the filtered results within a second too.

As you can see, hardware matters, and on the same hardware the OS also makes a difference, and maybe virus scanners too. On my laptop it is clear that Linux is faster than Windows, which may be caused by the file system (ext4 on Linux, NTFS on Windows) and by other factors, like the cloud service (MEGA vs. OneDrive) and the virus scanner (none on Linux, Windows Defender on Windows).

My laptop is 10 years old, and in most cases laptop drives are slower than desktop drives. Even so, on my laptop Gramps 5.1.6 with SQLite is still a bit faster than yours with BSDDB, and that is on an HDD.

Which leads me to repeat the question that I asked in a PM: what is your hardware like, and which virus scanner do you use?

In our BSDDB backend we allocate a large read cache. This could account for the difference in performance between SQLite and BSDDB. It would be easy to test by just allocating a bigger cache for SQLite.

From a command line, type:

cd .gramps/grampsdb/<database_id>
sqlite3 sqlite.db

Now run the query:

SELECT gramps_id FROM person WHERE surname = 'Smith';

How long does it take?
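Inside the sqlite3 shell you can also switch on its built-in timer with .timer on before running the query. Alternatively, here is a small Python sketch to time the same query; reuse the <database_id> from the cd command above.

import os
import sqlite3
import time

# Adjust <database_id> to match your own tree's directory.
path = os.path.expanduser("~/.gramps/grampsdb/<database_id>/sqlite.db")
con = sqlite3.connect(path)
start = time.perf_counter()
rows = con.execute(
    "SELECT gramps_id FROM person WHERE surname = 'Smith'"
).fetchall()
print(len(rows), "rows in", time.perf_counter() - start, "seconds")
con.close()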

Can you elaborate on the process of hacking the cache size for SQLite?

And specify a cache size value that you’d like us to test on different hardware?

As I said: a few years ago that was a new upper mid-range office machine with 16 GB RAM, the database on an SSD (BSDDB, SQLite), Win10/64, and the MS virus scanner (I don’t know what they call it today). So it’s a bit unclear to me where the differences should come from. No cloud drive, no HDD, nothing of that kind.

I guess speed comparisons only make sense on one machine; comparing different machines is always problematic. But at the end of the day we come to the same conclusion: Gramps is much slower than RM, which appears to depend on the different ways the frontend communicates with the database backend. And that was the point of the whole discussion, at least in my understanding.

1 Like

I currently do not have a 5.2.2 on SQLite, only on PostgreSQL. With the 100k dataset it’s 0.004 sec for the first 200 rows in DBeaver, and I’d expect SQLite to be in the same range. As I keep saying: even the Confuzius Challenge is no challenge at all for a modern database backend :wink:

In the prototype that I wrote 8 years ago, I demonstrated that using indexes significantly improved the performance of our filters.

Our filters currently loop through all objects of a given type in the database and then apply Python rules to them. What I did was to query the database using an index, and then run the remaining rules on a reduced list of objects.

The performance increase was achieved by using indexes. The same approach also works on older backends such as BSDDB.
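A minimal sketch of that two-stage idea, using a hypothetical rule list rather than the real Gramps filter API: let the database narrow the candidates via an indexed column, then run the remaining Python rules on the reduced list only.

import sqlite3

def find_matches(con, surname, python_rules):
    # Stage 1: the database uses its index to shrink the candidate set.
    candidates = con.execute(
        "SELECT handle, gramps_id FROM person WHERE surname = ?", (surname,)
    ).fetchall()
    # Stage 2: apply the remaining, Python-only rules to that reduced list.
    return [row for row in candidates
            if all(rule(row) for rule in python_rules)]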

Deep Connections has a clear problem with its algorithm, which is known, and proven by the fact that the Consanguinity Gramplet is much faster, and so is the algorithm in GeneWeb (used by Geneanet).

1 Like

I just did another test with 5.1.6 on Windows 11, on my desktop, with 16 GB RAM and SSD, again with SQLite, and the search was indeed faster than on my laptop, with 37 seconds, instead of 65.

Linux scores are 20 vs. 45.

Insert the following line after the connect statement at line 103 in the gramps/plugins/db/dbapi/sqlite.py file.

self.__connection.execute("PRAGMA cache_size = -8000")

The default value is -2000; a negative value is in KiB, so that is roughly 2 MB of cache, and -8000 requests roughly 8 MB.
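For anyone who wants to experiment before editing the plugin, a throwaway sketch to check what a given value does on an existing database (the setting only affects the current connection, so to matter for Gramps it has to be issued by Gramps itself):

import os
import sqlite3

# Adjust <database_id> to your own tree's directory.
path = os.path.expanduser("~/.gramps/grampsdb/<database_id>/sqlite.db")
con = sqlite3.connect(path)
print("page_size:", con.execute("PRAGMA page_size").fetchone()[0])
print("cache_size before:", con.execute("PRAGMA cache_size").fetchone()[0])
con.execute("PRAGMA cache_size = -8000")  # negative value: size in KiB, so ~8 MB
print("cache_size after:", con.execute("PRAGMA cache_size").fetchone()[0])
con.close()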

2 Likes

Well – that’s exactly my point. The frontend code has to be adjusted to benefit from modern database backends. Your prototype basically reduces the dataset your frontend Python code has to work on by a certain factor. If that factor is big enough, then you end up in a situation equivalent to the 1k (or even smaller) dataset in my table, and we all know that this is no problem.

An alternative solution would of course be that your Python code doesn’t need to apply any rules at all, because the database backend has already done the work: the question it was asked was so specific that it sent back exactly 100 % of the correct results instead of 100+ %. In that 100 % situation your Python code does not need to apply any rules to anything and will therefore be faster per se.

In the long run, this alternative solution will in my opinion always be the more robust one, since it fully leverages the power of the backend. Your solution, which is basically a sharing of work between the backend and the frontend, always depends on the quality of the question the backend is asked. This is not only a problem of an uneducated user basically asking for “SELECT * FROM …”; it also depends on the data itself. Just imagine a database full of the name “Smith”. If the backend is asked “SELECT * FROM … WHERE name = ‘smith’” with no other condition, then it will simply throw all its rows at the frontend, and the frontend has to sift through them. Therefore I’d always try to implement this alternative solution, because it’s the more robust one.
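To illustrate the point with hypothetical columns (not the current Gramps schema): the more of the filter that can be expressed in the question, the less the frontend has to sift through afterwards.

# Broad question: the backend returns every Smith and Python does the real work.
broad = "SELECT * FROM person WHERE surname = 'Smith'"

# Specific question: the backend returns (close to) exactly the matching rows,
# so the Python side has little or nothing left to filter.
specific = ("SELECT * FROM person "
            "WHERE surname = 'Smith' "
            "AND birth_year BETWEEN 1800 AND 1850 "
            "AND birth_place_id = 42")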

1 Like

This was also discussed 8 years ago.

We already have a framework of rules for filters that people generally like. The Isotammi project has even extended the filter editor and added new filters of their own. Other people have also written custom rules, some of which are quite complex.

Rules written in Python have the benefit that our developers already know Python. Do we want to expose SQL (or some other language) outside of the database layer? The GQL language written by @DavidMStraub could provide an option that is not specific to a particular backend. We really wouldn’t want several versions of the same rule (perhaps users would like to use ArangoDB in the future to take advantage of its graph capabilities).

A new framework could always be written alongside the existing filter editor. We could always use both for a while. Creating a new framework takes time and you would need to get developers interested.

My prototype provided a quick solution that could use our existing database. So why wasn’t it implemented? There actually wasn’t much interest. Most of our users have trees with under 20,000 people and don’t suffer from performance issues. I also fall into this category, but I sympathise with users who have larger trees.

2 Likes

Ok, valid arguments! I have no problem at all accepting them.

This would of course mean that Gramps is permanently locked into a user base with only medium-sized trees and will never be able to accept larger trees or the Confuzius challenge. I’m not sure the development team really thinks that this should be the future of Gramps once everybody fully understands the situation. It sounds a bit like permanently throwing away my honor as a Python developer :wink:

1 Like

How about swapping in @kku’s Filter+ gramplet variation for 5.3 and inviting people to do some testing with a few sample Trees? The project could gather some statistics from a wide variety of hardware configurations.

Also, perhaps you could add a cache scaling preference, so users could have Gramps incrementally scale the cache when the record count exceeds a certain base number?
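A rough sketch of what such a preference could amount to (entirely hypothetical, not existing Gramps code): derive a cache size from the number of person records and feed it to the PRAGMA mentioned earlier.

def scaled_cache_kib(person_count, base_kib=2000, per_10k_kib=1000, cap_kib=65536):
    """Return a SQLite cache size in KiB: start at the default (~2 MB) and
    add a fixed amount per 10,000 person records, up to a cap."""
    extra = (person_count // 10_000) * per_10k_kib
    return min(base_kib + extra, cap_kib)

# e.g. connection.execute("PRAGMA cache_size = -%d" % scaled_cache_kib(635_000))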

We have a LOT of filter rule requests in MantisBT listed in another thread. They could be a good project for the community to work on as experiments for boosting query performance.

I think Nick’s alternative is a good first step…

Going from BLOBs to JSON in the tables, utilizing the JSON features that are in SQLite and PostgreSQL, as a first step, and including at least some indexes on those tables… and maybe even some views…

If that works, it would be a lot easier for someone to convert some of the Python code to functions and/or stored procedures and see if that speeds things up even more.

I am no Python developer, but if the next step was to change/convert the Python code used for rules and filters into a function in SQL, I do think it would be possible and relatively easy to make a GUI and a transcoder that lets people write their filters in Gramps and push a button to transcode them to SQL, JSON, or the language used by a selected multi-model database…

Today’s relational databases have built-in support for JSON documents, so I really don’t think it will matter much whether Gramps uses a relational approach, e.g. lots of tables with lots of joins, or a JSON document approach with fewer joins.

And by just changing from BLOBs to JSON, that data will actually be open to anyone who wants to query it directly…

There is not a huge difference between querying a JSON table in SQLite or PostgreSQL and querying a relational-type datastore.
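A small sketch of that idea with SQLite’s built-in JSON functions (hypothetical table, not the current Gramps schema): store each object as a JSON document and put an index on an extracted field, so the backend can filter on it directly.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE person_json (handle TEXT PRIMARY KEY, data TEXT)")
# Expression index on the extracted surname, so the query below can use it.
con.execute("CREATE INDEX idx_person_surname "
            "ON person_json (json_extract(data, '$.surname'))")
con.execute("INSERT INTO person_json VALUES (?, ?)",
            ("H0001", '{"gramps_id": "I0001", "surname": "Smith", "given": "Anna"}'))
rows = con.execute(
    "SELECT json_extract(data, '$.gramps_id') FROM person_json "
    "WHERE json_extract(data, '$.surname') = ?", ("Smith",)
).fetchall()
print(rows)  # [('I0001',)]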

I am usually the one with the big “demands” here, using Gramps way outside what it was meant for, and even I see the benefit of starting with the BLOBs: utilizing the JSON feature in SQLite, then eventually changing the SQL queries that need to be changed for backends like PostgreSQL and ArangoDB or OrientDB or similar systems.

It will even be possible to write sync functions from sqlite to other databases with this one change.

And to be honest, with the little I have played with SQLite, PostgreSQL and MS SQL Server, I think this little change will do a lot for those with larger datasets, regardless of whether it is people, places, events or any other object that is in focus, as long as the data is stored using the JSON functions of the database engine and there are indexes on those tables.

Regardless, when that work is done, tweaking the database and the way Gramps queries it can be improved one step at a time, as people see what could be better.

I tested an XML import with approx. 300k place names in a hierarchy a few years ago, and I found that I had to go down to approx. 80k place names for Gramps to populate the view within a reasonable time. I don’t remember exactly, but I think it made the Place list browsable after approx. 20-28 seconds; the problem was that every time I changed anything, it took another 20-28 seconds or more before I could use Gramps again.
In this test I only had approx. 3,000 people and 200 sources; I don’t remember the citation numbers, because it was a test for a place database.

The 300k place name database was near impossible to import and use, and the CSV import took so long that I had to kill it.

All this was done on Windows 11 Pro with an AMD 3900X CPU, 64 GB ECC RAM, and two NVMe PCIe 3.0 1 TB SSDs, one with the system and software, the other with databases and documents. I tested the import with the CSV and XML files on the separate SSD, on the system SSD, on a SATA-3 SSD, and on a 7200 rpm HDD.
I stopped using the HDD because it was near impossible to get the import done, even for the 80k place names.

I imported both to SQLite and to a local MongoDB, running the databases on the data SSD. There was not much difference in speed, although the MongoDB API had a slightly different response pattern than SQLite… nothing significant, though.
One of the problems with the MongoDB solution was that the driver was extremely outdated and known to have some performance problems; whether those were triggered by the way Gramps used it, I don’t know.

My main tree is indeed smaller, and I don’t do many imports, but larger data sets are attractive to users like @StoltHD and @UlrichDemlehner, and I also use one as my reference tree, in which I can hardly use the Deep Connections Gramplet or run a Duplicate Persons check, because it makes Gramps unusable for hours.

And if the authors of Ancestral Quest, who also created PAF, could write fast imports 20 years ago, and the authors of RootsMagic can do the same, I see no reason why we can’t do similar things. That will require some rewriting of the GEDCOM import, so that it works in phases, like they do.
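A very rough sketch of what “working in phases” could mean (not how those programs actually do it): a first pass that only parses and stores records, and a second pass that resolves cross-references once every record exists, instead of chasing links while reading.

def import_records(records):
    """records: iterable of (xref, tag, payload) tuples from a parsed GEDCOM."""
    by_xref = {}
    # Phase 1: create every object without following any links yet.
    for xref, tag, payload in records:
        by_xref[xref] = {"tag": tag, "payload": payload, "links": []}
    # Phase 2: resolve @XREF@ pointers now that every target is known.
    for obj in by_xref.values():
        for token in obj["payload"].split():
            if token.startswith("@") and token.endswith("@") and token in by_xref:
                obj["links"].append(token)
    return by_xref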

And such rewrites are a natural part of any project, or at least they should be, as described here:

I actually don’t have a large database at all…

I have a lot of data for my research on the Norwegian Mercantile Fleet, but can’t do anything with it in Gramps, so I use Obsidian and other types of software.

The reason I did the place database stuff was because someone asked me for some historical place names years ago.

Personally, I am more interested in main/sub events and events on places, plus a change to the source and citation functions in Gramps, so that I can use external bibliography software without entering all the data 2-3-4-5 times.
Then I could use Gramps for the research I do.

But unlike some others, I can actually understand that there are people who wish to do more than one thing in Gramps, because the potential of Gramps as a historical research tool is huge…

1 Like

You mean that Obsidian (Obsidian (Software) – Wikipedia)? And I would be interested to learn about your research and why you cannot use Gramps for it, since I appear to be one of those guys. Since those topics do not really fit into this thread, I’d highly appreciate it if you share your thoughts by private mail or directly by mail to ulrich(dot)demlehner (at) gmail (dot) com.