Development towards the next version of Gramps continues (probably called 6.0). I thought I’d give an update and some additional details. I’ve written about some of this before.
The first set of changes was merged three weeks ago. It changed the internal database representation from what we call “blobs” to JSON data. The “blobs” are a binary format (Python pickle) of an array representation of the data. Blobs were a clever solution, used almost since the beginning of Gramps: they were fast and compact. However, they also had some serious problems: the pickle format changes over time (not ideal for genealogical data), and the array representation was hard to use and could lead to bugs.
The new JSON data is slower to convert to Python objects and occupies more space (disk and RAM). But it has many other advantages: the dictionary format is much more useful than the old array format, allowing many things to be done without turning the dict into a full Python object (e.g., a Person). It can also be queried directly in SQL and other systems, if we want.
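To make the difference concrete, here is a minimal sketch of the two styles. The field names and layout are illustrative, not the actual Gramps schema:

```python
import json
import pickle

# Old style: a pickled array ("blob") where *position* carries the meaning.
old_blob = pickle.dumps(["I0001", "Smith, John", 1])  # [gramps_id, name, gender]

# New style: a JSON dict where *keys* carry the meaning.
new_json = json.dumps({"gramps_id": "I0001", "name": "Smith, John", "gender": 1})

# Reading a field from the blob requires knowing the array layout by heart...
record = pickle.loads(old_blob)
gramps_id_old = record[0]  # index 0 is gramps_id -- easy to get wrong

# ...while the dict is self-describing.
record = json.loads(new_json)
gramps_id_new = record["gramps_id"]

print(gramps_id_old, gramps_id_new)
```

The dict form is what makes it possible to work with the raw data directly, and to hand it to SQL engines that understand JSON.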
The second set of changes (merged last week) adds a “wrapper” around the JSON-based dictionary representation so that instead of writing `person["gramps_id"]` you can write `person.gramps_id`. This is the same syntax one would use on the full Python object. Even better: if you need to access some property of the real Python object, the JSON dict will (basically) turn itself into one. This is sometimes called lazy evaluation, because the extra conversion happens only when it is needed.
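The wrapper idea can be sketched with Python's `__getattr__` hook. The class names here are illustrative stand-ins, not the actual Gramps implementation:

```python
class Person:
    """Stand-in for a full Gramps object with real methods."""
    def __init__(self, data):
        self.gramps_id = data["gramps_id"]
        self.surname = data["surname"]

    def display_name(self):
        return f"{self.surname} ({self.gramps_id})"


class DataDict(dict):
    """Wrap a JSON dict so keys read like attributes; build the full
    object lazily, only when a genuine method/property is needed."""
    def __getattr__(self, name):
        if name in self:
            return self[name]       # cheap: stay in dict-land
        obj = Person(self)          # lazy conversion to the real object
        return getattr(obj, name)   # delegate to it


p = DataDict({"gramps_id": "I0001", "surname": "Smith"})
print(p.gramps_id)       # served straight from the dict
print(p.display_name())  # triggers conversion to a full Person
```

Because `__getattr__` is only called when normal attribute lookup fails, the common case (plain data access) never pays the conversion cost.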
What’s next? There is a proposed Pull Request (PR) that fixes some bugs, cleans up the filter code, uses the new lazy evaluation above, and performs some optimization on the filter system. The fixes and cleanup are pretty straightforward, and the lazy evaluation will work on either real objects or JSON dicts. These steps will make the filters work faster in almost all cases.
But the optimizer step can allow huge speed-ups. It builds on a convention already used in some filters: a “prepare” step. Sometimes a filter will create a “map” of object “handles” representing all values that match the filter. Normally, we would then go through all of the data and check each item against the prepared data. The optimizer turns this inside out, and only considers the items that match the prepared data.
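The inversion is easiest to see in a toy sketch (illustrative data, not the real filter code):

```python
# A tiny stand-in database: handle -> person dict.
db = {
    "h1": {"gramps_id": "I0001"},
    "h2": {"gramps_id": "I0002"},
    "h3": {"gramps_id": "I0003"},
}

# The "prepare" step builds the set of handles that can possibly match.
prepared = {"h2"}

# Normal approach: walk *every* object and test it against the prepared set.
matches_scan = [h for h in db if h in prepared]

# Optimized ("inside out"): iterate only the prepared handles; the full
# table is never scanned, which is where the big speed-ups come from.
matches_opt = [h for h in prepared if h in db]

print(matches_scan, matches_opt)
```

When the prepared set is small compared to the full table (a few hundred handles versus 40k people), skipping the full scan makes a dramatic difference.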
How much faster? Here are some data testing the `IsRelatedWith` filter. Time is in seconds (smaller is better) on a table of 40k people (thanks @emyoulation); Gramps 6.0 refers to all of the changes described above.
| Scenario | Prepare Time | Apply Time | Total Time |
|---|---|---|---|
| Gramps 5.2 | 33.94 | 4.37 | 38.31 |
| Gramps 6.0 | 2.22 | 16.45 | 18.67 |
| Gramps 6.0, with orjson | 1.81 | 11.68 | 13.49 |
| Gramps 6.0, with orjson + SQL | 1.69 | 6.36 | 8.05 |
The final line uses proposed methods that can be implemented in SQL (or other databases). It rewrites part of the IsRelatedWith rule using a proposed PR that adds some select methods.
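The idea of pushing the selection into the database can be sketched with SQLite, which can query JSON text directly via `json_extract`. The table and column names here are hypothetical, not Gramps's actual schema:

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE person (handle TEXT PRIMARY KEY, json_data TEXT)")
people = [
    ("h1", {"gramps_id": "I0001", "gender": 1}),
    ("h2", {"gramps_id": "I0002", "gender": 0}),
]
con.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [(h, json.dumps(d)) for h, d in people],
)

# Let SQLite do the filtering instead of loading every object into Python.
rows = con.execute(
    "SELECT handle FROM person"
    " WHERE json_extract(json_data, '$.gender') = 1"
).fetchall()
print(rows)
```

The select never deserializes the non-matching rows in Python at all, which is the kind of saving that shows up in the last row of the table above.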
Some things of note:
- The Apply Time of Gramps 5.2 is the fastest. Converting the blobs to objects is fast.
- However, Gramps 5.2 also has the slowest Prepare Time by far. Converting a lot of data unnecessarily has its costs.
- Gramps 6.0 trades a much smaller Prepare Time for a larger Apply Time.
- Once we are using JSON in Gramps 6.0, there are lots of possibilities to make things faster:
  a. `orjson` is a fast JSON-string-to-dict converter
  b. SQL (applied correctly) can help reduce time dramatically
- The new changes (merged and proposed) make the system more than 200% faster.
- There are more possibilities to explore for further speed-ups. For example, the Apply Time of Gramps 6.0 (I think) should be able to be done instantly in some cases (like this one).
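For example, `orjson` (a third-party package, `pip install orjson`) can be used as a near drop-in replacement for the standard library's JSON decoder. A minimal sketch, with a stdlib fallback if it is not installed:

```python
import json

# Prefer orjson's fast decoder when available; the result is the
# same Python dict either way, just produced more quickly.
try:
    import orjson
    loads = orjson.loads
except ImportError:
    loads = json.loads

person = loads('{"gramps_id": "I0001", "surname": "Smith"}')
print(person["gramps_id"])
```

Because decoding happens for every row touched by a filter, even a modest per-call saving compounds across a 40k-person table.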
One thing I did learn: SQL is not magic. You can write SQL queries (using JOINs, for example) that are actually slower than equivalent Python code. Lots of things to explore!