Understanding Gramps 6.0

Development towards the next version of Gramps continues (probably called 6.0). I thought I’d give an update and some additional details. I’ve written about some of this before.

The first set of changes was merged three weeks ago. It changed the internal database representation from what we call “blobs” to JSON data. The “blobs” are actually a binary format (a Python pickle) of an array representation of the data. The blobs were a clever solution, used since almost the beginning of Gramps: they were fast and compact. However, they also had some serious problems: the pickle format changes over time (not ideal for genealogical data), and the array representation was hard to use and could result in bugs.

The new JSON data is slower to convert to Python objects, and it occupies more space (disk and RAM). But it has many other advantages: the dictionary format is much more useful than the old array format, allowing many things to be done without turning the dict into a full Python object (e.g., a Person). It can also be queried directly in SQL and other systems, if we want.
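To make the difference concrete, here’s a rough sketch (the field names and positions are just illustrative, not the exact Gramps serialization):

```python
import json
import pickle

# Old style: a pickled positional array ("blob"); readers had to know
# that, say, index 1 held the Gramps ID (positions here are made up).
blob = pickle.dumps(["a1b2c3", "I0001", 1])
handle, gramps_id, gender = pickle.loads(blob)[:3]

# New style: a JSON document with named fields, so partial reads don't
# require converting to a full Person object.
json_data = json.dumps({"handle": "a1b2c3", "gramps_id": "I0001", "gender": 1})
person = json.loads(json_data)
print(person["gramps_id"])  # fields accessed by name, not by position
```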

The second set of changes (merged last week) adds a “wrapper” around the JSON-based dictionary representation so that you don’t have to write person["gramps_id"] but can write person.gramps_id. This is the same syntax one would use for the full Python object. Even better, if you need to access some property of the real Python object, the JSON dict will (basically) turn into a real Python object. This is sometimes called lazy evaluation, because the extra conversion only happens when it is needed.
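To illustrate the idea (a rough sketch, not the actual wrapper code in Gramps; make_real_person stands in for the real dict-to-object conversion):

```python
class LazyPerson:
    """Sketch: serve attribute reads straight from the JSON dict, and
    only build the full Python object when something richer is needed."""

    def __init__(self, data):
        self._data = data   # the JSON-derived dict
        self._obj = None    # full Python object, built on demand

    def __getattr__(self, name):
        if name in self._data:                 # cheap path: read from the dict
            return self._data[name]
        if self._obj is None:                  # expensive path: build the real object
            self._obj = make_real_person(self._data)  # hypothetical helper
        return getattr(self._obj, name)

person = LazyPerson({"handle": "a1b2c3", "gramps_id": "I0001"})
print(person.gramps_id)  # served from the dict; no full-object conversion
```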

What’s next? There is a proposed Pull Request (PR) that fixes some bugs, cleans up the filter code, uses the new lazy evaluation above, and performs some optimization on the filter system. The fixes and cleanup are pretty straightforward, and the lazy evaluation will work on either real objects or JSON dicts. These steps will make the filters work faster in almost all cases.

But the optimizer step can allow huge speed-ups. It builds on a convention already used in some filters: a “prepare” step. Sometimes, a filter will create a “map” of object “handles” representing all values that match the filter. Normally, we would go through all of the data and just check each item to see if it matches the prepared map. The optimizer turns this inside out, and only considers the items that are in the prepared map, as sketched below.
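Schematically, the inversion looks something like this (an illustrative sketch, not the actual optimizer code):

```python
# Illustrative sketch of the optimizer's inversion (not the actual Gramps code).
prepared = {"h2", "h5"}  # handles collected by a filter's "prepare" step
table = {f"h{i}": {"handle": f"h{i}"} for i in range(1, 40_000)}  # fake person table

# Normal apply: walk the whole table and test each row against the map.
slow = [row for handle, row in table.items() if handle in prepared]

# Optimized apply: walk only the prepared handles and fetch just those rows.
fast = [table[h] for h in prepared if h in table]

assert {r["handle"] for r in slow} == {r["handle"] for r in fast}
```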

How much faster? Here are some data from testing the IsRelatedWith filter. Times are in seconds (smaller is better) on a table of 40k people (thanks @emyoulation); “Gramps 6.0” refers to all of the changes described above.

Scenario                         Prepare Time   Apply Time   Total Time
Gramps 5.2                           33.94          4.37        38.31
Gramps 6.0                            2.22         16.45        18.67
Gramps 6.0, with orjson               1.81         11.68        13.49
Gramps 6.0, with orjson + SQL         1.69          6.36         8.05

The final row uses proposed methods that can be implemented in SQL (or other databases). It rewrites part of the IsRelatedWith rule using a proposed PR that adds some select methods.

Some things of note:

  1. The Apply Time of Gramps 5.2 is the fastest. Converting the blobs to objects is fast.
  2. However, Gramps 5.2 also has the slowest Prepare Time by far. Converting a lot of data unnecessarily has its costs.
  3. Gramps 6.0 trades a much lower Prepare Time for a higher Apply Time.
  4. Once we are using JSON in Gramps 6.0, there are lots of possibilities to make things faster:
    a. orjson is a fast JSON-string-to-dict converter (see the snippet after this list)
    b. SQL (applied correctly) can help reduce time dramatically
  5. The new changes (merged and proposed) make the system more than four times as fast overall (38.31 s down to 8.05 s)
  6. There are more possibilities to explore for further speed-ups. For example, the Apply Time of Gramps 6.0 (I think) should be able to be done instantly in some cases (like this one).
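Regarding orjson (item 4a): it is essentially a drop-in replacement for the standard json.loads() in this use. A minimal illustration:

```python
import json
import orjson  # pip install orjson

raw = b'{"handle": "a1b2c3", "gramps_id": "I0001"}'

# Same resulting dict; orjson is typically several times faster at this.
assert orjson.loads(raw) == json.loads(raw)
```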

One thing I did learn: SQL is not magic. You can write some SQL queries (using JOINs, for example) that are actually slower than the equivalent Python code. Lots of things to explore!

12 Likes

This is so exciting!

There’s been so much work by so many contributors that it is impossible to thank people properly. But still, please accept our gratitude.

4 Likes

I only understand half of what you wrote, but it sounds great and I’m looking forward to it.
Thanks!

2 Likes

When you say “JSON + SQL”, do you mean PostgreSQL’s proprietary JSON support in SQL, like WHERE person->>'name' = 'Peter' for example? What about SQLite? I just noticed it also has some JSON SQL support. I haven’t checked both for syntax compatibility, though. I have only worked with PostgreSQL and JSON in SQL myself, and have no dev experience with SQLite.

Whether to use it should be considered carefully, since it’s not part of the ANSI standard.

You are right that the JSON syntax in different SQL engines can vary. Syntax like json_data->>'$.primary_name.first_name' can be expressed differently. Some thought was given to allowing different engines to adapt, but SQLite is the official API in core Gramps; the others are addons. There are also some other differences between PostgreSQL and SQLite. See the addon for details.
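For example, roughly (a sketch using the field names from the new schema; it assumes json_data is stored as TEXT in SQLite and as json/jsonb in PostgreSQL):

```sql
-- SQLite (3.38+): path-style access on JSON text
SELECT handle FROM person
WHERE json_data->>'$.primary_name.first_name' = 'Peter';

-- PostgreSQL: chained operators on a json/jsonb column, no "$." path
SELECT handle FROM person
WHERE json_data->'primary_name'->>'first_name' = 'Peter';
```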

Will there be a Compatibility note in the code about possible syntax conflicts?

And a link to a Gramps technical article about the most efficient way to test for the database backend in Gramps, with a sample workaround? Like a PostgreSQL variant for when not using SQLite.

I (personally) don’t know what the possible syntax conflicts might be. There are probably syntax differences across various versions of SQLite and PostgreSQL. When I wrote the initial PostgreSQL addon, it was trial and error.

We don’t currently have much of a SQL abstraction mechanism in Gramps. At some point we might want to use an ORM (Object-Relational Mapper) like SQLAlchemy to get away from having to deal with such issues ourselves.

In any event, we haven’t put any JSON access in Gramps yet. The initial PR is still under development.

2 Likes

I thought I had seen (but cannot now find) an earlier posting with a link to an upgraded version of the Gramps “example” database. Is there one that we can play with, i.e. query directly in SQLite, just to get familiar with its JSON SQL functions?

We have not come up with an effective way to submit expansion XML snippets for the example.gramps.

Here are the MantisBT issues:

  • 0012394: [Example.gramps] changes needed for v5.3
  • 0010976: [Example.gramps] should have an example of the “Name editor” choice of (“Display As”)
  • 0012203: [Example.gramps] Example person cited in Display Name Editor is not…
  • 0009891: [Example.gramps] should have custom place types
  • 0007798: [Example.gramps] should have more examples of “calendar” types
  • 0010480: [Example.gramps] When installing Gramps, make the sample files in a more accessible place for users.

I will look through Discourse and slowly add links. (And then paste into the MantisBT issue)

That’s a good idea. I was going to attach one here, but .zip files aren’t allowed. I’ll try to figure out where to put it…

.gramps and .gpkg file attachments ARE allowed.

Or you can post XML source code for import via the Import Text gramplet. (Source on Discourse has a copy codeblock button enabled in the upper right of the scrollable code.)

In an example of sharing a custom filter XML definition, we add the xml annotation after the codeblock’s opening 3 backticks so that the color coding is appropriate. (I like to wrap them in <small></small> HTML tags, but there must be an extra linefeed between.)

.sql is now allowed, but the file size limit for attachments is 8 MB.

3 Likes

Note: this is the database structure of the current Gramps code. The final database structure could still change before Gramps 6.0 is released.

Here is a subset of the “Example” database in Gramps, in SQLite format. It contains people related to the home person who were born after 1920 (209 people in total).

The person table columns are:

handle, given_name, surname, json_data, gramps_id, gender, 
    death_ref_index, birth_ref_index, change, private

The single change is dropping “blob_data” and adding “json_data”, and the same change is made for all tables.

For exploring in sqlite3, just:

sqlite3 small-example.sql

Some things to play with:

sqlite> SELECT json_data->>'$.handle' FROM person WHERE json_data->>'$.primary_name.surname_list[0].surname' = 'Garner';

If your SQLite doesn’t support the json_data->>'$.handle' syntax (the ->> operator was added in SQLite 3.38), you can use json_extract(json_data, '$.handle') instead.
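json_each() is also worth playing with; it unpacks a JSON array into one row per element. A sketch against the schema above, listing every surname in the table:

```sql
SELECT p.gramps_id, s.value->>'$.surname' AS surname
FROM person p, json_each(p.json_data, '$.primary_name.surname_list') s;
```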

I’d be interested in SQL statements that are generally useful and would save quite a bit of time over Python code. For example, I tried finding ancestors of a person in pure SQL but failed. (Chatbots also failed).

small-example.sql (1.9 MB)

Does this help at all?
SQL Server recursive query - Stack Overflow

CTEs appear to be supported in SQLite
The WITH Clause
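Something along these lines might be a starting point against the sample database above (an untested sketch: it assumes SQLite 3.38+ for the ->> operator, that the JSON carries parent_family_list, father_handle, and mother_handle fields, and a hypothetical start person I0044; the generation cap guarantees termination even if the data contains a cycle):

```sql
WITH RECURSIVE ancestor(handle, generation) AS (
    SELECT handle, 0
    FROM person
    WHERE gramps_id = 'I0044'                -- hypothetical start person
    UNION                                    -- UNION (not UNION ALL) drops duplicate rows
    SELECT parent.handle, a.generation + 1
    FROM ancestor a
    JOIN person child ON child.handle = a.handle
    CROSS JOIN json_each(child.json_data, '$.parent_family_list') pf
    JOIN family f ON f.handle = pf.value
    JOIN person parent
      ON parent.handle IN (f.json_data->>'$.father_handle',
                           f.json_data->>'$.mother_handle')
    WHERE a.generation < 30                  -- depth cap guards against infinite loops
)
SELECT * FROM ancestor;
```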

1 Like

I learned about them through a chatbot, but only got infinite loops.

I used to do things like that with the old “SQLite Export” version of the database. I’ll work on some examples using the new schema; thanks for sharing the sample data. It will take me some time to get familiar with the JSON functions.

With SQLite Export, I found it helpful to create a bunch of views so that I didn’t have to keep repeating a lot of the underlying joins as I wrote new queries. For example, I would have a “person_birth_parent” view that returned the birth parents for each person, and I could use that in a recursive query to build an Ahnentafel.

I imagine it will be useful to create views with the new schema as well. It would be good if there were a way to share them via “addons” that would simply add the view definitions to the database. And maybe there could be a generic filter rule that lets you point to a saved view.

1 Like

Yes, those are the kinds of building blocks that I’d like to make for a proposed “business logic” layer. They would be there and available if the database supported them, but with a fallback to Python.
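Roughly this shape (a sketch of the fallback idea; select_parent_handles is a hypothetical name, not a real Gramps API, while the fallback path uses the standard object traversal):

```python
def birth_parent_handles(db, person_handle):
    """Return the handles of a person's parents, preferring a native
    database query when the backend provides one."""
    if hasattr(db, "select_parent_handles"):      # hypothetical fast path in SQL
        return db.select_parent_handles(person_handle)
    # Portable Python fallback using the regular Gramps object API:
    handles = []
    person = db.get_person_from_handle(person_handle)
    for family_handle in person.parent_family_list:
        family = db.get_family_from_handle(family_handle)
        for h in (family.father_handle, family.mother_handle):
            if h:
                handles.append(h)
    return handles
```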

That’s an interesting idea. We’re trying to keep a lot of this written in a generic fashion, so that DBs like MongoDB and PostgreSQL will be functional.

For those queries that are “simple” (no JOINs, using standard SELECT, WHERE, and ORDER BY), I’m hoping that the proposed db.select_from_person() PR can be useful.

Meanwhile, I’ll share here an example of a recursive query using the SQLite Export.

First, I renamed all of the provided tables by prefixing them with “gramps_” so that I could distinguish them from tables and views that I created. Second, I created a table “parameters” to contain some key/value pairs, as you’ll see later.
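(For reference, the “parameters” table is just a two-column key/value store, something like:)

```sql
CREATE TABLE parameters (key TEXT, value TEXT);
-- e.g. find the ancestors of the person with this Gramps ID
-- ('I0044' is just a placeholder value):
INSERT INTO parameters VALUES (':ancestors-of', 'I0044');
```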

Here is a view “person_birth_parent” that I use later in the recursive query “ancestors”:

CREATE VIEW person_birth_parent AS
SELECT DISTINCT child_person.handle AS person_handle, child_person.gid AS person_gid, child_person.gender AS person_gender,
father_person.handle AS birth_parent_handle, father_person.gid AS birth_parent_gid, father_person.gender AS birth_parent_gender
FROM gramps_child_ref child_ref
JOIN gramps_person child_person ON child_ref.ref = child_person.handle
JOIN family_child_ref family_child_ref ON child_ref.handle = family_child_ref.child_ref_handle
JOIN gramps_family family ON family_child_ref.family_handle = family.handle
JOIN gramps_person father_person ON family.father_handle = father_person.handle
WHERE child_ref.frel1 = 'Birth'
UNION ALL
SELECT DISTINCT child_person.handle AS person_handle, child_person.gid AS person_gid, child_person.gender AS person_gender,
mother_person.handle AS birth_parent_handle, mother_person.gid AS birth_parent_gid, mother_person.gender AS birth_parent_gender
FROM gramps_child_ref child_ref
JOIN gramps_person child_person ON child_ref.ref = child_person.handle
JOIN family_child_ref family_child_ref ON child_ref.handle = family_child_ref.child_ref_handle
JOIN gramps_family family ON family_child_ref.family_handle = family.handle
JOIN gramps_person mother_person ON family.mother_handle = mother_person.handle
WHERE child_ref.mrel1 = 'Birth';

The columns “frel1” and “mrel1” mentioned in the query above aren’t populated by the SQLite Export; I had to hack it like this:

# handle, ref, frel[0], frel[1],
handle, ref, frel[0], str(ChildRefType(frel)),
# mrel[0], mrel[1], private)
mrel[0], str(ChildRefType(mrel)), private)

See this issue for details.

Finally, here is the recursive query “ancestors” (the WHERE clause looks for a key/value pair in my “parameters” table to get the Gramps ID of the person whose ancestors to find):

CREATE VIEW ancestors AS
WITH RECURSIVE ancestor AS
(
SELECT person_handle AS root_person_handle, 1 AS generation, birth_parent_handle AS person_handle, birth_parent_gid AS person_gid, birth_parent_gender AS person_gender
FROM person_birth_parent person_birth_parent
JOIN parameters parameters ON person_birth_parent.person_gid = parameters.value
WHERE parameters.key = ':ancestors-of'
UNION ALL
SELECT ancestor.root_person_handle, ancestor.generation + 1, birth_parent_handle, birth_parent_gid, birth_parent_gender
FROM person_birth_parent person_birth_parent
JOIN ancestor ancestor ON person_birth_parent.person_handle = ancestor.person_handle
)
SELECT DISTINCT *
FROM ancestor;

2 Likes

Maybe this is the next step?
This type of information will be relatively easy to extract from a Network Graph table in both SQLite and other relational databases, and even easier if the idea of support for multi-model databases gets realized.

SQLite even has an addon for this called “SQLite Graph Extension”.
PostgreSQL uses Apache AGE for this.

Utilizing those addons would most likely speed up those types of queries a lot, especially in bigger databases with a lot of relations…


Just an idea…

2 Likes

It occurs to me that the ability to query the database directly, even being able to access all of the JSON data, still doesn’t make certain things available, such as the descriptions associated with the numeric values for all of the various types, since those aren’t stored in the database.

So I think there is still a need (or a desire, on my part at least) for the SQLite Export, which itself lacks a few things (see feature requests 0012620, 0012765, 0012766).

(Even the Gramps XML export has some inconsistencies – it provides the descriptions, but not the numeric values, for the various types, while it provides only the numeric constants for things like the citation confidence level.)

But having said all that, it’s great to be able to query more than I could before!

1 Like