Gramps, Next Generation

Developers,

I think the most important question for the next version of Gramps is: how do we allow Gramps to take advantage of JSON for large database speedups? Now that the JSON format is settled in Gramps 6.0, I think it is time to start planning.

This question implies the following constraints:

  1. The code needs to allow implementations in non-SQL backends
  2. It should be robust and maintainable, and not too much code
  3. It should not introduce non-Pythonic interfaces
  4. It should not “look like SQL” to the developer

(This list is a summary of discussions about this topic over at least the last 7 years. These are not just my preferences, but those from the developer community.)

I proposed the idea in a bit of detail (shown below) in January 2025 on the Gramps Developers mailing list.

There wasn’t much feedback, but Tim Lyons asked:

I’m not entirely sure how much something like this would actually be used in the main Gramps code.

I’ve been working on this for the last few weeks, and I think I have figured out the tricky parts, not just in the implementation, but in how it must be properly used in the Gramps environment. The first big integration hurdle is that:

  • You can’t use low-level database access for speed, if you are using a Gramps proxy

A “proxy” in Gramps terminology is a Python wrapper around a database. It does things like remove sensitive detail if an object is marked Private.

On the other hand, an IsPrivate filter is very simple. Here is a simplified example:

    def apply_to_one(self, db, person) -> bool:
        return person.private

It says “include the object in the selected items if person.private is True,” and it works even if there are Proxies in place (Proxies can stack on top of one another). So, how can we make that code any faster? Well, the filter requires going through each and every Person in the database to check whether they are private. That is slow because it happens in Python. If we could move the select part into, say, SQL, then we could do the sweep over the database in SQL rather than in Python (fast!).

Ok, here is the IsPrivate filter using the proposed API (simplified for demonstration):

    def prepare(self, db):
        if db.can_use_fast_selects():
            self.selected_handles = db.select_from_person(
                what="person.handle",
                where="person.private",
            )

    def apply_to_one(self, db, person) -> bool:
        if db.can_use_fast_selects():
            return person.handle in self.selected_handles
        else:
            return person.private

Notes:

  1. The db.can_use_fast_selects() call checks two things: is select_from_table implemented in this database backend, and are we not using a proxy? If both are true, then we can use select_from_person().
  2. If it can’t use fast selects, then it defaults to the old implementation
  3. The what and where parameters are strings of Python code. You can use most of the simple Python expressions that you would write regularly.
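
Since the what and where arguments are plain Python expression strings, one way such strings could be turned into SQL is with the standard ast module. The following is only an illustrative sketch of the idea, assuming a `json_data` column and SQLite’s `json_extract()`; it is not the actual Gramps translator:

```python
import ast

def expr_to_sql(expr: str, table: str) -> str:
    """Translate e.g. "person.private" into a json_extract() predicate."""
    tree = ast.parse(expr, mode="eval")

    def walk(node):
        if isinstance(node, ast.Attribute):
            # person.private -> json_extract(json_data, '$.private')
            path = []
            while isinstance(node, ast.Attribute):
                path.append(node.attr)
                node = node.value
            assert isinstance(node, ast.Name) and node.id == table
            return f"json_extract(json_data, '$.{'.'.join(reversed(path))}')"
        if isinstance(node, ast.Compare):
            op = {ast.Eq: "=", ast.NotEq: "!="}[type(node.ops[0])]
            return f"{walk(node.left)} {op} {walk(node.comparators[0])}"
        if isinstance(node, ast.Constant):
            return repr(node.value)
        raise NotImplementedError(ast.dump(node))

    return walk(tree.body)

print(expr_to_sql("person.private", "person"))
# -> json_extract(json_data, '$.private')
print(expr_to_sql("person.gender == 1", "person"))
# -> json_extract(json_data, '$.gender') = 1
```

A real translator would of course cover more node types (boolean operators, function calls, and so on), but the attribute-path-to-JSON-path mapping is the core of the trick.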

But how does this prevent us from sweeping over the entire database? Because Gramps 6.0 introduced the filter optimizer, which knows that if you set selected_handles then you only need to look at those handles, and nothing more. For many filters that only match a few objects, the speedups are large. And by moving the selected_handles computation into SQL, the speedup is truly amazing.
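
The optimizer idea can be sketched in a few lines. This is a hypothetical simplification, not the actual Gramps 6.0 optimizer code: if any rule provides selected_handles, the candidate set shrinks to the intersection of those handle sets; otherwise the filter must sweep everything.

```python
def candidate_handles(all_handles, rules):
    """Intersect any precomputed handle sets; otherwise sweep everything."""
    selected = None
    for rule in rules:
        handles = getattr(rule, "selected_handles", None)
        if handles is not None:
            selected = set(handles) if selected is None else selected & set(handles)
    # No rule could preselect, so every object must be visited.
    return set(all_handles) if selected is None else selected

class Rule:
    def __init__(self, handles=None):
        if handles is not None:
            self.selected_handles = handles

fast = Rule({"h1", "h2"})   # this rule ran a fast select in prepare()
slow = Rule()               # this rule has no precomputed handles
print(sorted(candidate_handles({"h1", "h2", "h3", "h4"}, [fast, slow])))
# -> ['h1', 'h2']
```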

One additional nice feature: because the what, where, and order_by arguments can all take strings of Python syntax, this is very easy to use in Gramps Web. In fact, with this, Gramps Web can be much faster because it, too, would not need to sweep over the entire set of objects. (Care has been taken to support different dialects of SQL, including PostgreSQL.)

@Nick-Hall, there are a lot of details to discuss, but please keep an open mind on the basic proposal. Any detail can of course be changed. In addition, I think “business logic” functions can be implemented using the select_from_table API.

Please let me know if you have questions or comments!

6 Likes

Thanks for this initiative. This one surprised me:

Why? Will keeping this flexibility not introduce a lot of complexity? What would the benefit be? I’d actually turn it around and say it would be good to commit fully to SQL and exploit both the JSON and SQL structure for speedup. Or did I misunderstand this point?

As you well know, genealogy records are hierarchical, often stored as trees with complex relationships.
They are not transactional relational data; they are object‑relational data.
This means that a non‑SQL database backend is naturally better suited for handling such structures than any SQL‑server implementation.
Genealogy data benefits from flexible document or graph storage, where relationships and hierarchies can be represented directly, rather than forced into rigid transactional tables.

But if the DB‑API fully supports SQL, it will also be possible to extend support to other types of database backends.
This flexibility ensures that Gramps can take advantage of both SQL and non‑SQL approaches, depending on the nature of the data and the performance requirement.

This point came from discussions with the community of developers over the last few years, mostly before Gramps Web ever existed, around the time that Gramps 5.0 was in development. There was a concern from some developers that the Gramps API should not commit to being SQL-biased, either in syntax or semantics. There are many Python ORMs that can query SQL databases, but the consensus was against those, and I did not want to get bogged down in that debate again.

In any event, there is nothing in this PR that requires SQL, or that would even be better if we committed to only using SQL. Hence the conversion from Python syntax in strings to SQL. (This PR also includes a conversion from lambda expressions to strings. But that requires dropping down to bytecodes, and doesn’t gain as much as I thought.)

Hope that helps!

1 Like

As we have moved away from the blob structures of previous versions of Gramps and now expose that data via JSON to the database engine (SQL), we actually have a hybrid system. The JSON structures in each table are not flat, but, with a PR like this one, direct selection and comparison of nested items becomes possible.

Currently there isn’t a NoSQL engine that operates as nicely as SQLite. But nothing in this PR prevents, say, MongoDB from being used today. And nothing prevents a complete project migration to NoSQL in the future.

I know…
…that’s why I wrote that if Gramps can support full SQL and perhaps move some of the serializing logic from the frontend Python code to the backend database engine, sending dicts and objects back to the frontend and letting the DB‑API control the process, then when someone finds the time or need, it will be easier to create a DB‑API for any other type of database backend.

But I cannot tell you how to do that, because I am not a programmer or developer; I only understand the logic and see how features can help improve the workflow.
Even though I would really like to see Gramps in a multi‑model database backend, I know that it’s important to first get a good implementation with SQLite before adding support for other types of backends.
There is already an experimental advanced DB‑API for PostgreSQL that I found and linked in a post a few days ago; maybe it could be extended further and also used as a template for SQLite (you may already know about that repo).
I don’t know, I just wish for Gramps to reach its full potential as a lot more than “just another family tree software.”

Many SQL database engines know how to access JSON items, so there is no need for any additional “serialization” of values. Now, that doesn’t help too much with complex items (such as Gramps Dates), but you can do a lot with the JSON structure and values.
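
As an illustration of what “the engine can access JSON items” means, here is a tiny self-contained SQLite session. The handle/json_data table layout mirrors the idea described here but is an assumption, not the exact Gramps schema:

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE person (handle TEXT PRIMARY KEY, json_data TEXT)")
people = [
    ("h1", {"private": 1, "gender": 1}),
    ("h2", {"private": 0, "gender": 0}),
]
con.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [(h, json.dumps(d)) for h, d in people],
)

# The sweep happens inside SQLite, not in a Python loop:
rows = con.execute(
    "SELECT handle FROM person WHERE json_extract(json_data, '$.private') = 1"
).fetchall()
print(rows)  # -> [('h1',)]
```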

As an example, consider this complex selection:

    db.select_from_person(
        what="[eref.role.value for eref in person.event_ref_list if eref.role.value == EventRoleType.PRIMARY]",
        where="any([eref for eref in person.event_ref_list if eref.role.value == EventRoleType.PRIMARY])"
    )

This is largely what you would need to write in regular Gramps (if you avoided using Gramps objects). However, it is converted into SQL. Or could be converted into a future graph DB.
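
To make “converted into SQL” concrete: SQLite’s json_each() table-valued function can expand a nested JSON list into rows, which is roughly what a list comprehension over person.event_ref_list needs. This is a hedged sketch; the schema and the numeric role code standing in for EventRoleType.PRIMARY are assumptions:

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE person (handle TEXT, json_data TEXT)")
con.execute(
    "INSERT INTO person VALUES (?, ?)",
    ("h1", json.dumps({"event_ref_list": [
        {"role": {"value": 1}},   # pretend 1 == EventRoleType.PRIMARY
        {"role": {"value": 5}},   # some non-primary role
    ]})),
)
rows = con.execute(
    """
    SELECT json_extract(eref.value, '$.role.value')
    FROM person, json_each(person.json_data, '$.event_ref_list') AS eref
    WHERE json_extract(eref.value, '$.role.value') = 1
    """
).fetchall()
print(rows)  # -> [(1,)]
```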

Is that what you mean?

Yes, I mean that some of the Python serializing that is still done in the frontend can be moved to the database engine, as functions, saved queries, views, etc. If the database engine supports it, it might even be possible to utilize some ORM (SQLAlchemy? using the JSON objects already there maybe—I don’t know how it is done, or if it has any benefits for Gramps). It is still the serializing in the frontend that slows Gramps down the most, but if the database can send answers as filtered objects or dicts, that would already give a large speed improvement, because the frontend logic would have fewer serialized objects to iterate through.

Another thing that might speed things up a little could be a combination of multi‑process and multi‑I/O for selected features. For example, a multi‑I/O multi‑process to single‑process import/export, where data is sent to a queue before being written to the database. SQLite only supports one writer but multiple readers, while PostgreSQL or other backends that support full multi‑I/O (multi‑connection) would benefit even more. It could be a combined auto/manual setting for how many threads to use, depending on your hardware, while Python provides a default recommendation based on the system. As far as I understand, the library will use the right code depending on the OS platform.

PS. I know this last bit is most likely a lot of extra work, so this is not in any way a “request.” I’m just throwing out my thoughts. The only reason I know anything about this is because I’ve been reading about it for a script I’m trying to write for a scanning and transcoding project I’ve been thinking about for the last 10 years or so.

In fact, Gramps 6.0 does allow returning the JSON data directly, and it is used in many filters, without de-serializing it.

And the next generation of Gramps can filter at the database layer in exactly the same format in which developers write queries.

You have some interesting ideas. Some are probably not going to be developed for core Gramps because they are outside the scope of database-engine independence. But others are thought-provoking, and we appreciate them not as “requests” but as discussion points.

1 Like

In exploring JSON + SQL queries for the next Gramps (I’m thinking this might be called Gramps 7.0 because of the ability to do low-level queries), I have also found some places for optimization that we missed in 6.0. Those changes alone help a lot, but need more testing.

I’m creating a method of generating artificial trees that have all of the modern items that Gramps supports, so we can test optimizations. I’m trying to approximate the counts of various combinations of filterable things (such as twins, private people, adoptions, etc.). This is an ongoing project, as I have about half of the filters covered with test data.

There are 250 filters in Gramps, and I’m using what the filters do to make sure that, when applied to the test trees, each will match some objects (not all, and not none).
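
A toy version of that test-data idea might look like the following. The field names, gender codes, and rates here are invented for illustration; the real generator has to target all 250 filters:

```python
import random

def make_people(n, seed=0, private_rate=0.02, other_gender_rate=0.01):
    """Generate dict-based stand-ins for Person records."""
    rng = random.Random(seed)
    people = []
    for i in range(n):
        # 0=female, 1=male, 2=other -- assumed codes, for illustration only
        gender = 2 if rng.random() < other_gender_rate else rng.randint(0, 1)
        people.append({
            "handle": f"h{i:06d}",
            "gender": gender,
            "private": rng.random() < private_rate,
        })
    return people

people = make_people(1000)
# Each filterable property should match some, but not all, records:
assert 0 < sum(p["private"] for p in people) < len(people)
```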

To see the timings of the current tested filters on 100,000 people in 6.0, see:

I think most of the filters that take 3 to 10 seconds can be reduced to less than 1 second using SQL queries. Those filters that take more than 10 seconds will probably require a different strategy.

2 Likes

Wow. The cost of the IsFemale and IsMale filters for Person objects is a surprise. I might expect that for a Has rule, but not an Is.

Is that related to the recent “Other” addition to reduce the dependence on Unknown?

And why is person.HasOtherGender not an Is and why is it so much less expensive?

Both are basically the same, because the 6.0 code (and all versions before) is:

    def apply_to_one(self, db: Database, person: Person) -> bool:
        return person.gender == Person.MALE

Which means that to find the males, females, or any Has, you must go through all 100,000 objects (or however many you have). That is O(n) at the Python level. It is faster than 5.0 (in one respect) because of 6.0’s DataDict use, but still slow. Those filters all take about 5 seconds, the time to traverse all of the records at the Python level.

But the new code will be something like:

    def prepare(self, db, user):
        self.selected_handles = db.select_from_person(
            what="person.handle",
            where="person.gender == Person.MALE",
        )

    def apply_to_one(self, db: Database, person: Person) -> bool:
        return person.handle in self.selected_handles

which takes about 0.50 seconds, a 10x speedup!

I’m not sure what you are referring to with that question.

1 Like

I am wondering if the HasGenderOther is faster because the feature is a recent addition and more efficient code was written by people with more experience?

Or maybe it is because the statistics need to be normalized.

Maybe the differences in gender matching timings are because there are few “Others” to be found and there are somewhat more Males than Females in your sample data.

No, it is the same kind of code exactly, just replaced MALE with OTHER.

Nope, the stats do not come into play in any filter operation.

Yes, I think that can be the only difference (which I hadn’t noticed; thanks). It is the cost of returning about 50k handles, versus returning 2k handles. Interesting!

1 Like

Are the timings of “start filter to finalized refresh” the sorted and filtered view?

And is the timing test with the new code comparing apples to apples there?

I.e., is the view sorting overhead included in the “new code” timing test? Maybe that overhead is the 25% faster portion of the 2k Other results compared to the 50k Male results timings.

If not included, it might add a second (or more) to the 0.5 second result. Which would mean a huge downward revision in the 10x performance boost estimate.

All of the filter benchmark tests are just the application of the rules (including the time it takes to return the handles). There is no view component. But the view is another process I’d like to look at. If we could do lazy loading of the view, we could see dramatic responsiveness improvements in the GUI.

The only thing that changes in the benchmark test is the version of Gramps used. So, apples to exact apples.

No, you were right the first time: it is just the number of handles returned that has a bit of overhead.

1 Like

Thanks for the clarifications! I always question the data gathering methodology and normalization when reading stats.

(Darrell Huff’s 1954 classic “How To Lie With Statistics” and Gary King’s 1986 less popular “How Not to Lie With Statistics” corollary books are good reading.)

1 Like

There are actually 4 things done when applying a filter. Here is the breakdown for the rule “People without a known birth date” when applied to 100,000 people with the new SQL query API:

    Prepare time: 1.3625 seconds
    Pre-process time: 0.0178 seconds
    Apply time: 0.0019 seconds
    Post-process time: 0.0343 seconds
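
For reference, a per-phase breakdown like the one above can be gathered with time.perf_counter(). This is a minimal sketch; the phase bodies here are stand-ins, not Gramps code:

```python
import time

def timed(label, fn):
    """Run fn() and report its wall-clock duration, one line per phase."""
    start = time.perf_counter()
    result = fn()
    print(f"{label} time: {time.perf_counter() - start} seconds")
    return result

# Stand-in phases: "prepare" builds a handle set, "apply" tests membership.
handles = timed("Prepare", lambda: set(range(0, 100_000, 7)))
matches = timed("Apply", lambda: [h for h in range(100_000) if h in handles])
```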

I had to break down these times because I found that the post-processing time was taking about 25 seconds. It turns out that we had a sneaky slowdown in Gramps 6.0. Now fixed!

To compare, the time for this filter in Gramps 6.0 on 100k people was 5.717543 seconds.

1 Like

I hope that fix was a separate PR with a title mere users could understand? This change could be a big motivator.

One of the issues delaying users from upgrading in the last two release cycles has been the segmentation of the change log/release notes.

Many people used the release summaries to decide whether the benefits were worth the upgrading pain. And since the public release change log only lists the changes between the final beta and the accepted release candidate, those who did not participate in the beta testing were misled in their decision process.

Ennoborg surprised me when he would insist that there weren’t enough changes between 5.1, 5.2, and 6.0 to justify the releases. But he seemed to be basing that decision on the Release Notes, which contained only the changes applied in the last 3 weeks of each cycle.

1 Like

I have fleshed out the proposal into a GEP: GEPS 047: Select data from SQL API - Gramps

I also reconsidered a question raised by @DavidMStraub. I hadn’t fully exploited what the Python string expressions could query. Why not JOINs? Why not list comprehensions? And as I started to work on converting rules into the select_from_table format, I realized that these types of queries are needed.

Could I make an intuitive Python-based syntax that could do all of that? Yes, it turns out.

So, I refactored the API into a QueryBuilder that can handle a subset of Python expressions with a very intuitive API. Here is a JOIN:

    results = list(
        db.select_from_person(
            what=["person.handle", "family.handle"],
            where="person.handle == family.father_handle",
        )
    )

There are examples of the various supported formats in the GEP. Of course, feedback and questions welcomed!

2 Likes