The Road Ahead for database representations

In multi-tree Gramps Web, a tree is an identifier for which family tree you want to work with. All trees are stored in the same tables, so the tree ID allows you to operate as if there were only one tree’s data. A tree ID + handle then effectively becomes a new handle.
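A minimal sketch of that idea (purely illustrative, not Gramps Web’s actual code; the second handle/ID pair below is made up):

```python
from typing import NamedTuple

class TreeHandle(NamedTuple):
    """Illustrative composite key: tree_id + handle acts as the 'new handle'."""
    tree_id: str
    handle: str

# Two trees stored in the same tables can reuse the same handle without colliding.
people = {
    TreeHandle("tree_a", "02NKQC5GOZFLSUSMW3"): {"gramps_id": "I1487"},
    TreeHandle("tree_b", "02NKQC5GOZFLSUSMW3"): {"gramps_id": "I0042"},
}

print(people[TreeHandle("tree_a", "02NKQC5GOZFLSUSMW3")]["gramps_id"])  # I1487
```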

1 Like

I thought we might leave it there for a version, as a backup, in case there is an issue and we need to convert again.

There is a fairly standard SQL syntax, based on JSON_EXTRACT(), that appears to work for SQLite, MySQL, and PostgreSQL. It looks like:

SELECT json_data->>'$.tag_list[0].name' FROM person;

That is “slow” by SQL standards (the JSON data is parsed for each row), though still much faster than doing it in Python! But there are faster options: SQLite has Generated Columns that can be virtual or stored, and it looks like PostgreSQL has similar functionality.
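For example, a minimal sketch using Python’s sqlite3 module (the table and column names are assumptions, and a reasonably recent SQLite is required: 3.38+ for the ->> operator, 3.31+ for generated columns):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE person (handle TEXT PRIMARY KEY, json_data TEXT)")
con.execute(
    "INSERT INTO person VALUES (?, ?)",
    ("H1", '{"gramps_id": "I0001", "private": false}'),
)

# Extract a field from the JSON on the fly: the JSON is parsed for every row.
print(con.execute("SELECT json_data ->> '$.gramps_id' FROM person").fetchone()[0])

# A virtual generated column keeps the extraction in the schema and can be
# indexed, so repeated queries don't have to re-parse or re-state the JSON path.
con.execute(
    "ALTER TABLE person ADD COLUMN gramps_id TEXT "
    "GENERATED ALWAYS AS (json_extract(json_data, '$.gramps_id')) VIRTUAL"
)
print(con.execute("SELECT gramps_id FROM person").fetchone()[0])
```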

1 Like

I think the confusion here is that we have previously been discussing the idea of a tree_id to indicate trees that have been derived from a common source tree.

I’m prioritising Doug’s pull requests at the moment. When I have time I will get around to looking at all other contributions.

2 Likes

Does it make sense to have a “JSON Inspection” gramplet (expert or developer audience) that shows a JSON representation of the active object with its secondary objects?

It would allow users to quickly determine if an object is getting funky.

Maybe… these are pretty hard to read. Here is an example:

{"handle": "02NKQC5GOZFLSUSMW3", "change": 1185438865, "private": false, "tag_list": [], "gramps_id": "I1487", "citation_list": ["c140d245cc614c5c140"], "note_list": [], "media_list": [], "event_ref_list": [], "attribute_list": [], "address_list": [], "urls": [], "lds_ord_list": [], "primary_name": {"private": false, "surname_list": [{"surname": "\u041c\u0435\u0434\u0432\u0435\u0434\u0435\u0432", "prefix": "", "primary": true, "origintype": {"_class": "NameOriginType", "value": 1, "string": ""}, "connector": "", "_class": "Surname"}], "citation_list": [], "note_list": [], "date": {"format": null, "calendar": 0, "modifier": 0, "quality": 0, "dateval": [0, 0, 0, false], "text": "", "sortval": 0, "newyear": 0, "_class": "Date"}, "first_name": "Mary", "suffix": "", "title": "", "type": {"_class": "NameType", "value": 2, "string": ""}, "group_as": "", "sort_as": 0, "display_as": 0, "call": "", "nick": "", "famnick": "", "_class": "Name"}, "family_list": ["IRMKQCD49F9P9QTQRT"], "parent_family_list": [], "alternate_names": [], "person_ref_list": [], "death_ref_index": -1, "birth_ref_index": -1, "_class": "Person", "gender": 0}
1 Like

At the moment in the SharedPostgreSQLAddon, the tree ID is a property of the database connection, so it cannot be modified by any of the “normal” database access methods like get_person_from.... I consider that to be an important security feature to keep data from accidentally leaking between users in a multi-tenant environment, so I would strongly suggest treating handles and tree IDs on a different footing.
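A rough sketch of that pattern (illustrative only; not the actual SharedPostgreSQLAddon code, and the table layout is an assumption):

```python
class TreeScopedDb:
    """The tree ID is fixed when the connection is created; the normal access
    methods never accept it as an argument, so they cannot reach another tree."""

    def __init__(self, connection, tree_id):
        self._conn = connection
        self._tree_id = tree_id  # set once at login, never exposed for modification

    def get_person_from_handle(self, handle):
        row = self._conn.execute(
            "SELECT json_data FROM person WHERE tree_id = %s AND handle = %s",
            (self._tree_id, handle),
        ).fetchone()
        return row[0] if row else None
```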

1 Like

Perplexity indicates that there are 3 basic styles of JSON representation: compact, commented, and pretty (for printing).

If a JSON Inspection Gramplet used the “pretty” style, wouldn’t it be a lot easier for a human to inspect?

It is easy to output JSON in a pretty format.

SQLite has the json_pretty() function, and the Python json module has the indent option, which does the same thing.
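For instance (the compact string below is just a fragment of the example above):

```python
import json

compact = '{"handle": "02NKQC5GOZFLSUSMW3", "gramps_id": "I1487", "tag_list": []}'

# json.loads + json.dumps(indent=...) re-serializes with line breaks and
# indentation; ensure_ascii=False keeps non-ASCII names (e.g. Медведев) readable.
print(json.dumps(json.loads(compact), indent=2, ensure_ascii=False))

# The SQLite equivalent would be: SELECT json_pretty(json_data) FROM person;
# (json_pretty() requires a fairly recent SQLite.)
```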

2 Likes

I suspect that a JSON inspection would be very handy when diagnosing problems such as the current discussion: Parent Citation.

Sometimes Citations are buried deeply as secondary objects of secondary objects … or even deeper. So, identification of where missing Citations are attached helps identify where a report needs improvement.

Cross-reference related feature request:
0012436: “Deep References” gramplet (add optional configuration of existing…

I didn’t mean to suggest that it was hard to read because it wasn’t indented nicely (I tried to get Discourse to do that), but, as @emyoulation said, the structures can be quite deep and complex.

But I don’t see why the clipboard couldn’t allow pasting (or viewing) the raw JSON if one wanted to.

1 Like

As we finish up the conversion of blobs to JSON, I’m thinking about the details of how to take advantage of it. The two things I’m working on:

  1. create “business logic” functions for generic and dbapi layers
  2. add an extract_data() method for generic and dbapi layers

It is important to always have a generic implementation (one that uses standard database methods) so that any DB that doesn’t (or can’t) implement something better still has a working implementation.
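As a sketch of that pattern (the class and method names here are made up, not the real Gramps API):

```python
class GenericBackend:
    def get_all_gramps_ids(self):
        """Fallback that any backend inherits: uses only the standard
        object-access methods, so it always works, just more slowly."""
        return [
            self.get_person_from_handle(handle).gramps_id
            for handle in self.get_person_handles()
        ]


class SQLBackend(GenericBackend):
    def get_all_gramps_ids(self):
        """Optimized override for SQL backends: pull the field straight
        out of the stored JSON instead of building Person objects."""
        return [
            row[0]
            for row in self.connection.execute(
                "SELECT json_extract(json_data, '$.gramps_id') FROM person"
            )
        ]
```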

For (1), I realized that there is no way for an addon writer to use JSON until specialized methods are added to generic and (at least) dbapi. It would be nice if there were a general generic method that also had an optimized version.

Which got me thinking about (2). I wrote a prototype such that this would work in generic and dbapi:

db.extract_data(["$.gramps_id", "$.private"], handle="12345")

But that gains very little (100 milliseconds) just by moving the jsonpath evaluation to SQL. However, if you change extract_data (rename it, and change the format of the handle argument and value slightly), then you get:

db.select(["$.gramps_id", "$.private"], ("$.handle", "=", "12345"))

This version operates identically to the above in terms of efficiency, but now it is directly translatable into other languages, like SQL. With this, there is a db method that works on all database types and can access the power of SQL (or other db engines). It doesn’t do joins or other more sophisticated expressions, and is limited to only what jsonpath supports in Python. But it does give users and developers a doorway into writing faster code without waiting for someone to add “business logic” to the core codebase.
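To make that concrete, here is a rough sketch of what a generic (pure Python) fallback for such a select might look like. It uses the jsonpath-ng package to stand in for whatever JSONPath support is actually used, and the function shape and names are assumptions, not the prototype itself:

```python
from jsonpath_ng import parse  # assumption: any Python JSONPath library would do

def select(rows, fields, where):
    """Generic fallback: evaluate JSONPath in Python over raw JSON dicts.

    rows   : iterable of dicts (each row already run through json.loads)
    fields : JSONPath strings to project, e.g. ["$.gramps_id", "$.private"]
    where  : a (jsonpath, op, value) triple, e.g. ("$.handle", "=", "12345")
    """
    path, op, value = where
    if op != "=":
        raise NotImplementedError("sketch only handles equality")
    condition = parse(path)
    projections = [parse(field) for field in fields]
    for row in rows:
        if any(match.value == value for match in condition.find(row)):
            result = []
            for proj in projections:
                found = proj.find(row)
                result.append(found[0].value if found else None)
            yield tuple(result)
```

A SQL backend could instead translate the same call into something along the lines of SELECT json_extract(json_data, '$.gramps_id'), json_extract(json_data, '$.private') FROM person WHERE json_extract(json_data, '$.handle') = ?.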

What do you think?

Rereading the thread from 2016, I don’t think this suggestion is going to go anywhere, although that conversation had people talking about many different things (GUIs, non-filter code).

In any event, I think a collection of “business logic” methods might strike the right balance, and I’m excited to get working on that. (I do have a working db.select() as described above if people are interested).

1 Like

Now that the conversion from blob data is under review, this is a good time to look at performance, and explore possible changes to the filter engine.

For this analysis, I did 5 experiments:

  1. Regular isAncestorOf all people in Example (blob)
  2. Regular isAncestorOf all people in Example (JSON)
  3. No constructing Person() in loop
  4. No loop
  5. No loop, SQL biz logic

A brief explanation of 3, 4, and 5: there are two kinds of filters:

  • those filters that take a Gramps object, perform some logic, and return True or False
  • those filters that compute a set of matches beforehand, and then just make a lookup to see if the result is True

The first type requires a loop. The second kind doesn’t really need a loop… it has already figured out everything that matches. So, for these experiments, I step by step tried different options to see where the expensive bits are.
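A minimal sketch of the two styles (the class names and method signatures are illustrative, not the actual Gramps Rule API):

```python
class HasTagRule:
    """Style 1: per-object logic; the engine must loop over every person."""

    def __init__(self, tag_handle):
        self.tag_handle = tag_handle

    def apply(self, db, person):
        return self.tag_handle in person.tag_list


class IsAncestorOfRule:
    """Style 2: compute all matches up front; apply() is just a set lookup."""

    def __init__(self, root_handle):
        self.root_handle = root_handle
        self.matches = set()

    def prepare(self, db):
        todo = [self.root_handle]
        while todo:
            person = db.get_person_from_handle(todo.pop())
            for family_handle in person.parent_family_list:
                family = db.get_family_from_handle(family_handle)
                for parent in (family.father_handle, family.mother_handle):
                    if parent and parent not in self.matches:
                        self.matches.add(parent)
                        todo.append(parent)

    def apply(self, db, person):
        return person.handle in self.matches
```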

The results:

First, the bad news: just switching from blobs to JSON made this process almost 3 times slower. I guess this makes sense, as unpickling is implemented in C. It’s fast, but risky in a number of ways (including that it is a really bad format for historical data, since it changes over time).

Moving on from there, we see some very dramatic possible speed-ups. First, getting rid of the object creation drops the time from 375 to 93 seconds. This will be possible because we can use the JSON format where possible. But the largest change comes from getting rid of the loop where possible: that drops it down to 8 seconds. Using SQL business logic drops it down to 3 seconds.

This suggests that refactoring the filter engine will bring about the biggest changes in performance.

4 Likes

The results shown in this chart were very confusing to me. Why would using JSON (2), which is only slightly slower than blobs, make such a big difference here (compared with 1)? And why would removing the object (3) be that much faster?

Now I know: the filter system recursively does too much. Just removing the unnecessary bits is going to speed this up. Then we can do optimizations on top of that.

2 Likes

I refactored the filter engine and used the JSON representation instead of objects (no other changes); the test now clocks in at 100 seconds, faster than the previous blob version and very close to the js-no-obj (3) hack at 93 seconds. There are about 275 filters/rules that will need to be slightly adjusted.

Combining business logic with the above knocks another 10% off the time, down to 90 seconds.

I’m not 100% sure the sub-10-second optimizations will generalize to all filters and rules, but even without them, the conversion to JSON is already worth it.
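To make the “slightly adjusted” part concrete, here is a purely illustrative example of what the change to a rule might look like when it receives the raw JSON data instead of a constructed object (the names are made up):

```python
class HasIdRule:
    def __init__(self, gramps_id):
        self.gramps_id = gramps_id

    # Before: the rule got a fully constructed Person object.
    def apply_to_object(self, db, person):
        return person.gramps_id == self.gramps_id

    # After: the rule gets the decoded JSON dict; attribute access becomes
    # key access, and no Person() ever has to be built.
    def apply_to_json(self, db, data):
        return data["gramps_id"] == self.gramps_id
```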

Are there similar opportunities for performance improvements in other recursive code (for example, the relationship calculator)? I once tried adding a “Relationship to Home Person” column to the People category view, but it was extremely slow when I filtered on it and seemed to hang when I tried to sort on it. And that was on a small database (~5,600 people).

I hope so :smile:; I would think so. Can’t wait to explore more!

The Deep Connections gramplet is one of the slowest and most resource-intensive Gramplets.

It probably exercises the widest variety of relationship calculations.

2 Likes