The Road Ahead for database representations

dsblank · November 1, 2024, 4:09pm

An update (and look at what is possible) as we move away from pickled blobs and toward JSON data in the Gramps database.

First, @Nick-Hall has already refactored existing JSON methods for converting to and from Gramps objects. This cleaned up the system, made the representation 100% idempotent, and made the conversions much faster than they were.

With this solid foundation, I wrote a db conversion from blob to JSON, and rewrote the access to the database to only use the JSON interface. This in and of itself doesn’t do too much (it does clean up a lot of cryptic code that accessed the unpicked array structure, and makes the data much easier to use outside of Gramps).

But the next steps can take advantage of the new data representation. First, we can index on, and provide direct access to, items that previously had to be unpickled and perhaps converted into a Gramps object. This is a gain for some operations. This functionality uses a standard JSONpath interface to access items in the JSON data fields, but only one row at a time.

A next further step could be taken to move this type of JSONpath access to the SQL layer. Moving the JSONpath to SQL doesn’t offer a huge gain when operating on data row by row. But allowing SQL to use SELECT, ORDER BY, and WHERE clases on these JSONpaths is a game changer.

I’m working on a proposal to have some expensive filters (such as “is ancestor of” that could return a list of handles) that are first refactored into a single abstract database method (currently called “business logic”). Nothing would be new here, just a rearrangement of current code.

But, if a low-level database can implement that functionality using the power of SQL (or whatever backend) then we can get huge gains on more operations.

At that point, certain operations in gramps-desktop can be made much much faster. I would imagine some gramplets, like “Deep Connections” could be very fast. But gramps-desktop would still be limited in some ways. For example, it still might not be able to handle very large databases.

But new projects, like gramps-web could. I hope to work with @DavidMStraub to allow it to take full advantage of all of the above.

If you have any questions or comments, let us know.

emyoulation · November 1, 2024, 7:33pm

If you want to follow along on the heavy lifting being done for this change, follow the conversation for the Pull Request:

github.com/gramps-project/gramps

Switch from pickled blobs to JSON data

gramps-project:master ← gramps-project:dsb/depickle

opened 09:56PM - 10 Oct 24 UTC

dsblank

+615 -474

This PR converts the database interface to use JSON data rather than the pickled… blobs used since the early days. 1. Uses a new abstraction in the database: `db.serializer` a. abstracts data column name b. contains serialize/unserialize functions 2. Updates database format to 21 3. The conversion from 20 to 21 reads pickled blobs, and writes JSON data. a. It does this by switching between serializers 4. New databases do not contained pickled blobs 5. Converted databases contain both fields

Thanks Doug (@dsblank) and Nick (@Nick-Hall).

And thanks to the GitHub contributors doing sanity checks.

emyoulation · November 1, 2024, 7:41pm

Peplexity was asked to comment on the implications:

This change-in-progress to the Gramps source code represents a significant architectural shift in how Gramps handles data storage and retrieval. Here’s an assessment of the proposed changes:

JSON Conversion Refactoring:
- Positive: Improving idempotency ensures consistency in data representation.
- Positive: Faster conversions will enhance overall system performance.
- Positive: Cleaner code improves maintainability and readability.
Database Conversion from BLOB to JSON:
- Positive: Makes data more accessible and readable outside of Gramps.
- Positive: Simplifies cryptic code, improving maintainability.
- Potential Concern: Migration of existing databases may be challenging.
JSON Interface for Database Access:
- Positive: Provides a more standardized and flexible way to interact with data.
- Positive: Potentially easier to implement new features or integrations.
Indexing and Direct Access to JSON Data:
- Positive: Improved performance for certain operations.
- Positive: Eliminates the need for unpickling in many cases.
- Positive: JSONpath interface provides a standard way to query JSON data.
Moving JSONpath Access to SQL Layer:
- Highly Positive: This could significantly improve query performance and flexibility.
- Positive: Allows for more complex and efficient data operations directly in the database.
- Potential Concern: Might require careful optimization to ensure performance at scale.

Overall Assessment:
This change represents a modern approach to data management in Gramps. It aligns with current best practices in database design and API development. The shift to JSON-based storage and querying offers several advantages:

Improved Interoperability: JSON is widely supported, making it easier to integrate Gramps with other systems or tools.
Enhanced Query Capabilities: JSONpath and SQL integration can lead to more powerful and flexible data retrieval options.
Better Performance: Direct JSON querying can be faster than unpickling BLOBs for many operations.
Improved Maintainability: JSON is human-readable, making debugging and data inspection easier.

Potential Challenges:

Data Migration: Existing Gramps databases will need a migration path.
Performance Tuning: Ensuring that JSON-based queries perform well with large datasets.
Backward Compatibility: Maintaining support for older Gramps versions or plugins.

In conclusion, this change appears to be a well-thought-out modernization of Gramps’ data layer. It has the potential to significantly improve the software’s flexibility, performance, and maintainability. However, careful implementation and thorough testing will be crucial, especially regarding data migration and performance with large genealogical datasets.

Citations:
[1] Groovy - Convert object to JSON string - Stack Overflow
[2] JSON schema specification
[3] API: Issue Opening Databases in Gramps 5
[4] GitHub - gramps-project/addons-source: Contributed 3rd party plugins for Gramps
[5] Excel, CSV switch to webtrees? - webtrees
[6] Gramps 5.1 Wiki Manual - Command Line - Gramps
[7] Supply id via csv import
[8] python - How can I parse a CSV file and display only specific column? - Stack Overflow

SteveY · November 1, 2024, 11:40pm

My reading of the code is that this is already implemented. Databases are converted the first time they are opened. The old “blob” representation is left intact and the new JSON representation is added. Thereafter only the JSON data is updated

SteveY · November 1, 2024, 11:44pm

@dsblank any plan to add a command to delete the “blob” representation?

dsblank · November 2, 2024, 1:43am

That’s a good idea. There is a work-around: create a new database, and import the data back in. But I can think of a few short scripts that would be useful to have a command-line API for.

dsblank · November 2, 2024, 1:46am

Indeed, you are correct. It took me a bit to figure out how to read the old data format, and write out the new one. But I think the solution (swapping serializers back and forth) is pretty robust. We may always have to keep that functionality, to allow being able to migrate old trees.

emyoulation · November 2, 2024, 3:16am

Gramps has 2 forms in which a person might inherit data and need to convert to a modern version: database files and an XML variant.

And there have been several instances where users have had databases the current version of Gramps could not longer read. They had to figure out which database engine and file version, then find the right installer for the last version of Gramps that would read it so they could make a .gramps/.gpkg backup.

See convert older database

Recently, the support for BSDDB was removed. Then had to be restored for a capacity to upgrade old databases.

So it will be important to think about the upgrade path. (At some point, it may become necessary to create a tool that is exclusively about identifying and recovering outdated Gramps formats… so that Gramps can be freed from constraints. BLOBs might become too onerous to carry conversion support forever.)

Has anyone asked @DavidMStraub about making Gramps (for desktops) able to recognize Gramps Web database formats?

DavidMStraub · November 2, 2024, 10:07am

Gramps Web uses Gramps core & Gramps addons for database access. The only addon that I specifically developed for Gramps Web is the SharedPostgreSQL addon, but it’s a normal addon that could also be used with desktop in principle (although I doubt anyone does).

dsblank · November 2, 2024, 11:27am

It is even worse than that. The “pickle” format has changed over the years, so one version of Python might not be able to read another Python’s pickled data, locking that database to a particular version. Another reason to get rid of pickled data.

dsblank · November 2, 2024, 12:28pm

@DavidMStraub, I hadn’t seen SharedPostgresSQL yet. This has some really interesting DB methods in it, and would great to be able to use those in more places. I see some good “business logic” functions there.

I also see that it would be hard to use outside this DB due to the introduction of the tree_id. That is a nice addition to solve the problem of managing more than one tree with one database.

But what if the tree_id was part of the “handle”? Maybe just “tree_id:handle” (like “1234:7787”). Or maybe it could be JSON (like “{‘handle’: 7787, ‘tree_id’: ‘1234’}”, if ever we needed to separate them at the SQL level. Then this code would be reusable with gramps core, and the code could be reused there. What do you think?

*After thinking about, I see that my suggestion wouldn’t work. But what if we included treeid in the new JSON representation? *

DavidMStraub · November 2, 2024, 1:27pm

*After thinking about, I see that my suggestion wouldn’t work. But what if we included treeid in the new JSON representation? *

For databases with many trees, it’s important to have an index for the tree ID. I don’t know much about indexing JSON values, is that possible e.g. in SQLite? What would be an advantage compared to a normal column?

ennoborg · November 2, 2024, 1:30pm

What is a tree, in this context?

Nick-Hall · November 2, 2024, 1:51pm

Yes. When the raw format changes to JSON we should remove the old pickled data.

Nick-Hall · November 2, 2024, 2:47pm

The implementation of a tree_id will be backend specific. For a SQL database a normal column would probably be simpler, but for a MongoDB backend it would go into the document. SQLite can index on JSON expression though.

Gramps objects don’t contain a reference to the database or tree that they reside in. Since the raw JSON format corresponds directly to the objects, it probably shouldn’t contain a tree_id or db_id either.

dsblank · November 3, 2024, 1:56am

One counter-argument that I had to deal with many years ago is copying items to and from the clipboard. When you copy an object to the clipboard, it keeps track of what database it came from. This allows items to stay in the clipboard even when you are using a different database.

After thinking about this more, yeah, I agree that we’ll have to keep treeid in the shareddbapi. I was hoping to keep nearly identical code from having to be copied in business logic modules, but I can’t think of a clean way to do it.

DavidMStraub · November 3, 2024, 10:13am

By the way: since we are doing all this refactoring now, I would appreciate if we could merge Add mypy static type checker to CI and necessary fixes to the source code by DavidMStraub · Pull Request #1717 · gramps-project/gramps · GitHub. Type hints will be quite useful to avoid accidental breakage. When 1717 is merged, I can start adding more type hints throughout gramps.gen.

SteveY · November 3, 2024, 10:26am

Random Sunday thought: if (when) this gets merged do we expect it to be more efficient to produce the JSON representation of an object than the XML representation (because the JSON representation is already stored in the DB)? If so, does this open up the opportunity to use JSON for the backup capability within Gramps?
As we know the current backup capability does take significant time\CPU resource

dsblank · November 3, 2024, 12:57pm

Looks good to me! This will be very helpful and make Gramps-core more solid.

dsblank · November 3, 2024, 1:02pm

I would think so. This will be a simple dump to disk. On that topic, I would recommend that we use JSONLines format (one JSON object per line) so that we don’t have to read in the entire file first to decode. Should make even large trees save and load fast.

Topic		Replies	Views
Understanding Gramps 6.0 Development roadmap	25	1652	February 19, 2025
Error using PostgreSQL as database backend with Gramps v5.2.2 Help	85	851	May 29, 2024
JSON schema specification Development metadata	17	1019	June 15, 2022
Making Gramps filters Faster, and then Superfast Development performance	15	305	February 1, 2025
Business Logic with overloaded methods Development performance	13	288	February 1, 2025

The Road Ahead for database representations

Related topics