JSON schema specification

I am presently experimenting to “unhide” record structure in database. The current way of storing data in SQLite is to take genelogical data and to stuff it into a BLOB. The record contains only auxiliary service fields (gramps_id, handle and a few index fields) besides this BLOB. As such, data is opaque and the DB can’t be used outside Gramps.

I have started to unravel some primary objects using their JSON schema to automatically generate TABLE SQL statements (or equivalently ALTER TABLE statements). Incidentally, this requires some modifications to the present table architecture: more TABLEs are needed to represent the array lists.

I work by guessing the JSON “components” of the schemas.

=> Is there somewhere a formal specification of the JSON used in Gramps?

This would greatly simplify implementation of a “pure” SQL database.

Incidentally, several questions about design arise. Maybe they should be asked in a separate post? One of them is about parson main name. In the vast majority of cases, a person is known under a single name and it is justified to store this name inside the Person record. However, a person may be designated by several names. These names are stored in an “array” of Name secondary objects, described by their own JSON.

Including the main name in Person needs to go through decoding a nested schema (possibly recursively) and this seriously complicates matters (notably assigning a user-visible name to the added fields). Notwithstanding performance issues, wouldn’t it be more consistent to have the main name as a Name record and only store an id or index in the Person record?

There are other examples of the same issue among other primary objects.

2 Likes

Why not using a MongoDB database and its collections to store Gramps data, as JSON documents, structured in some way as they appear in Gramps XML backup files, but translated in JSON? Does this will be a too complex operation/conversion of Gramps itself?

I was reading this some days ago and i was thinking it’ll be a beautiful DBMS to use with Gramps:

1 Like

Nick wrote a MongoDB addon but it is still experimental. Not sure how well it works or what his plans are for it longer term but that would probably be worth looking over.

3 Likes

Nick has created an experimental upgrade to Gramps that stores JSON objects in the db.

We don’t have a formal spec on the JSON in Gramps. The Source code has schema; see the get_schema() method in each of the objects in gen.lib.

I tested it quite a while back but found the performance disappointing. I made a few improvements myself, but it was still fairly slow compared to the current sqlite db. setup.

My main criticism against the present dbapi implementation is the multi-layer approach.

Gramps was built around BSDBD which is a key/value storage engine. Intrinsically, genealogical data is “relational”. The relational aspect of the DB is simulated in a software layer between Gramps Python code and BSDBD though still storing data in broad key/value records where key is generally a “handle” and value a pickled representation of a Gramps record (persone, family, place, …).

The new dbapi interface allows to store data in an SQL database (SQLite or other engine if a module/package is written). However, in order not to break compatibility and avoid to rewrite object management (in short, to keep the present successful code), key/value handling is simulated over the SQL DB. This is done easily by having only two columns in a TABLE: “handle” again and a BLOB for the pickled record. A few auxiliary columns are added for index retrieval, but basically we have a key/value pair.

MongoDB probably does not change this approach and perhaps adds another emulation layer by adding a step of encoding/decoding a JSON schema.

There may also be licensing issue because MongoDB is released under its own licence SSPL which presents a conflict with GPL in that SSPL seems to require that the whole code of the application be released under SSPL. In other words, using MongoDB components/library would require a change of licence for Gramps even where MongoDB is not referenced.

The Wikipedia MongoDB article claims that MongoDB has been removed from several distributions though I can still find it available in Fedora main repository (not RpmFusion which has relaxed criteria) but I have not installed it to check if it is really complete and operational.

As I already mentioned it:

Stuffing “external” data into Gramps objects (like the principal name in a person object while all others are in separate array) leads to questionable SQL architecture such as duplication of columns which belong “naturally” in other TABLEs.

I have now an valid recursive JSON translator which does the job but I consider that a simple index among the names would be more consistent. This trick already exist (in a Person record) for birth and death events. The index references these events in the arry event. This could be transposed to the principal name. Similarly for the main name of a place.

I read somewhere that it stores only the “string” for a Gramps type. I’m working on a “dual” implementation for my SQL DB where I only consider the index. My idea is to store the mapping between index and human-readable name in the DB-metadata. “System” type designations would be encoded in range 0-127 and custom type in 128-255 with the string in metadata. Presently, you can’t change globally wording for a custom type because this wording in the object occurrence. You must then track down all occurrences to change it or fix a typo. Of course, there are issues when importing data but it can be handled quite easily at import time. But perhaps limiting type encoding to 255 is bit too strict.

I’m following this discussion with interest because I currently use the SQL Export (with some patches) so that I can write various queries. (I’m not a Python developer but am comfortable writing SQL. I once tried to use the experimental Query add-on but got nowhere with it; I forget why.)

It would be nice if I could query the Gramps database directly, so that I could skip the export step. I haven’t tried Nick’s experimental version that Paul mentioned. I know that SQLite has some functions to deal with JSON, but I imagine the queries would be more difficult than they are with the current export.

So, I’m interested in any changes that might result in a “simpler” data model (in the sense of being pickle- and JSON-free), even though that would mean more tables. And meanwhile I will continue using the SQLite Export.

1 Like

I’ve always found Tony Proctor’s thoughts on this whole database topic to be interesting food for thought:

1 Like

Yes, that’s an interesting article. I’ve also used XSLT against the Gramps XML export, but I much prefer writing SQL.

Interesting article but I think it misses a fundamental point. There are two requirements about genealogical data: processing and storing.

Processing is the phase when you’re actively updating your data. This is done in memory on an ad-hoc internal representation. Basically what you’re dealing with is a direct acyclic graph (DAG). To be efficient the application should use algorithms and data structures adapted to this fact. There is an abundant literature on graph theory.

Side remark: I’m not convinced that Gramps-Python implementation is optimal here because multiple inheritance creates havoc in the data structures. It is used for sub-structure aggregation rather than for “pure object” inheritance and this creates duplication (see my complaint about the principal name included in a person while alternate names – absolutely identical in organisation – are stored in a separate array).

Storage is the repository of data between processing phases. Since data is the real asset of genealogical research (and very expensive to collect, at least in terms of man-hours), it should be stored in a “universal” way so that it can be reused with other engines (other applications), optionally capable of being updated outside applications. This calls for a different representation than the internal one. It should stand long-term storage (we’re talking here for decades which is eternity in software business), have the simplest structure (i.e. the DAG is flattened), be easily converted to standards should they exist, avoid containing non human-readable encoding (in other words, only use character or textual representation).

A DB with explicit columns for user data plus auxiliary “structural” fields (describing the DAG vertices) is presently the preferred choice. The BSDDB option doesn’t meet my criteria because of the pickled “value”, nor does the SQLite one because of the BLOBs.

The article seems to suggest the same tool for both processing and storing. I like Gramps approach where there is an adaptation layer between storage and memory. External data is converted to internal form. This may accidentally result from development history as the successful processing code was not changed when other storage options were offered. Probably more care should be granted to this conversion to improve the internal representation towards more uniformity.

Is there any specific reason why you’re following the JSON route and not the format used by the SQLite exporter? I tried that export, and imported the result into a new tree, and it looks like it’s rather complete, meaning that we do already have a full relational model.

For me, such a database would be a nice start to build queries against a Zotero DB, and for other things, like looking for duplicate sources, and merging them, although the latter might already be possible with the Isotammi Supertool.

3 requirements… processing, storing and presentation

Else I most of agree with everything you write… but I also think that a graph format e.g. network graph, is a great way of both process, store and present genealogy data.
as an example User-friendly and controlled data-management - Graphileon
I also believe in interoperability or interchangeability with other research tool (not other lineage-linked family tree software)…
i use the mongodb addon Nick developed, and it work, but for Windows it is aged, the “driver” is aged etc.

Good luck with your project

Sorry for late answer. I was away for private matters.

  1. I was not aware of this add-on. My first goal is to work on the core of Gramps where the main functionality is implemented. From this point of views, add-ons offer additional capabilities while relying on the core features (this can be also stated as: they present data in a different way without changing the base semantics of data).

  2. I want to make sure that any change in the Gramps schema is automatically translated into the DB structure (a single point of modification). The JSON schema looks like the pivot point for that.

  3. After reading your comment, I had a look at the SQLite exporter. Well … (no offence intended towards the author, nick hall or prculley, whoever contributed to it), I find it … “incomplete” (word may be ill-chosen). I mean not all Gramps information is exported to the SQLite database. Consequently, this export cannot serve as a substitute for Gramps DB. I have not experimented to check that all “arrays” are correctly exported into “ref” TABLEs. However I note an idea to evaluate: dates are all stored as records of a specific TABLE (due to their complex structure), but I didn’t see where “handles” to these records are stuffed into other TABLEs (very superficial reading).

From my point of view, there is a huge risk in this add-on. It is an independent manual transposition of the internal object organisation. It is vulnerable to any modification in the JSON schema or object definition. I prefer to dynamically generate the TABLEs and “sign” them with the JSON schema version so that I can detect possible discrepancies.

As I already mentioned, I want to make “visible” object contents and remove the key/value emulation over a relational DB emulation over the key/value BSDBD store so that data is directly stored into an SQL DB.

Once we have a neat and clean SQL store, interoperability and interchangeability is much easier.

1 Like

Hello Patrick,

Are you sure about that export being incomplete? I did a full round trip (export - import) with it before I wrote my message, and I ran another one today, on Windows, and as far as I can see, no information is lost. I checked this, by exporting my tree to GEDCOM, and to SQLite, importing the SQLite export into an empty tree, and exporting that to GEDCOM too, and then comparing the GEDCOMs. And when I do that, the only differences that I see are in the CHAN tags, meaning that all entries are marked as if they were changed today, most probably because the imported adds the current time, which IMO it should not do, because there was no manual change made by the user (me). All event dates are OK, so for me, functionally, the copy is identical to the original.

Now you can say that my comparison is incomplete, because some data is left out from GEDCOM exports. so I may have to repeat my comparison with backup files, comparing the XML. Either way, I don’t think it’s as bad as you suggest.

Anyway, the main reason why I brought this up, is that the SQLite exporter proves that we already know how to normalize the Gramps database, and you can already use the SQLite export for some interesting things. The only drawback is that this type of interoperability comes at a cost, because the exports and imports take time, and mess up the record changed timestamps.

As far as I’m concerned the exported model is good enough for any kind of data extraction, and for me personally, making changes to the SQLite tree is easier than working with the supertool.

I need to add that for the hacks that I like to do, it’s even easier to read the whole backup file into an XML tree in C#, and modify that in memory. I have a program that does that on Windows, running in the community edition of Visual Studio, and I can run that on Linux too, inside the Linux port of dotnet. I call that Gramps.Net.

Like you, I really like to have a normalized database, so that I can operate on that, and others may like that too, to link with Zotero and other tools. I may even use that to link data with RootsMagic, which is the program that I use to exchange data with FamilySearch, I sometimes also use Ancestral Quest for that, but I mention RootsMagic also because its database is fully documenten by users on the web, and they also use SQLite.

IMO, our SQLite model is as neat and clean as RootsMagic’s, meaning that it’s quite useful, if you invest some time in it.

For inspection purposes of that JSON structure, have you considered a generic import/export Shell Gramplet for a single record? Although a bit of a sidestep, it could allow you to proceed with confidence and more feedback between developers.

There are currently JSON import and export plugins. (2013-2017 by Doug Blank & Nick Hall.) If there was a Gramplet similar to Doug Blank’s Text Import Gramplet had a pop-up “Format” menu that allowed the User to select an Import/Export format, then a lot of opportunities open up. (The too-transparent automatic recognition of Text formats stifles some possibilities of that Gramplet.)

Initially, if this just had an additional Fetch button to complement the Import, then the user could Fetch the Active Object and display the export format to the Gramplet’s text window for inspection… without all the hassle of saving it to a file and loading into another application. Since the Text box allows direct editing, you could tweak the formatted object.

Such high visibility could accelerate the evolution of any particular Format’s support within Gramps or troubleshoot Import/Export tools for that Format. Your focus might be on the idiosyncrasies of the JSON import/export.

Adding an Include Secondary Objects option could even allow a small snippets of a Tree to be moved quickly and easily between Trees. But the Gramplet would have to not auto-refresh as the Active Object shifts focus or another Tree is loaded or the Gramplet would become a wallowing resource pig.

Users could copy the text window’s content to the OS clipboard or Notepads to move multiple objects. If it had the Copy to clipboard and File bug report like the “Send Bug Report” tool, then evolution could further accelerate.

Down the line, the Gramplet might be expanded to “save to”/“load from” or even use a pipe to another application. And the focus on “Active Record” could be expanded to “Active View records” or “Selected records of the Active View”. (Maybe that could lead to a KML/KMZ importer/exporter to complement the Geography View.)

@pgerlier I ran another round-trip test, this time looking at the XML files, and that test confirms that the SQLite export is complete, except for the bookmarks.

The SQLite export is also missing the surname groups (which are in the XML). I submitted a feature request for this.

1 Like

OK, I missed that, because I don’t use those groups, so they couldn’t be left out by the export either.

But on reading that, and the attached thread, in which Paul commented that the SQL exporter was experimental, I think that we need to ask ourselves about investments like these. And I call them investments, because I know that with limited resources, and effort that goes to a particular format may hurt the investment in another one, and Thomas Wetmore concluded years ago, that formats like GEDCOM, and JSON, and XML, are just formats, and are all equivalent in the sense that they can represent graphs, and any of these can be transformed to another without data loss. And if you know what you do, you can write it to SQL too, meaning that the whole no SQL craze is just a distraction.

What counts most is how fast it is, for storage, and how easy it is for interoperabily, or horizontal integration, as @emyoulation put it earlier in another thread.

And with interoperability, I don’t just mean address books, and mail storage, and Zotero, but also content management systems, like WordPress, or note keeping software like Evernote, or Google Keep, or Microsoft OneNote, and maybe also open document formats, like those in Libre Office.

1 Like