Proposed Tree Related Metadata Attributes / XML Changes

I have almost finished a first iteration of adding a number of new metadata attributes to the database and wanted to solicit feedback from @Nick-Hall and others here.

The following are what I have at the moment:

database_uuid - Unique to the database serving as backing store for a tree.
database_created - Timestamp when the database uuid is created, which for new databases reflects creation time.

The two above are intended to be immutable; the ones below are not.

database_last_transaction - Timestamp of the last completed database transaction.
tree_uuid - Unique to a tree. This differs from the database uuid, as the tree might be distributed across machines and changes synced, as @DavidMStraub has implemented in the Gramps WebAPI for Gramps.js.
tree_uuid_change - Timestamp of last known change to the tree uuid, as I am unsure it should be treated as immutable.
tree_name - The tree name will be stored here in addition to name.txt for sake of completeness.
tree_name_change - Timestamp of last known change to tree name.
tree_copyright - Copyright for the tree.
tree_copyright_change - Timestamp of last known change to copyright.
tree_description - The tree description. My intent is for CardView to use this for Gramps 5.2.
tree_description_change - Timestamp of last known change to description.
researcher_change - Timestamp of last known change to researcher stored in the database, not the preferences.
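
As a rough illustration of the list above, the attributes could be held as key/value pairs where each mutable value carries its own change timestamp. This is only a sketch: the `TreeMetadata` class, its methods and the `IMMUTABLE_KEYS` set are hypothetical names and not part of the Gramps API.

```python
import time
import uuid

# Hypothetical sketch; none of these names exist in Gramps itself.
IMMUTABLE_KEYS = {"database_uuid", "database_created"}


class TreeMetadata:
    def __init__(self):
        now = int(time.time())
        self._values = {
            "database_uuid": str(uuid.uuid4()),
            "database_created": now,
            "tree_uuid": str(uuid.uuid4()),
            "tree_uuid_change": now,
        }

    def get(self, key, default=None):
        return self._values.get(key, default)

    def set(self, key, value, timestamp=None):
        """Store a mutable attribute and record when it changed."""
        if key in IMMUTABLE_KEYS:
            raise KeyError(f"{key} is immutable")
        stamp = int(time.time()) if timestamp is None else int(timestamp)
        self._values[key] = value
        self._values[f"{key}_change"] = stamp
```

For example, `meta.set("tree_name", "example_gramps_test")` would store the name and also update `tree_name_change`.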

For the Gramps XML I propose the <header> be expanded to accommodate them as well. Below is a sample of what I currently have, with the <export-note> mocked up:

    <created date="2022-09-08" version="5.2.0"/>
      <content>This is a sample export note that might be present once the exporter is enhanced to enable it.</content>
      <copyright>2022 Lewis Anderson Garner Zelinski</copyright>
      <description>The sample example gramps tree with the famous Lewis Anderson Garner Zelinski</description>
      <resname>Alex Roitman,,,</resname>

I will add a tool to edit the tree copyright and description as well.

Enhancing the exporter dialog to add the export-note support would be a separate item, but the code will have the necessary support in place.

Does the above seem okay to everyone? Are there other tree related attributes that people think would be useful to add?



Is there not a chunk already defined in an existing XML database standard related to change management?

The nice thing about “standards” is that there are so many from which to choose.

Ah yeah I should be cleaner about it… this should do…

    <created date="2022-09-11" version="5.2.0"/>
      <dbid change="1662683177">601f53e0</dbid>
      <uuid change="0">7b2da2ca-5dca-4f17-a6bf-dbf69733457b</uuid>
      <uuid change="1662872674">04f6ef65-8dee-43f2-bbbc-70769eb6f3a3</uuid>
      <name change="1662869320">example_gramps_test</name>
      <copyright change="1662872415">2022 Lewis Anderson Garner Zelinski</copyright>
      <description change="1662872415">The sample example gramps tree with the famous Lewis Anderson Garner Zelinski</de
    <researcher change="1662687426">
      <resname>Alex Roitman,,,</resname>
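
For anyone wanting to experiment with a header like the one above, here is a minimal sketch of reading the values and their `change` timestamps with the Python standard library. The well-formed `<header>` fragment is a simplified assumption modelled on the mock-up, not the final Gramps XML schema, and `read_header` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Well-formed fragment modelled on the mock-up; the enclosing layout
# is an assumption, not the final Gramps XML schema.
SAMPLE = """
<header>
  <created date="2022-09-11" version="5.2.0"/>
  <uuid change="1662872674">04f6ef65-8dee-43f2-bbbc-70769eb6f3a3</uuid>
  <name change="1662869320">example_gramps_test</name>
  <copyright change="1662872415">2022 Lewis Anderson Garner Zelinski</copyright>
</header>
"""


def read_header(xml_text):
    """Return {tag: (text, change_timestamp)} for children with a change attribute."""
    header = ET.fromstring(xml_text)
    result = {}
    for child in header:
        if "change" in child.attrib:
            result[child.tag] = (child.text, int(child.attrib["change"]))
    return result
```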

I think that’s a brilliant idea and your proposal looks great to me.

By the way, I have also used something called a tree UUID for storing multiple trees in a single PostgreSQL database in the SharedPostgreSQL plugin (motivation being of course running multiple Gramps Webs from a single DB at some point in the future). When your proposal is implemented, this could be made the same as your tree UUID (or database UUID)?

Yes it could be that way with the tree uuid perhaps.

At the moment the design assumption in Gramps is 1:1 between a tree and a database. What if someday we want to allow a user to store multiple trees in the same database?

Why do that? Well then perhaps some objects like places, repositories, sources and perhaps media and tags could be managed in a way where they could be optionally shared across trees.

I know this is not likely to happen, it is just one of those crazy thoughts I have as to how things might be done differently.

But given those kinds of thoughts, I wonder if the SharedPostgreSQL plugin should maybe use a different concept such as “instance” instead of “tree”?

Also wouldn’t using a schema per “instance” possibly be a cleaner approach to multi-tenancy for the data?

Some extra tree metadata seems like a good idea.

I can see why a tree UUID, name, description and copyright would be useful. Gedcom has a COPR tag for copyright. Do we need the timestamps to be so fine-grained? Perhaps just one for the whole section?

The schema version is already in the DOCTYPE tag.

I’m not sure why we need to store the dbid in the database or export it to the XML. It is just the local database directory.

Why do we need a database UUID? What do you intend to use it for? Would a unique researcher UUID be more useful?

I can see why a tree UUID, name, description and copyright would be useful.

I am also thinking now maybe “license” and “contributors” may have a place there as well. I am not sure but one could apply a Creative Commons style license to shared genealogy data I should think.

Do we need the timestamps to be so fine-grained?

My thinking is that if a tree is distributed and you want to synchronize the attributes across systems you need to know which attribute is more current. I have not looked at David’s code for the Gramps WebAPI Sync process but imagine he is using the change timestamps on the primary objects for that.
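
To make that concrete, a per-attribute "most recent change wins" merge could look roughly like the sketch below. This is my assumption of how it might work, not how the Gramps WebAPI sync does it, and `merge_attributes` is a hypothetical name.

```python
def merge_attributes(local, remote):
    """Merge two {attribute: (value, change_timestamp)} dicts.

    For each attribute the side with the larger change timestamp wins;
    on a tie the local value is kept.
    """
    merged = dict(local)
    for key, (value, change) in remote.items():
        if key not in merged or change > merged[key][1]:
            merged[key] = (value, change)
    return merged
```

A single timestamp for the whole section could not tell that one system changed the name while the other changed the description.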

The schema version is already in the DOCTYPE tag.

I can pull that.

I’m not sure why we need to store the dbid in the database or export it to the XML. It is just the local database directory.

We perhaps do not need it, but it might be useful if someone has forgotten and is trying to figure out where an export file came from.

Why do we need a database UUID? What do you intend to use it for?

Per above, I am thinking of the database as merely the backing store for a tree that may itself be replicated across systems. In that scenario having a database uuid separate from the tree uuid might be useful. I have not myself thought through exactly how a sync algorithm would work if you had multiple people interacting with the same remote copy of a tree. I figured that if we have one and there is no use for it, that is fine, as the space is negligible; but if there is a need, it would be available.

I am more than willing to modify things as needed if you would like changes made. My primary goal was adding tree attributes, the uuid and timestamp stuff came to mind as potentially useful in that process.

Although the terms “tree” and “database” are currently used interchangeably, I think of a tree as some subset of the persons in my database – in other words, the results of a person filter that I would use in order to export them and their related data. These subsets might be overlapping.

Maybe the ability to set a global filter (perhaps not just in terms of persons), affecting all aspects of the interface, could be a way to implement “multiple trees within a database”.

I suppose I tend to think of that more as a branch, but yes I see how you could manage things that way.

Interesting idea. I guess for that you would extend the proxy database class to enable write through support and then provide a means to select and apply one or more of them to the session. Unsure how performant it would be. Have people discussed that sort of thing in the past?

Can we imagine that the objects overlap? That an object O1 in tree 1 is the same object O2 in tree 2 with different information?

For example to implement personae or the evolutions of a place over time (e.g map of a place in 1700 in tree 1 could be different from the map of the same place in 1900 in tree 2)

Some kind of multidimensional (temporally or not) database?

So a tree becomes a subset of a database, which you can share and sync? Why use a subset instead of the whole database?

Could we expand this discussion to include the potential for leveraging the database metadata information for the Backup Archives created by Gramps?

GEDZIP (part of the prospective GEDCOM7 standard) reserves a META-INF/MANIFEST.MF folder/file combination to avoid conflicts with other file packing standards like Oracle’s JAR File Specification.

That wasn’t what I was thinking per se but is an interesting thought.

For me a tree is a logical abstraction that represents a collection of genealogical data. I use the term because it comes naturally thinking about genealogical data.

So why a subset? Once you start thinking of it that way a few possibilities arise:

The first is around data partitioning. Perhaps you absolutely need to keep a couple of sets of person/family/event/media/note/citation data representing genealogies separated but still want to be able to share places/sources/repositories/tags to avoid having many of them duplicated in multiple databases. How would you do this? One approach might be to treat layer 0 as having global scope, so any objects created in it are visible in all the other layers. The other layers are isolated from each other, but are in turn global to their children. In this approach a family tree represents the union of all the objects in a layer with its ancestors.
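
A minimal sketch of that layering idea, assuming objects are keyed by handle. The `Layer` class is purely hypothetical and nothing like it exists in Gramps today.

```python
class Layer:
    """A layer sees its own objects plus those of all its ancestors."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.objects = {}  # handle -> object

    def visible(self):
        """Union of this layer's objects with its ancestors'; nearer layers win."""
        merged = self.parent.visible() if self.parent else {}
        merged.update(self.objects)
        return merged


# Layer 0 holds shared places; two sibling layers hold separate genealogies.
layer0 = Layer("global")
layer0.objects["P1"] = "Place: Springfield"
tree_a = Layer("tree A", parent=layer0)
tree_a.objects["I1"] = "Person: Garner"
tree_b = Layer("tree B", parent=layer0)
tree_b.objects["I2"] = "Person: Zelinski"
```

Here both trees see `P1`, but neither sees the other's person.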

The second is for recovery points. If a tree is just a collection of pointers to a set of objects, why not apply a copy on write scheme so you can create snapshots? But I agree exports work well for this, and even with this you need exports anyway for backups.

The third use case is hypothetically for branching and merging. I have not thought enough about how it would work, but I feel like with copy on write and snapshot type capabilities there should be a means of devising the ability to create branches and either drop them or possibly merge them back in.

The fourth use case could involve the use of a set of filter rules to extract a set of objects into their own collection or tree. Having the concept of a tree lets you then open that subset that is now a tree to view as a self contained unit.

Having said all that, in the end I admit it is questionable how useful features like that would be for the majority of users who likely would never use or have need of them.

Regarding your earlier question about shared PostgreSQL,

The main reason I decided against that is migrations. I imagine handling a single database is much simpler for things like schema migrations; probably also for backups. Moreover, the underlying SharedDBAPI class can be used for all flavours of SQL, while schemas are specific to Postgres.

Since Gramps accesses all data through the database base classes and there is just a handful of raw SQL statements, I don’t see an elevated security risk.

Commercial databases like Oracle and DB2 have had schemas far longer than Postgres, so I guess you mean within the open source domain.

Yeah, I understand your thinking about migrations. I don’t think backups change one way or the other.

If users also managed their tree locally, and you migrated the central copy, the sync mechanism needs to detect and account for that if they try to do one, and vice versa. I’m guessing you already check for that.

I am still wondering how well that will work scaling-wise. Not a few dozen trees, but once you hit a few thousand, or if such a service became wildly popular, think ten or twenty thousand or more. At 2000 people per tree, with 10000 trees you are at the 20 million mark just for people. What does performance look like? Even more importantly, what does a database migration look like?

Yes, it’s a very interesting question and I might be wrong in my guess about which is better. We can try both and compare :)

Naively I would have expected that, since there are indices for the relevant columns (tree UID, handles), there is no performance impact even if the database is very large.
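
One way to sanity-check that expectation on a small scale is to ask the database for its query plan. The sketch below uses SQLite from the Python standard library rather than PostgreSQL, and the `person` table layout and index name are made up for illustration; they are not the actual SharedPostgreSQL schema.

```python
import sqlite3

# Hypothetical single-table layout with a tree identifier column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (tree_id TEXT, handle TEXT, data BLOB)")
conn.execute("CREATE INDEX idx_person_tree_handle ON person (tree_id, handle)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [(f"tree{t}", f"h{h}", b"") for t in range(50) for h in range(100)],
)

# Ask SQLite how it would look up one person in one tree.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT data FROM person WHERE tree_id = ? AND handle = ?",
    ("tree7", "h42"),
).fetchall()
plan_text = " ".join(row[3] for row in plan)
print(plan_text)
```

If the plan mentions the composite index, the lookup is an index search rather than a full table scan. That says nothing about migrations though, which still touch every row.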

Yes. They may well be useful.

The new database uuid seems to serve a similar purpose to the old dbid (the eight digit directory name in which the database is stored). Perhaps we could combine these two.

The tree uuid is more interesting. It identifies the data within the tree.

Suppose you have a family tree that you export, and then import into a new database. The two databases would have a different database uuid, but we would want them to share a tree uuid to identify that they share the same data. This could be useful for a sync tool. We could assign a new tree uuid when the data is imported.

Now consider the case where we import data from a third tree into the new tree. This tree now contains data from two other trees. One possibility would be to store an import log containing the database uuid and a timestamp for each import.

Then if the tree was merged back into the original, would we want to include information about the third tree in the original database?

I’m not quite sure how this new tree uuid is going to work yet.

Yes it could be that way.

As I see it, on import into an empty database the tree uuid from the import would be preserved. If none was present in the import XML, then the one autogenerated when the database was first opened after creation would be kept and used. That is what my code does now.

If the tree uuid for an import into an existing database matches the existing tree uuid then the user could have a choice of how to handle things. They could merge by favoring the objects with the most recent change timestamp, they could choose to overwrite any existing objects on the import, or they could choose to have any objects that differ during an import have a new handle generated and saved for manual examination/merging later on.
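
The three choices described above could be sketched roughly like this, assuming each object is a dict with a `handle` and a `change` timestamp. `apply_import` and the strategy names are hypothetical, not existing Gramps importer options.

```python
import uuid


def apply_import(store, incoming, strategy):
    """Apply one imported object to store (handle -> object dict) in place."""
    handle = incoming["handle"]
    current = store.get(handle)
    if current is None or current == incoming:
        store[handle] = incoming
    elif strategy == "newest":
        # Keep whichever copy has the most recent change timestamp.
        if incoming["change"] > current["change"]:
            store[handle] = incoming
    elif strategy == "overwrite":
        store[handle] = incoming
    elif strategy == "keep_both":
        # Save the differing copy under a new handle for manual merging later.
        new_handle = uuid.uuid4().hex
        store[new_handle] = dict(incoming, handle=new_handle)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
```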

If the tree uuid for an import into an existing database differs then I think the user would be given a warning to confirm the action, and the option as to whether they want to keep one of the two existing tree uuids or maybe generate a new one for the resulting data set.
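
Putting the empty-database, matching and differing cases together, the decision about which tree uuid survives an import might be sketched as below. `resolve_tree_uuid` and the `choice` values are hypothetical; in practice the choice would come from a confirmation dialog.

```python
import uuid


def resolve_tree_uuid(existing, imported, database_is_empty, choice=None):
    """Decide which tree uuid the database should carry after an import."""
    if imported is None:
        # Nothing in the import file: keep the autogenerated one.
        return existing
    if database_is_empty or imported == existing:
        return imported
    # The uuids differ: the user has to confirm and pick an outcome.
    if choice == "existing":
        return existing
    if choice == "imported":
        return imported
    if choice == "new":
        return str(uuid.uuid4())
    raise ValueError("tree uuids differ; user confirmation is required")
```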

I agree there is a benefit to doing that, keeping an “ancestors list” for the tree. An import/merge/sync would check for a match with any of the tree uuids in the list as well. In addition to the tree uuid and first merge timestamp, we probably want to keep the other tree metadata attributes like name and description around as well.
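
A small sketch of what such an “ancestors list” check could look like, assuming the tree metadata is a plain dict. The field names and both helper functions are hypothetical.

```python
def is_related(local_meta, incoming_uuid):
    """True if the incoming tree uuid matches ours or any recorded ancestor."""
    known = {local_meta["tree_uuid"]}
    known.update(entry["tree_uuid"] for entry in local_meta["ancestors"])
    return incoming_uuid in known


def record_merge(local_meta, incoming_meta, timestamp):
    """Add the incoming tree to the ancestors list the first time it is merged."""
    if is_related(local_meta, incoming_meta["tree_uuid"]):
        return  # already known; the first merge timestamp is left untouched
    local_meta["ancestors"].append({
        "tree_uuid": incoming_meta["tree_uuid"],
        "name": incoming_meta.get("name"),
        "description": incoming_meta.get("description"),
        "first_merge": timestamp,
    })
```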

Yes because it too would then be part of the combined data set, and the “ancestors list” would also be merged/synced so would include it.

Now, if the core data model was extended to add the tree concept to all the objects then this is also how we could potentially enable support for branching and merging back research lines. Branches would start as copies of an existing tree but with a new tree uuid. And if we use some kind of copy on write scheme then we don’t need to actually copy all the objects to create the branch, we only copy the ones that get edited or changed once the branch was created.
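
To illustrate the copy on write idea, here is a toy sketch where a branch only stores the objects that were changed after it was created and falls back to its parent for everything else. The `Tree` class is hypothetical and ignores deletions and merging back.

```python
import uuid


class Tree:
    """Toy copy-on-write tree: a branch stores only the objects it changed."""

    def __init__(self, parent=None):
        self.tree_uuid = str(uuid.uuid4())
        self.parent = parent
        self.own = {}  # handle -> object edited or created in this tree

    def branch(self):
        """Create a branch with a new tree uuid without copying any objects."""
        return Tree(parent=self)

    def get(self, handle):
        tree = self
        while tree is not None:
            if handle in tree.own:
                return tree.own[handle]
            tree = tree.parent
        raise KeyError(handle)

    def put(self, handle, obj):
        self.own[handle] = obj
```

One caveat with this simple version: later edits to the parent remain visible in the branch for any object the branch has not changed itself, so a real snapshot would need the parent to be frozen once a branch exists.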

You very well may be right and it will be fine. I tend to worry about everything, and often times it is for naught.