Introducing UID's for new persons and families

Some years ago, I created a hack to add _UIDs to all persons in my database that I added after my conversion from PAF to Gramps. It worked well, but I lost the code. Today I had a nice conversation with ChatGPT, which wrote the code to create a PAF 5 compatible UID for me. It started by creating one for PAF 4, because I didn’t mention the PAF version number, but after I told it that I wanted code for version 5, it did the job. Very nice!

But this brings up an old question: Can we discuss adding _UIDs to new persons and families in, say, Gramps 5.2 or 5.3? These UIDs are a standard in commercial programs, are also supported by Ancestry, and are the easiest means of merging data after it has made a round trip from Gramps to Ancestry and back.

Our current import merge can’t do this, because it matches on handles, which don’t appear in GEDCOM exports. Imports from Ancestry (or other programs) generate new handles, so as far as I’m concerned, the only thing that you can rely on is the good old UID, or _UID. It’s an official tag in GEDCOM 7, called UID, and in that version, it even shows up in events.

I also found that most of the links to this subject on our wiki stopped working:

https://www.gramps-project.org/wiki/index.php/GEPS_009:_Import_Export_Merge#UID.2C_GUID_and_UID.2C_what_is_needed_in_Gramps.3F

Here’s part of the conversation:

@prculley already created a pull request for UID support:

OK, thanks. I noticed that #1000 does exactly what I want, so I will try that code in my fork.

In the code that I lost, I had a modified tool that checked for persons that had no _UID attribute and added one. And although I can understand Nick’s objections to letting the GEDCOM export add the missing _UIDs, it is a perfect hack for me.

What is the _UID made of? (I have no idea about it.)

If it is created like a UUID, it is strongly dependent on the time of creation and not on person/family properties (it is not a hash for them). Then it is conceptually the same as current “handle”. Apart from the checksum addition I see no advantage or difference with the present situation. The only bonus is an additional compatibility with GEDCOM.

Now suppose you have entered a “Given Surname” person in your Gramps tree with a _UID. Another researcher, independently from you, has committed a GEDCOM tree with _UIDs. You decide to import this latter tree to merge it with yours. Since _UIDs are time-based, they won’t help in identifying duplicates.

Eventually, since there is nothing magical in “handles”, _UIDs could replace handles everywhere in Gramps (for new objects only, to avoid breaking existing relationships).

Yes, it’s a uuid with a checksum. Here’s the code that ChatGPT wrote for me:

import uuid

def generate_paf5_uid():
    uid = str(uuid.uuid4()).replace('-', '').upper()
    checksum = calculate_checksum(uid)
    return uid[:32] + checksum

def calculate_checksum(uid):
    # Calculate the checksum based on the first 32 characters of the UID
    uid_without_checksum = uid[:32]
    total = 0
    for i in range(len(uid_without_checksum)):
        # Weight each hex digit by its 1-based position in the string
        total += int(uid_without_checksum[i], 16) * (i+1)
    # Encode the 16-bit result as 4 uppercase hex digits
    checksum = (total % 65536).to_bytes(2, byteorder='big').hex().upper()
    return checksum

# Example usage:
paf5_uid = generate_paf5_uid()
print(paf5_uid) # Output: "35B341D79E1B4712AE6920B92108C9A2D7825E8A29F1097DE063FE28748D47E5C24B5D5B5F5E5D5F862C8B249B02B43"

Note that the bot insisted that the length of the example output was right, even though it’s 95 characters and it should be 36. That’s why I eventually told it to report to its maker.
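The expected length of 36 follows from the pieces: a UUID is 32 hex digits, and a 2-byte checksum adds 4 more. A minimal sketch of that arithmetic, where the checksum value `0xABCD` is only a placeholder and not a real checksum:

```python
import uuid

# 16 random bytes rendered as 32 uppercase hex digits
base = uuid.uuid4().hex.upper()

# Any 2-byte checksum renders as 4 hex digits; 0xABCD is a placeholder value
checksum_hex = (0xABCD).to_bytes(2, byteorder="big").hex().upper()

print(len(base), len(checksum_hex), len(base + checksum_hex))  # prints: 32 4 36
```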

You can see what it does when you paste the above code into a Python 3 prompt, and you can see similar code in Paul Culley’s repo, branch uid, or uid2.

Like the handles used in our own import & merge, UIDs only help to find duplicates of your own creations, as in my scenario: uploading to Ancestry, adding some new data from hints (which it now also gets from Geneanet), and then downloading an Ancestry GEDCOM. Ancestry exports those as UID, without the underscore, but they are in the same format.

Note that GEDCOM 7 redefined the tag to UID, and advises using the standard UUID format.

Here is a description of Gedcom (U)UID including PAF’s UID:

Yes, they are time dependent in general, depending on their version:

I just completed a simple tool that looks for persons without a _UID, lists those, and creates one for each, when the user is OK with that, using the code shown above. I did not add automatic creation for new persons yet, because I was in a lazy mood, and I didn’t bother with families either. UIDs for persons are good enough to deal with Ancestry for now.
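The idea behind such a tool can be sketched without the Gramps API. In this rough illustration, persons are plain dictionaries with an `attributes` list, and `confirm` is a hypothetical callback that stands in for asking the user. All of these names are assumptions for the example, and the UID here is a bare uuid4 without the checksum:

```python
import uuid

def new_uid():
    # Bare 32-digit UID; a checksum step could be appended here
    return uuid.uuid4().hex.upper()

def find_people_without_uid(people):
    # A person counts as "missing" when no attribute of type _UID is present
    return [p for p in people
            if not any(a["type"] == "_UID" for a in p["attributes"])]

def add_missing_uids(people, confirm):
    # List the persons first, and only change them when the user agrees
    missing = find_people_without_uid(people)
    if missing and confirm(missing):
        for person in missing:
            person["attributes"].append({"type": "_UID", "value": new_uid()})
    return missing

people = [
    {"name": "Person A", "attributes": [{"type": "_UID", "value": "X" * 32}]},
    {"name": "Person B", "attributes": []},
]
changed = add_missing_uids(people, confirm=lambda missing: True)
print([p["name"] for p in changed])  # prints: ['Person B']
```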

Since _UIDs have no significance per se (you don’t use the internal structure of UUIDs) and only aim to provide a unique id for the record, why not start from the handle, which is guaranteed to be unique?

Truncate or expand it to 32 characters, add the 2-byte checksum and you have a _UID. Since all primary objects have handles, you can repeat the process for families, events, … Thus the _UIDs come for free and can be related to their objects.
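As a rough sketch of that suggestion, assuming the handle consists of hex digits only (which may not hold for handles imported from other sources) and reusing the checksum scheme from the ChatGPT code earlier in the thread. The name `uid_from_handle` and the example handle are made up:

```python
def calculate_checksum(uid):
    # Same scheme as the ChatGPT code: position-weighted sum of hex digits
    total = 0
    for position, digit in enumerate(uid[:32], start=1):
        total += int(digit, 16) * position
    return (total % 65536).to_bytes(2, byteorder="big").hex().upper()

def uid_from_handle(handle):
    # Pad with zeros or truncate so the base is exactly 32 characters
    base = handle.upper().ljust(32, "0")[:32]
    return base + calculate_checksum(base)

example_handle = "d8f2a1b3c4e5f60718293a4b5c"  # made-up, hex-only handle
print(uid_from_handle(example_handle))
```

Truncating a longer handle, or padding with zeros, can in principle make two different handles collide, so this is only as unique as the first 32 characters are.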

You are right, and we even have a function for that:

def create_uid(self, handle=None):
    if handle:
        uid = uuid.uuid5(GRAMPS_UUID, handle)
    else:
        uid = uuid.uuid4()
    return uid.hex.upper()

It uses the handle if it exists, and otherwise it falls back to uuid4, which is also what the function suggested by ChatGPT uses before it appends the checksum.
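A standalone version of that function shows the practical difference between the two branches: uuid5 is deterministic for a given namespace and name, while uuid4 is random. `EXAMPLE_NAMESPACE` below is a made-up stand-in for `GRAMPS_UUID`, whose real value is not shown here, and the handle is invented:

```python
import uuid

# Hypothetical namespace; the real GRAMPS_UUID constant is defined elsewhere
EXAMPLE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")

def create_uid(handle=None):
    if handle:
        # Name-based: the same handle always yields the same UID
        uid = uuid.uuid5(EXAMPLE_NAMESPACE, handle)
    else:
        # Random: every call yields a different UID
        uid = uuid.uuid4()
    return uid.hex.upper()

handle = "d8f2a1b3c4e5f60718293a4b5c"  # made-up handle
print(create_uid(handle) == create_uid(handle))  # prints: True
print(create_uid() == create_uid())              # prints: False
```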

Note that, when you import the same object back into your tree, like from an earlier backup, it will get a new handle, because handles must be unique, but the UID must be left alone, because it represents the original object.

In a way, the whole purpose of the UID is that it is not tied to anything else that can change, whereas the handle is changed to avoid duplicates. In that case, only the UID can tell me that I imported a copy of something that I already had.
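To illustrate, here is a small sketch of spotting re-imported copies by UID when the handles differ. Records are plain dictionaries with made-up `handle` and `uid` keys. This is not how the Gramps import merge works today, only the idea:

```python
def find_reimported_copies(existing, imported):
    # Index the records already in the tree by their UID
    by_uid = {p["uid"]: p for p in existing if p.get("uid")}
    # An imported record is a copy when its UID is already known,
    # regardless of the new handle it received on import
    return [(by_uid[p["uid"]], p) for p in imported if p.get("uid") in by_uid]

existing = [{"handle": "h1", "uid": "AAA"}, {"handle": "h2", "uid": "BBB"}]
imported = [{"handle": "h7", "uid": "AAA"},   # copy of h1 under a new handle
            {"handle": "h8", "uid": "CCC"},   # genuinely new record
            {"handle": "h9"}]                 # no UID, cannot be matched
for original, copy in find_reimported_copies(existing, imported):
    print(original["handle"], "<-", copy["handle"])  # prints: h1 <- h7
```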

That PR does what Nick wanted, but I don’t think it is the right answer for the long run. The UID for GEDCOM 7 has some different rules that would require changes in that PR. If/when we get serious about GEDCOM 7 support (and doing the core object upgrades to support it) we might be able to revisit this.

That’s right, and I am with you on that. I remade the tool to add the _UIDs because I had the time, and ChatGPT, and it can be a big time saver for me, but it’s not fit for GEDCOM 7. I must add that I see no reason for a fast migration yet, because the other programs that I use to complement Gramps still speak 5.5.1 quite well.

It would be very beneficial if Gramps could support UIDs. Gramps is one of the few programs that do not yet support UID.

In the past there were different methods to generate a UID. In GEDCOM 7 it has now been clarified that one should use an ordinary UUID, according to RFC 4122: https://github.com/FamilySearch/GEDCOM/blob/main/specification/gedcom-03-datamodel.md#uid-unique-identifier-g7uid
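For comparison, a 36-character PAF-style _UID could be turned into the hyphenated RFC 4122 text form by dropping the 4 checksum digits. Whether that is the right migration path for GEDCOM 7 is an assumption here, not something the spec or this thread settles, and `to_rfc4122_text` is a made-up name:

```python
import uuid

def to_rfc4122_text(legacy_uid):
    # Keep the first 32 hex digits, ignore any trailing checksum digits
    return str(uuid.UUID(hex=legacy_uid[:32]))

legacy = "35B341D79E1B4712AE6920B92108C9A2" + "D782"  # 32 digits + 4 checksum digits
print(to_rfc4122_text(legacy))  # prints: 35b341d7-9e1b-4712-ae69-20b92108c9a2
```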

With the UID, cross-tree connections also become possible: GEDBAS: Ernestine Caroline KÖHLER contains a link to a different file based on the UID.

It would therefore also be good to be able to enter a UID for a person by hand.

I know, and that’s why I changed GEDCOM import in such a way that _UID lines are stored in attributes, so that they don’t get lost when you import data from other programs. You can see that, when you import data from Ancestral Quest, Legacy, PAF, RootsMagic, and many others.

And because _UIDs are in attributes, you can also add them yourself, if you want, and they will then be exported too.

Should we also be exploring a variant of the GEDCOM 7 tag “EXID” that replaces AFN, RFN, RIN in 5.5.1? The spec calls for it to hold the URI of the authority for the identifier in the EXID.type payload.

Of course, these EXID.types are intended for the standard public online genealogy databases: FamilySearch, WikiTree, FindAGrave and so forth. But it seems like the approach could be adapted to a LOCAL file or JSON Note embedded in the Tree database for private UUIDs.

Given the prevalence of linkrot over a 20-year period, I have very serious misgivings about the sustainability of the proposed URL approach. There is a lot of risk in creating an external dependency.

But couldn’t the URI for that EXID.type payload also be stored as a GRAMPS:// URI with an internal handle to a Note of JSON type? That Note could include a fallback to an external URL. In addition, there could be an interface to write the content of the JSON Note that describes the particular UUID system.

One of the spots where the spec seems to fall short is in providing a direct access URL for the online database. The JSON for FindAGrave includes a URL for their cemetery search webform: https://www.findagrave.com/cemetery . But it could have a placeholder pattern that leverages the EXID value: https://www.findagrave.com/cemetery/[EXID.value] so that the FindAGrave Cemetery EXID.value 181852 would be substituted into the URL to give https://www.findagrave.com/cemetery/181852
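The substitution itself would be trivial. A minimal sketch, where the `[EXID.value]` placeholder syntax is only the proposal from this post and not part of the GEDCOM 7 spec, and `expand_exid_url` is a made-up name:

```python
from urllib.parse import quote

def expand_exid_url(template, exid_value):
    # Percent-encode the value so it cannot break the URL structure
    return template.replace("[EXID.value]", quote(exid_value, safe=""))

template = "https://www.findagrave.com/cemetery/[EXID.value]"
print(expand_exid_url(template, "181852"))
# prints: https://www.findagrave.com/cemetery/181852
```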

Yes, the Gedcom IDENTIFIER_STRUCTURE contains UID, EXID and REFN. We can probably include all these in our primary objects.

We can now use the Gedcom 7 specification as the basis for our design. I’ll have a look at this again.