Introducing UID's for new persons and families

ennoborg · April 9, 2023, 4:43pm

Some years ago, I created a hack to add _UID’s to all persons in my database that I added after my conversion from PAF to Gramps. It worked well, but I lost the code, and today I had a nice conversation with ChatGPT, which wrote the code to create a PAF 5 compatible UID for me. It started creating one for PAF 4, because I didn’t mention the PAF version number, but after I told it that I wanted code for version 5, it did the job. Very nice!

But this brings up an old question: Can we discuss adding _UID’s to new persons and families in, say, Gramps 5.2 or 5.3? These UID’s are a standard in commercial programs, are also supported by Ancestry, and are the easiest means to merge data after it’s made a round-trip from Gramps to Ancestry and back.

Our current import merge can’t do this, because it matches on handles, which don’t appear in GEDCOM exports. Imports from Ancestry (or other programs) generate new handles, so as far as I’m concerned, the only thing that you can rely on is the good old UID, or _UID. It’s an official tag in GEDCOM 7, called UID, and in that version, it even shows up in events.

I also found that most of the links to this subject on our wiki stopped working:

https://www.gramps-project.org/wiki/index.php/GEPS_009:_Import_Export_Merge#UID.2C_GUID_and_UID.2C_what_is_needed_in_Gramps.3F

ennoborg · April 9, 2023, 5:02pm

Here’s part of the conversation:

Mattkmmr · April 9, 2023, 5:25pm

@prculley already created a pull request for UID support:

github.com/gramps-project/gramps

Better support for GEDCOM _UID

gramps-project:master ← prculley:uid2

opened 05:49PM - 22 Jan 20 UTC

prculley

+4876 -3272

As a result of Nick's comments in #1002, I closed that PR and #1000 to create this more comprehensive PR.

- This adds a uid_list to both Person and Family objects. A single uid is attached to the object at creation (which required me to replace it on import to avoid an unnecessary extra uid).
- As a result, it requires a database upgrade, for which code is included. If no uids were found in attributes, a single uid is added to Persons and Families.
- XML import/export and schema have been updated to deal with the new uid lists.
- The XML import moves any _UID attributes found in Persons or Families to the new uid list.
- GEDCOM import/export also now stores _UID tags in the uid list and exports from there.
- The FindDups tool now uses the uid list to match up people with very high confidence.
- The FindDups tool also has a new "Very High" threshold level (the "Very High" was already in the gramps.po). This higher level will skip the more time consuming second pass of the tool if persons matching via uid are found.
- Carefully updated the import/export tests.

Comments: When attempting to merge in a GEDCOM that contained _UID for the people, with a tree that already had some of the same people, I noted that our 'Find Possible Duplicate People' was ignoring the _UID entries. The tree also happened to have a set of actual duplicates which I could not get to show up. On investigation, I saw that the tool did not display all possible duplicates for a single person, only the one it thought had the highest score. And it did not look at _UID attributes at all.

This PR contains several commits:

- a change to GEDCOM import to remove the find_family_from_handle method; this was required because it was messing up the GEDCOM import by adding an additional uid.
- the basic UID support
- a pylint on finddups
- the finddups patch to allow all duplicate pairs to show up
- the finddups patch to allow matching uid on persons to give the highest score (10).

**Note:** this PR is based on top of the #1010 database upgrade PR, as that is necessary to implement the upgrade. When looking at code, look at the individual commits.

ennoborg · April 9, 2023, 9:18pm

OK, thanks. I noticed that #1000 does exactly what I want, so I will try that code in my fork.

In the code that I lost, I had a modified tool that checked for persons that had no _UID attribute and added one. And although I can understand Nick’s objections to letting the GEDCOM export add the missing _UID’s, it is a perfect hack for me.

pgerlier · April 10, 2023, 8:57am

What is the _UID made of? (I have no idea about it)

If it is created like a UUID, it is strongly dependent on the time of creation and not on person/family properties (it is not a hash for them). Then it is conceptually the same as current “handle”. Apart from the checksum addition I see no advantage or difference with the present situation. The only bonus is an additional compatibility with GEDCOM.

Now suppose you have entered a “Given Surname” person in your Gramps tree with a _UID. Another researcher, independently from you, has committed a GEDCOM tree with _UID. You decide to import this latter tree to merge it with yours. Since _UID are time-based, they won’t help in identifying duplicates.

Eventually since there is nothing magical in “handles”, _UID could replace handles everywhere in Gramps (for new objects to avoid breaking existing relationships).

ennoborg · April 10, 2023, 10:32am

Yes, it’s a uuid with a checksum. Here’s the code that ChatGPT wrote for me:

import uuid

def generate_paf5_uid():
    uid = str(uuid.uuid4()).replace('-', '').upper()
    checksum = calculate_checksum(uid)
    return uid[:32] + checksum

def calculate_checksum(uid):
    # Calculate the checksum based on the first 32 characters of the UID
    uid_without_checksum = uid[:32]
    sum = 0
    for i in range(len(uid_without_checksum)):
        sum += int(uid_without_checksum[i], 16) * (i+1)
    checksum = (sum % 65536).to_bytes(2, byteorder='big').hex().upper()
    return checksum

# Example usage:
paf5_uid = generate_paf5_uid()
print(paf5_uid) # Output: "35B341D79E1B4712AE6920B92108C9A2D7825E8A29F1097DE063FE28748D47E5C24B5D5B5F5E5D5F862C8B249B02B43"

Note that the bot insisted that the length of the example output was right, even though it’s 95 characters, and it should be 36. That’s why I eventually told it to report to its maker.

You can see what it does when you paste the above code into a Python 3 prompt, and you can see similar code in Paul Culley’s repo, branch uid, or uid2.

Like the handles used in our own import & merge, UID’s only help to find duplicates of your own creations, like in my scenario of uploading to Ancestry, adding some new data from hints, which it now also gets from Geneanet, and then downloading an Ancestry GEDCOM. Ancestry exports those as UID, without the underscore, but they are the same format.
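The round-trip scenario above can be sketched roughly as follows. This is only an illustration, assuming records are plain dictionaries rather than Gramps objects; the function names `normalize_uid` and `match_by_uid`, the record layout, and the sample UIDs are all hypothetical. It shows how a UID that survives export and re-import can pair records whose handles differ.

```python
def normalize_uid(value):
    """Compare UIDs case-insensitively, ignoring hyphens and surrounding spaces."""
    return value.strip().replace("-", "").upper()


def match_by_uid(local_people, imported_people):
    """Pair imported records with local ones that carry the same UID.

    Both arguments are lists of dicts with a "uid" key (hypothetical layout).
    Returns (matches, unmatched), where matches holds (local, imported) pairs.
    """
    index = {normalize_uid(p["uid"]): p for p in local_people if p.get("uid")}
    matches, unmatched = [], []
    for person in imported_people:
        uid = person.get("uid")
        local = index.get(normalize_uid(uid)) if uid else None
        if local is not None:
            matches.append((local, person))
        else:
            unmatched.append(person)
    return matches, unmatched


# Made-up sample data: the handles differ after the round-trip, the UID does not.
local = [
    {"handle": "h1", "name": "Jan Jansen", "uid": "00112233445566778899AABBCCDDEEFF1234"},
    {"handle": "h2", "name": "Piet Pietersen", "uid": "FFEEDDCCBBAA99887766554433221100ABCD"},
]
imported = [
    {"handle": "x9", "name": "Jan Jansen", "uid": "00112233445566778899aabbccddeeff1234"},
    {"handle": "x10", "name": "New Person", "uid": None},
]
matches, unmatched = match_by_uid(local, imported)
```

Records without a UID, such as people added on the other side, end up in `unmatched` and still need the usual duplicate search.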

Note that GEDCOM 7 redefined the tag to UID, and advises using the standard uuid format.

PLegoux · April 10, 2023, 10:37am

Here is a description of Gedcom (U)UID including PAF’s UID:

Yes, they are time dependent in general, depending on their version:

ennoborg · April 10, 2023, 10:45pm

I just completed a simple tool that looks for persons without _UID, lists those, and creates one for each, when the user is OK with that, using the code shown above. I did not add an automatic creation for new persons yet, because I was in a lazy mood, and didn’t bother about families either. UID’s for persons are good enough to deal with Ancestry now.
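A tool like the one described could look roughly like this. It is not the actual tool: it assumes each person is a plain dictionary with an `attributes` list of `(key, value)` pairs, and the function names are hypothetical. The default generator is a bare `uuid4` hex string without a checksum, so a different generator can be passed in if a checksum is wanted.

```python
import uuid


def has_uid(person):
    """True if the person already carries a _UID attribute."""
    return any(key == "_UID" for key, _value in person["attributes"])


def people_without_uid(people):
    """List the persons that still lack a _UID, so the user can review them."""
    return [person for person in people if not has_uid(person)]


def add_missing_uids(people, make_uid=lambda: uuid.uuid4().hex.upper()):
    """Add a _UID attribute to every person lacking one; return those persons."""
    missing = people_without_uid(people)
    for person in missing:
        person["attributes"].append(("_UID", make_uid()))
    return missing


# Made-up sample data in the assumed layout.
people = [
    {"name": "Jan Jansen", "attributes": [("_UID", "00112233445566778899AABBCCDDEEFF1234")]},
    {"name": "Piet Pietersen", "attributes": []},
]
to_review = people_without_uid(people)   # step 1: list the candidates
updated = add_missing_uids(people)       # step 2: run once the user agrees
```

Keeping the listing and the update as separate steps mirrors the "list first, then create when the user is OK with that" flow.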

pgerlier · April 11, 2023, 8:10am

Since _UIDs have no significance per se (you don’t use the internal structure of UUIDs) and only aim to provide a unique id for the record, why not start from the handle, which is guaranteed to be unique?

Truncate or expand it to 32 characters, add the 2-byte checksum and you have a _UID. Since all primary objects have handles, you can repeat the process for families, events, … Thus the _UIDs come for free and can be related to their objects.

ennoborg · April 11, 2023, 12:23pm

You are right, and we even have a function for that:

def create_uid(self, handle=None):
    if handle:
        uid = uuid.uuid5(GRAMPS_UUID, handle)
    else:
        uid = uuid.uuid4()
    return uid.hex.upper()

It uses the handle if it exists, and otherwise it defaults to uuid4, which is the same UUID source that the function suggested by ChatGPT uses (that function then appends a checksum).

Note that, when you import the same object back into your tree, like from an earlier backup, it will get a new handle, because handles must be unique, but the UID must be left alone, because it represents the original object.

The whole point of the UID is that it is not tied to anything else that can change, whereas the handle is changed to avoid duplicates. In that case, only the UID can tell me that I imported a copy of something that I already had.
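To make the difference between handle and UID concrete, here is a standalone sketch of the same idea. The namespace constant is a stand-in, since the real `GRAMPS_UUID` value is not shown in this thread, and the dictionary records are hypothetical. It shows that a handle-derived UID is reproducible, and that a re-imported copy keeps its stored UID even though its handle changes.

```python
import uuid

# Stand-in namespace for illustration only; the real GRAMPS_UUID constant
# lives in Gramps and is not reproduced here.
EXAMPLE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")


def create_uid(handle=None):
    """Derive a UID from the handle if given, otherwise create a random one."""
    if handle:
        uid = uuid.uuid5(EXAMPLE_NAMESPACE, handle)
    else:
        uid = uuid.uuid4()
    return uid.hex.upper()


first = create_uid("handle-one")
again = create_uid("handle-one")   # same handle gives the same UID
other = create_uid("handle-two")   # different handle gives a different UID

# The UID is stored once with the record. A re-imported copy gets a new
# handle, but keeps the UID it was exported with.
original = {"handle": "handle-one", "uid": first}
reimported = dict(original, handle="handle-new")
is_copy = reimported["uid"] == original["uid"]
```

The UID is computed once and stored, not recomputed from the current handle, so it keeps identifying the original object.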

prculley · April 11, 2023, 4:22pm

That PR does what Nick wanted, but I don’t think it is the right answer for the long run. The UID for GEDCOM 7 has some different rules that would require changes in that PR. If/when we get serious about GEDCOM 7 support (and doing the core object upgrades to support it) we might be able to revisit this.

ennoborg · April 11, 2023, 4:31pm

That’s right, and I am with you on that. I remade the tool to add the _UID’s because I had the time, and ChatGPT, and it can be a big time saver for me, but it’s not fit for GEDCOM 7. I must add that I see no reason for a fast migration yet, because the other programs that I use to complement Gramps still speak 5.5.1 quite well.

jze · April 23, 2023, 9:06am

It would be very beneficial if Gramps could support UIDs. Gramps is one of the few programs that do not yet support UID.

In the past there were different methods to generate a UID. In GEDCOM 7 it has now been clarified that one should use ordinary UUID, according to RFC 4122: https://github.com/FamilySearch/GEDCOM/blob/main/specification/gedcom-03-datamodel.md#uid-unique-identifier-g7uid
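As a rough illustration of the difference in format, the sketch below creates a UUID in the usual hyphenated text form and converts an older 36-hex-digit _UID (32 digits plus a 4-digit checksum) by dropping the checksum. The function names are hypothetical, and the conversion is an assumption made for the example, not something the thread or the specification prescribes.

```python
import uuid


def new_gedcom7_uid():
    """Create a random (version 4) UUID in its standard text form."""
    return str(uuid.uuid4())


def legacy_uid_to_uuid(legacy_uid):
    """Turn a 32- or 36-hex-digit _UID into the hyphenated UUID text form.

    Assumption: when 36 digits are given, the last 4 are a checksum and
    can be dropped.
    """
    hex_digits = legacy_uid.strip().replace("-", "")
    if len(hex_digits) not in (32, 36):
        raise ValueError("expected 32 or 36 hexadecimal digits")
    return str(uuid.UUID(hex=hex_digits[:32]))


converted = legacy_uid_to_uuid("00112233445566778899AABBCCDDEEFF1234")
fresh = new_gedcom7_uid()
```

Going the other way would mean recomputing the checksum, which is left out here.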

With the UID, cross-tree connections also become possible: GEDBAS: Ernestine Caroline KÖHLER contains a link to a different file based on the UID.

It would therefore also be good to be able to enter a UID for a person by hand.

ennoborg · April 23, 2023, 6:13pm

I know, and that’s why I changed GEDCOM import in such a way that _UID lines are stored in attributes, so that they don’t get lost when you import data from other programs. You can see that, when you import data from Ancestral Quest, Legacy, PAF, RootsMagic, and many others.

And because _UID’s are in attributes, you can also add them yourself, if you want, and they will then be exported too.

emyoulation · April 23, 2023, 6:57pm

Should we also be exploring a variant of the GEDCOM7 tag “EXID” that replaces AFN, RFN, RIN in 5.5.1? The spec calls for it to hold the URI of the authority for the identifier in the EXID.type payload.

Of course, these EXID.types are intended for the standard public online genealogy databases: FamilySearch, WikiTree, FindAGrave and so forth. But it seems like the approach could be adapted to a LOCAL file or JSON Note embedded in the Tree database for private UUIDs.

Given the prevalence of linkrot over a 20 year period, I have very serious misgivings about the sustainability of the proposed URL approach. There is a lot of risk in creating an external dependency.

But couldn’t the URI for that EXID.type payload also be stored as a GRAMPS:// with an internal handle to a Note of JSON type? Which could include a fallback to an external URL? In addition, there could be an interface to write the content of the JSON Note that described the particular UUID system.

One of the spots where the spec seems to fall short is in providing a direct access URL for the online database. The JSON for the FindAGrave includes a URL for their cemetery search webform: https://www.findagrave.com/cemetery . But it could have a placeholder pattern that leverages the EXID value: https://www.findagrave.com/cemetery/[EXID.value] so that FindAGrave Cemetery EXID.value 181852 would substitute the value into the URL to find https://www.findagrave.com/cemetery/181852
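The placeholder idea can be sketched in a few lines. This is only an illustration of the substitution; the `[EXID.value]` placeholder syntax follows the example above, and the function name `build_exid_url` is hypothetical.

```python
from urllib.parse import quote

PLACEHOLDER = "[EXID.value]"


def build_exid_url(template, exid_value):
    """Substitute an EXID value into a URL template containing the placeholder."""
    if PLACEHOLDER not in template:
        raise ValueError("template has no [EXID.value] placeholder")
    # Percent-encode the value so unusual characters cannot break the URL.
    return template.replace(PLACEHOLDER, quote(str(exid_value), safe=""))


url = build_exid_url("https://www.findagrave.com/cemetery/[EXID.value]", "181852")
print(url)
```

The template itself would still have to be stored somewhere, such as in the JSON Note suggested above.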

Nick-Hall · April 23, 2023, 11:08pm

Yes, the Gedcom IDENTIFIER_STRUCTURE contains UID, EXID and REFN. We can probably include all these in our primary objects.

Nick-Hall · April 23, 2023, 11:21pm

We can now use the Gedcom 7 specification as the basis for our design. I’ll have a look at this again.

DavidMStraub · June 12, 2025, 10:57am

Very old thread I know, but since I’m working on a Gedcom 7 import library, I bumped into it again.

In Gedcom 7, the following objects can have a UID:

Family
Individual
Multimedia
Repository
Shared Note
Source
Submitter

But is it actually true that Gramps does not support a UID? In my opinion, this is the perfect use of attributes. The only thing we need to do IMHO is to define a default key for UUIDs and make sure the importers and exporters respect it.

The only small downside is that Gramps currently doesn’t allow attributes on repositories and (shared) notes, which I think it could/should, but honestly I think UIDs are mostly relevant for people and, perhaps, families. (Or places, but that’s not supported by Gedcom!)

Nick-Hall · June 12, 2025, 1:28pm

In Gedcom 7.0, an event detail structure can have UID tags and a place structure can have EXID tags.

DavidMStraub · June 12, 2025, 1:46pm

I don’t see that in the spec, am I missing something?

(EXID is not a UID)
