Are GEDCOM imports reversible dialects?

emyoulation · May 16, 2023, 6:48pm

When the Gramps GEDCOM importer plugin sees a custom tag, it adds the unrecognized content as a Note.

Is this content reversible?

That is to say, will the GEDCOM Export write the custom tag back to the same place in the structured data?

WikiTree is an example. They break out a Middle name in their internal data structure and do not include that name in the standard GIVN (Given Name) value. Instead, they write a _MIDN custom tag with that portion of the GIVN data.

This content is preserved as a Note for the Person. But is there enough context preserved to restore the data? Or even to do meaningful post-processing of the ambiguous data?

I don’t see how, since the Note is not attached as a secondary object to the object the tag came from. (The Note is attached to the Person object, not to the more appropriate ‘preferred Name’ secondary object of that Person object.)

In the example of a WikiTree GEDCOM being imported, the “GEDCOM import” type Note is attached to the person with the content:

Records not imported into INDI (individual) Gramps ID I000003:

Line ignored as not understood                                      Line    93: 2 _MIDN Franklin

Meanwhile, the original context in the “WikiTree_McCulloughWm1775.ged” GEDCOM source file starts at line 90 and runs through line 155:

0 @I3@ INDI
1 NAME William Franklin /McCullough/
2 GIVN William
2 _MIDN Franklin
2 SURN McCullough
2 NSFX Sr
1 SEX M
1 BIRT
2 DATE 24 May 1841
2 PLAC Westmoreland, Pennsylvania, United States
1 DEAT
2 DATE 4 Sep 1917
2 PLAC 136 Boyles ave, New Castle, Lawrence, Pennsylvania, United States
1 WWW https://www.WikiTree.com/wiki/McCullough-1311 
1 FAMC @F2@
1 FAMS @F3@
1 NOTE == Biography ==
2 CONT 
2 CONT '''William F. McCullough, Sr.'''
2 CONT

Lines 108-147 are just concatenated lines from the Biography. Continuing with line 148:

2 CONC  />Lawrence county, Pennsylvania
2 CONT * {{FindAGrave|159862323}} for Wm. F. McCullough, Sr.
1 REFN 9567903
2 TYPE wikitree.user_id
1 REFN 10072967
2 TYPE wikitree.page_id
1 REFN 60
2 TYPE wikitree.privacy
0 @I4@ INDI
1 NAME Hollis Rushton /McCullough/

The Note about the misunderstood level ‘2’ Tag is attached to the level ‘0’ Individual but is actually secondary to the level 1 “NAME” for that individual. The Individual attachment is known in the imported data but the NAME attachment is lost.

The Gramps ID for the Individual is I0003 … but that correlation was contingent on being imported into an empty Tree. There is no direct reference to @I3@ of the GEDCOM.

Further, there are 3 REFN Attributes (9567903, 10072967, 60) added to Person for the final parts of the GEDCOM segment for this individual. But the wikitree.user_id, wikitree.page_id, and wikitree.privacy TYPE context is completely lost for those 3 REFNs.
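To make the lost context concrete, here is a minimal sketch of how a reader could track the enclosing records while scanning a GEDCOM file, so that each custom tag is reported with its full path (for example INDI @I3@, then NAME, then _MIDN) and its source line number. This is not how the Gramps importer works; the function name `custom_tag_context` and the regular expression are assumptions made for illustration, and malformed lines are simply skipped.

```python
import re

# Generic GEDCOM line shape: level, optional @xref@, tag, optional value.
LINE_RE = re.compile(r"^\s*(\d+)\s+(?:(@[^@]+@)\s+)?(\S+)(?: (.*))?$")


def custom_tag_context(lines):
    """Yield (line_number, path, value) for every underscore-prefixed tag.

    `path` lists the enclosing records, e.g. ['INDI @I3@', 'NAME', '_MIDN'].
    """
    stack = []  # stack[i] is the label of the currently open record at level i
    for number, raw in enumerate(lines, start=1):
        match = LINE_RE.match(raw.rstrip("\r\n"))
        if not match:
            continue  # blank or malformed line: skipped in this sketch
        level, xref, tag, value = match.groups()
        label = f"{tag} {xref}" if xref else tag
        del stack[int(level):]  # close records at this level or deeper
        stack.append(label)
        if tag.startswith("_"):
            yield number, list(stack), value or ""


sample = """0 @I3@ INDI
1 NAME William Franklin /McCullough/
2 GIVN William
2 _MIDN Franklin
2 SURN McCullough
1 SEX M""".splitlines()

hits = list(custom_tag_context(sample))
for number, path, value in hits:
    print(number, " > ".join(path), value)
```

With a path like this stored in the import Note, a post-processing tool would know that `_MIDN` belonged to the NAME of `@I3@` rather than to the individual as a whole.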

emyoulation · May 16, 2023, 6:54pm

Reversing the import to a WikiTree dialect export is probably a pipe dream.

But adding more context to allow a post-processing tool to do meaningful work seems reasonable.

ennoborg · May 17, 2023, 11:41am

No. Why would you want to do that?

Post-processing would be nice, but I wonder how that can be done in a meaningful way. And that also triggers the question: What do you see as meaningful work?

ennoborg · May 17, 2023, 11:47am

Here’s another thought about this:

Ancestry exports UID’s with a UID tag, not _UID, like it should, and when Gramps sees those, it creates UID events. They’re not events, because they have no dates, and they look pretty stupid to me. Post-processing could detect such events, and convert them to attributes.

Those REFN’s would also be perfect candidates for attributes, don’t you think?

emyoulation · May 17, 2023, 2:38pm

In a collaborative sense, it would be building tools that compare & contrast trees by different researchers to make a worklist. (Which probably means bringing data in from different programs without merging them. So post-processing workflows to harmonize GEDCOM dialect quirks will probably be needed repeatedly.)

So you both take exports from your respective programs, compare those to exports from your last collaborative session and see if you’ve both made the harmonizations agreed to from the previous worklist. (Because, if you are not following through on previous agreements, there is little sense in further collaboration.) Then compare & contrast the current exports to generate a new worklist of stuff to be harmonized.

emyoulation · May 17, 2023, 2:50pm

I do. And they are already REFN type Attributes with a numeric UUID. But maybe they should be list values?

So, for the following GEDCOM lines that currently generate an attribute REFN=60

1 REFN 60
2 TYPE wikitree.privacy

It would instead generate an attribute REFN=60;TYPE,wikitree.privacy
or
GEDCOM=5.5.1;REFN,60;TYPE,wikitree.privacy
(But that structure does not handle the cases like more level 2 entries and level 3 under the unrecognized tag.)
So, maybe
GEDCOM=5.5.1;1,REFN,60;2,TYPE,wikitree.privacy
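The last form proposed above can be packed and unpacked mechanically. The following is only a sketch of that idea: `encode_block` and `decode_block` are hypothetical helper names, not part of Gramps, and the format as written cannot carry values that themselves contain a semicolon.

```python
def encode_block(lines, version="5.5.1"):
    """Pack GEDCOM lines such as '1 REFN 60' into one attribute string."""
    parts = [f"GEDCOM={version}"]
    for line in lines:
        level, tag, *rest = line.split(" ", 2)
        value = rest[0] if rest else ""
        parts.append(f"{level},{tag},{value}")
    return ";".join(parts)


def decode_block(text):
    """Reverse encode_block; returns (version, list of GEDCOM lines)."""
    header, *items = text.split(";")
    version = header.partition("=")[2]
    lines = []
    for item in items:
        # Split on the first two commas only, so values may contain commas.
        level, tag, value = item.split(",", 2)
        lines.append(f"{level} {tag} {value}".rstrip())
    return version, lines


block = ["1 REFN 60", "2 TYPE wikitree.privacy"]
packed = encode_block(block)
print(packed)
print(decode_block(packed))
```

Because the level number travels with each tag, deeper nesting (level 3 and beyond) survives the round trip without any extra syntax.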

ennoborg · May 18, 2023, 9:54pm

I understand that, but I don’t believe that it’s the right way, unless I’m still misunderstanding a few things. And I think that, because my experience with GEDCOM comparison is quite bad. There are a few programs that work nice on small trees, but I haven’t found any that can deal with real trees, say of size 10,000 and larger, and allow you to focus on what’s really important. And I even know at least one, that was recommended on stack exchange, that can’t even load my tree within a reasonable time. And then my first question is, how can someone recommend such a thing?

The problem is, that most of these programs are quite dumb, assuming that the GEDCOM files that you work on have quite similar dialects, and that’s not the case with the files that I deal with, where one file comes from my own tree in Gramps, and the other from an on-line service, like Ancestry, My Heritage, or WikiTree, that doesn’t give a dime (replace with your favorite alternative four letter word) about standards.

My take would be different, and that is to run comparisons at the Gramps (XML) level, when most of the harmonization has already been done by the GEDCOM importer. Items will then still have different Gramps ID’s, and handles too, but most of the other differences between GEDCOM dialects can’t influence things then. And you can also rewrite the code that we already have in import & merge so that it can look for UID’s, or other ID’s that are more persistent than the standard Gramps/GEDCOM ID’s and handles. The latter sounds strange, because handles are also derived from uuid’s, but handles are designed to be unique, and persons with the same UID can co-exist in one database.
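As a rough illustration of matching on a persistent ID rather than on Gramps IDs or handles, the sketch below compares two sets of person records keyed by UID and produces a worklist of differences. It uses plain dictionaries instead of the Gramps database API, and the record layout (`name`, `birth`) and the function name `build_worklist` are assumptions for the example only.

```python
def build_worklist(mine, theirs, fields=("name", "birth")):
    """Compare two {uid: record} mappings and list what needs harmonizing."""
    worklist = []
    for uid in sorted(mine.keys() | theirs.keys()):
        a, b = mine.get(uid), theirs.get(uid)
        if a is None or b is None:
            # The person exists on one side only.
            side = "mine" if a is None else "theirs"
            worklist.append((uid, "missing", side))
            continue
        for field in fields:
            if a.get(field) != b.get(field):
                worklist.append((uid, field, (a.get(field), b.get(field))))
    return worklist


mine = {
    "UID-1": {"name": "William McCullough", "birth": "24 MAY 1841"},
    "UID-2": {"name": "Hollis McCullough", "birth": ""},
}
theirs = {
    "UID-1": {"name": "William F. McCullough", "birth": "24 MAY 1841"},
    "UID-3": {"name": "Unknown", "birth": ""},
}

worklist = build_worklist(mine, theirs)
for entry in worklist:
    print(entry)
```

The UID values here are placeholders; in practice they would come from whichever tag the importer has normalized them into.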

This requires some extra work on the GEDCOM importer, but that’s something that needs work anyway, to deal with Ancestry UID’s that are actually _UID’s, and the REFN examples that you gave in your other message.

There is another trick if the Gramps XML comparison hurdle is too big, and that is comparing GEDCOM files exported by Gramps, based on imports from different programs, because in that case, after correcting the importer, the Ancestry UID’s have already been transformed to _UID’s. I’m referring to these by GEDCOM tag name, meaning that a _UID is the standard set by PAF.

ennoborg · May 18, 2023, 10:09pm

I had a closer look at the REFN issue, and it looks like our GEDCOM importer is based on how REFN tags are exported by PAF, where the optional TYPE is not included. It is a legal tag, in GEDCOM 5.5.1, so one can consider this to be a bug. And to keep things simple, I would suggest that when a TYPE is specified, it replaces the standard attribute type for REFN, which is REFN. Using the TYPE value as the attribute type is the easiest way to prevent clashes with REFN values entered by users.
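A minimal sketch of that suggestion, written against plain GEDCOM lines rather than the actual Gramps importer code: a level-2 TYPE under a level-1 REFN replaces the default attribute type. The function name `refn_attributes` is hypothetical, and the sketch only looks at REFN lines at level 1.

```python
def refn_attributes(lines):
    """Return (attribute_type, value) pairs for level-1 REFN lines.

    A level-2 TYPE under a REFN replaces the default 'REFN' type.
    """
    attributes = []
    in_refn = False  # True while the current level-1 record is a REFN
    for line in lines:
        level, tag, *rest = line.split(" ", 2)
        value = rest[0] if rest else ""
        if level == "1":
            in_refn = tag == "REFN"
            if in_refn:
                attributes.append(["REFN", value])
        elif level == "0":
            in_refn = False
        elif level == "2" and tag == "TYPE" and in_refn:
            attributes[-1][0] = value  # TYPE becomes the attribute type
    return [tuple(pair) for pair in attributes]


record = [
    "0 @I3@ INDI",
    "1 REFN 9567903",
    "2 TYPE wikitree.user_id",
    "1 REFN 10072967",
    "2 TYPE wikitree.page_id",
    "1 REFN 60",
    "2 TYPE wikitree.privacy",
    "1 REFN 42",
]

attrs = refn_attributes(record)
print(attrs)
```

A REFN without a TYPE, like the last line of the sample, keeps the plain `REFN` attribute type, so values entered by users do not clash with the typed ones.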

In GEDCOM 5.5.1, you have RIN tags too, which are defined as

A unique record identification number assigned to the record by the source system. This number is intended to serve as a more sure means of identification of a record for reconciling differences in data between two interfacing systems.

Unlike REFN, these things are defined to be unique, and their goal is pretty much what you wrote in your earlier message.

Note that GEDCOM 7 has a new tag, named EXID, for external ID, for which the TYPE won’t be optional in GEDCOM 8, so they say. That might then be the best place to store those wikitree things.

GeorgeWilmes · May 19, 2023, 1:13am

How about creating separate GEDCOM importers/exporters for Ancestry, WikiTree, etc.? (And leaving the current one for “generic” GEDCOM imports/exports.)

And how about a new tool to do consistency checks in terms of how/where different pieces of data are stored in Gramps (UIDs, etc.) based on some new preferences settings?

emyoulation · May 19, 2023, 1:57am

Since the current 5.5.1 GEDCOM import, export & library are built-in plug-ins, it would make sense to adapt the “built-in” to “add-on” form as a common base. A single “funny accent” importer that recognizes dialects by Application name and adapts to them should be less of a maintenance hassle. (Particularly since several share certain oddities… such as European numeric date formats.) An exporter would be more of a challenge.
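Recognizing a dialect by application name could start from the `1 SOUR` line inside the `0 HEAD` record, which is where GEDCOM 5.5.1 puts the name of the sending system. The sketch below only shows that lookup. The application name `ExampleApp`, the `QUIRKS` table, and the quirk labels are invented for illustration, since the exact strings each service writes would have to be checked against real exports.

```python
def source_application(lines):
    """Return the value of '1 SOUR' inside the '0 HEAD' record, or None."""
    in_head = False
    for line in lines:
        parts = line.strip().split(" ", 2)
        if len(parts) < 2:
            continue  # blank or malformed line
        level, tag = parts[0], parts[1]
        value = parts[2] if len(parts) > 2 else ""
        if level == "0":
            if in_head:
                return None  # HEAD record ended without a SOUR line
            in_head = tag == "HEAD"
        elif in_head and level == "1" and tag == "SOUR":
            return value
    return None


# Hypothetical registry: application name -> quirks the importer should handle.
QUIRKS = {
    "ExampleApp": ["uid_without_underscore", "long_month_names"],
}

header = [
    "0 HEAD",
    "1 SOUR ExampleApp",
    "2 VERS 1.0",
    "1 GEDC",
    "2 VERS 5.5.1",
    "0 @I1@ INDI",
]

app = source_application(header)
quirks = QUIRKS.get(app, [])  # unknown applications get no special handling
print(app, quirks)
```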

Currently, the only GEDCOM add-on is the Export GEDCOM Extensions (GED2) plug-in.

ennoborg · May 19, 2023, 3:48pm

Sounds good. I have a few candidates, based on my experience with Ancestry and My Heritage. They’re all events that can be easily translated to attributes.

For Ancestry:

_FSID stores what was exported from Gramps as _FSFTID, the ‘official’ attribute for the FamilySearch Family Tree ID,
UID does the same for the _UID attribute, the uuid with checksum as standardized by PAF.

For My Heritage:

_UPD stores date and time for the last update of a person.

They’re not events, and shouldn’t have been imported as such in the first place, but if you have them, and like to keep them, it would be very nice to convert them to attributes.

ennoborg · May 20, 2023, 9:03pm

That’s right. And for a lazy person like me, it’s easier to create a filter that corrects the GEDCOM file before import. I just created one in Python, with a little help from my chatty friends, that does most of the work, including converting long American month names to their official abbreviations. I need that, because the Gramps GEDCOM importer has no idea about the meaning of June, or September.

Note that the European numeric dates that we discussed earlier were user made.
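A pre-import filter of the kind described in this post might look roughly like the sketch below. It is not the filter mentioned above, just an illustration covering two of the cases from this thread: long English month names on DATE lines and Ancestry's `UID` tag. The function name `fix_line` and the regular expressions are assumptions, and lines are expected without their trailing newline.

```python
import re

LONG_MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]
# GEDCOM month abbreviations are the first three letters, upper-cased.
# Names that are already three letters long (May) are left alone.
ABBREVIATIONS = {
    name.upper(): name[:3].upper() for name in LONG_MONTHS if len(name) > 3
}

MONTH_RE = re.compile(r"\b(" + "|".join(ABBREVIATIONS) + r")\b", re.IGNORECASE)
DATE_LINE_RE = re.compile(r"^(\d+ DATE )(.*)$")
UID_LINE_RE = re.compile(r"^(\d+) UID( |$)")


def fix_line(line):
    """Return one GEDCOM line with month names and UID tags normalized."""
    match = DATE_LINE_RE.match(line)
    if match:
        prefix, value = match.groups()
        value = MONTH_RE.sub(lambda m: ABBREVIATIONS[m.group(1).upper()], value)
        return prefix + value
    # Rename a bare UID tag to _UID; an existing _UID is left untouched.
    return UID_LINE_RE.sub(r"\1 _UID\2", line)


raw = [
    "1 UID 0123456789ABCDEF",
    "1 DEAT",
    "2 DATE 4 September 1917",
    "1 _UID FEDCBA9876543210",
]
fixed = [fix_line(line) for line in raw]
print("\n".join(fixed))
```

Running the filter on a copy of the export, rather than patching the importer, keeps the original file available for comparison.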

emyoulation · March 30, 2024, 6:13pm

While updating the “See Also” section of the GEDCOM wiki page, I noticed a list of custom GEDCOM tags where Gramps currently has crosswalk definitions:

There appear to be 4 modules related to GEDCOM import and export:

gramps/plugins/lib/libgedcom.py

gramps/plugins/importer/importgedcom.py

gramps/plugins/export/exportgedcom.py

addon GedcomExtensions/GedcomExtensions.py

There are Custom Event Tags listed at lines 735-754 in the libgedcom.py file.

CUSTOMEVENTTAGS = {
    "_CIRC": _("Circumcision"),
    "_COML": _("Common Law Marriage"),
    "_DEST": _("Destination"),
    "_DNA": _("DNA"),
    "_DCAUSE": _("Cause of Death"),
    "_EMPLOY": _("Employment"),
    "_EXCM": _("Excommunication"),
    "_EYC": _("Eye Color"),
    "_FUN": _("Funeral"),
    "_HEIG": _("Height"),
    "_INIT": _("Initiatory (LDS)"),
    "_MILTID": _("Military ID"),
    "_MISN": _("Mission (LDS)"),
    "_NAMS": _("Namesake"),
    "_ORDI": _("Ordinance"),
    "_ORIG": _("Origin"),
    "_SEPR": _("Separation"),  # Applies to Families
    "_WEIG": _("Weight"),
}

And other tokens at lines 317-366.

TOKENS = {
    "_ADPN": TOKEN__ADPN,
    "_AKA": TOKEN__AKA,
    "_AKAN": TOKEN__AKA,
    "_ALIA": TOKEN_ALIA,
    "_ANCES_ORDRE": TOKEN_IGNORE,
    "_APID": TOKEN__APID,  # Ancestry.com database and page id
    "_CAT": TOKEN_IGNORE,
    "_CHUR": TOKEN_IGNORE,
    "_COMM": TOKEN__COMM,
    "_DATE": TOKEN__DATE,
    "_DATE2": TOKEN_IGNORE,
    "_DETAIL": TOKEN_IGNORE,
    "_EMAIL": TOKEN_EMAIL,
    "_E-MAIL": TOKEN_EMAIL,
    "_FREL": TOKEN__FREL,
    "_FSFTID": TOKEN__FSFTID,
    "_GODP": TOKEN__GODP,
    "_ITALIC": TOKEN_IGNORE,
    "_JUST": TOKEN__JUST,  # FTM Citation Quality Justification
    "_LEVEL": TOKEN_IGNORE,
    "_LINK": TOKEN__LINK,
    "_LKD": TOKEN__LKD,
    "_LOC": TOKEN__LOC,
    "_MAR": TOKEN__MAR,
    "_MARN": TOKEN__MARN,
    "_MARNM": TOKEN__MARNM,
    "_MASTER": TOKEN_IGNORE,
    "_MEDI": TOKEN_MEDI,
    "_MREL": TOKEN__MREL,
    "_NAME": TOKEN__NAME,
    "_PAREN": TOKEN_IGNORE,
    "_PHOTO": TOKEN__PHOTO,
    "_PLACE": TOKEN_IGNORE,
    "_PREF": TOKEN__PRIMARY,
    "_PRIM": TOKEN__PRIM,
    "_PRIMARY": TOKEN__PRIMARY,
    "_PRIV": TOKEN__PRIV,
    "_PUBLISHER": TOKEN_IGNORE,
    "_RUFNAME": TOKEN__CALLNAME,
    "_SCBK": TOKEN_IGNORE,
    "_SCHEMA": TOKEN__SCHEMA,
    "_SSHOW": TOKEN_IGNORE,
    "_STAT": TOKEN__STAT,
    "_TEXT": TOKEN__TEXT,
    "_TODO": TOKEN__TODO,
    "_TYPE": TOKEN_TYPE,
    "_UID": TOKEN__UID,
    "_URL": TOKEN_WWW,
    "_WITN": TOKEN__WITN,
    "_WTN": TOKEN__WTN,

And a few others scattered in odd places:

EventType.DEGREE: "_DEG",
EventType.ELECTED: "_ELEC", # FTM custom tag
EventType.MED_INFO: "_MDCL",
EventType.MILITARY_SERV: "_MILT",
sattr.set_type("_APID")
