Are GEDCOM dialect imports reversible?

When the Gramps GEDCOM importer plugin sees a custom tag, it adds the unrecognized content as a Note.

Is this content reversible?

That is to say, will the GEDCOM Export write the custom tag back to the same place in the structured data?

WikiTree is an example. They break out a Middle name in their internal data structure and do not include that name in the standard GIVN (Given Name) value. Instead they write a _MIDN custom tag with that portion of the given name.

This content is preserved as a Note for the Person. But is there enough context preserved to restore the data? Or even to do meaningful post-processing of the ambiguous data?

I don’t see how, since the Note is not attached as a secondary object to the same object. (The Note is attached to the Person object, not to the more appropriate ‘preferred Name’ secondary object of that Person object.)

In the example of a WikiTree GEDCOM being imported, the “GEDCOM import” type Note is attached to the person with the content:

Records not imported into INDI (individual) Gramps ID I000003:

Line ignored as not understood                                      Line    93: 2 _MIDN Franklin

Meanwhile, the original context in the “WikiTree_McCulloughWm1775.ged” GEDCOM source file starts at line 90 and runs through line 155:

0 @I3@ INDI
1 NAME William Franklin /McCullough/
2 GIVN William
2 _MIDN Franklin
2 SURN McCullough
2 NSFX Sr
1 SEX M
1 BIRT
2 DATE 24 May 1841
2 PLAC Westmoreland, Pennsylvania, United States
1 DEAT
2 DATE 4 Sep 1917
2 PLAC 136 Boyles ave, New Castle, Lawrence, Pennsylvania, United States
1 WWW https://www.WikiTree.com/wiki/McCullough-1311 
1 FAMC @F2@
1 FAMS @F3@
1 NOTE == Biography ==
2 CONT 
2 CONT '''William F. McCullough, Sr.'''
2 CONT 

Lines 108-147 are just concatenated lines from the Biography. Continuing with line 148:

2 CONC  />Lawrence county, Pennsylvania
2 CONT * {{FindAGrave|159862323}} for Wm. F. McCullough, Sr.
1 REFN 9567903
2 TYPE wikitree.user_id
1 REFN 10072967
2 TYPE wikitree.page_id
1 REFN 60
2 TYPE wikitree.privacy
0 @I4@ INDI
1 NAME Hollis Rushton /McCullough/

The Note about the misunderstood level ‘2’ Tag is attached to the level ‘0’ Individual but is actually secondary to the level 1 “NAME” for that individual. The Individual attachment is known in the imported data but the NAME attachment is lost.
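As a rough sketch of what a post-processing tool could do if the original GEDCOM file is still at hand: the Note records the source line number, so the enclosing level 1 tag and level 0 record can be recovered by walking back up the file. The function names below (parse_gedcom_line, ancestors_of, context_for_note) are made up for illustration and are not part of Gramps. The regular expression assumes the Note text looks like the example above.

```python
import re

# Matches the tail of the Note line shown above, e.g.
# "Line ignored as not understood      Line    93: 2 _MIDN Franklin"
IGNORED = re.compile(r"Line\s+(\d+):\s*(\d+)\s+(\S+)\s*(.*)")


def parse_gedcom_line(text):
    """Split one non-blank GEDCOM line into (level, xref, tag, value)."""
    parts = text.strip().split(" ", 2)
    level = int(parts[0])
    xref = None
    if len(parts) > 1 and parts[1].startswith("@"):
        # Record line such as "0 @I3@ INDI"
        xref = parts[1]
        rest = parts[2].split(" ", 1) if len(parts) > 2 else []
    else:
        rest = parts[1:]
    tag = rest[0] if rest else ""
    value = rest[1] if len(rest) > 1 else ""
    return level, xref, tag, value


def ancestors_of(lines, line_no):
    """Return the enclosing lines of 1-based line_no, outermost first."""
    level = parse_gedcom_line(lines[line_no - 1])[0]
    wanted = level - 1
    chain = []
    for idx in range(line_no - 2, -1, -1):
        if wanted < 0:
            break
        if not lines[idx].strip():
            continue
        lvl, xref, tag, value = parse_gedcom_line(lines[idx])
        if lvl == wanted:
            chain.append((idx + 1, lvl, xref, tag, value))
            wanted -= 1
    return list(reversed(chain))


def context_for_note(note_line, lines):
    """Combine the Note text with the source file to rebuild the context."""
    match = IGNORED.search(note_line)
    if match is None:
        return None
    line_no = int(match.group(1))
    chain = ancestors_of(lines, line_no)
    return {
        "line": line_no,
        "tag": match.group(3),
        "value": match.group(4).strip(),
        "record": chain[0][2] if chain else None,
        "path": [entry[3] for entry in chain],
    }
```

For the _MIDN example this would report the record @I3@ and the path INDI, NAME, which is exactly the attachment that is lost in the imported Note. It only works while the source file is unchanged, so it does not replace storing the context at import time.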

The Gramps ID for the Individual is I0003 … but that correlation was contingent on being imported into an empty Tree. There is no direct reference to @I3@ of the GEDCOM.

Further, there are 3 REFN Attributes (9567903, 10072967, 60) added to Person for the final parts of the GEDCOM segment for this individual. But the wikitree.user_id, wikitree.page_id, and wikitree.privacy TYPE context is completely lost for those 3 REFNs.

Reversing the import to a WikiTree dialect export is probably a pipe dream.

But adding more context to allow a post-processing tool to do meaningful work seems reasonable.

No. Why would you want to do that?

Post-processing would be nice, but I wonder how that can be done in a meaningful way. And that also triggers the question: What do you see as meaningful work?

Here’s another thought about this:

Ancestry exports UID’s with a UID tag, not _UID, like it should, and when Gramps sees those, it creates UID events. They’re not events, because they have no dates, and they look pretty stupid to me. Post-processing could detect such events, and convert them to attributes.

Those REFN’s would also be perfect candidates for attributes, don’t you think?

In a collaborative sense, it would be building tools that compare & contrast trees by different researchers to make a worklist. (Which probably means bringing data in from different programs without merging them. So post-processing workflows to harmonize GEDCOM dialect quirks will probably be needed repeatedly.)

So you both take exports from your respective programs, compare those to exports from your last collaborative session and see if you’ve both made the harmonizations agreed to from the previous worklist. (Because, if you are not following through on previous agreements, there is little sense in further collaboration.) Then compare & contrast the current exports to generate a new worklist of stuff to be harmonized.

I do. And they are already REFN type Attributes with a numeric UUID. But maybe they should be list values?

So, for the following GEDCOM lines, which currently generate an attribute REFN=60

1 REFN 60
2 TYPE wikitree.privacy

It would instead generate an attribute REFN=60;TYPE,wikitree.privacy
or
GEDCOM=5.5.1;REFN,60;TYPE,wikitree.privacy
(But that structure does not handle the cases like more level 2 entries and level 3 under the unrecognized tag.)
So, maybe
GEDCOM=5.5.1;1,REFN,60;2,TYPE,wikitree.privacy
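To make that last form concrete, here is a minimal sketch of packing and unpacking it. The format itself is only the proposal above, not anything Gramps does today, and encode_structure / decode_structure are hypothetical names. It assumes no value contains a semicolon; a real implementation would need an escaping rule for that.

```python
def encode_structure(lines, version="5.5.1"):
    """Pack GEDCOM lines into e.g. 'GEDCOM=5.5.1;1,REFN,60;2,TYPE,x'."""
    fields = ["GEDCOM=" + version]
    for line in lines:
        # level, tag, and an optional value (which may contain spaces)
        level, tag, *value = line.strip().split(" ", 2)
        fields.append(",".join([level, tag] + value))
    return ";".join(fields)


def decode_structure(text):
    """Reverse encode_structure: return (version, list of GEDCOM lines)."""
    head, *items = text.split(";")
    version = head.partition("=")[2]
    lines = []
    for item in items:
        # Split on the first two commas only, so values may contain commas
        level, tag, *value = item.split(",", 2)
        lines.append(" ".join([level, tag] + value))
    return version, lines
```

Because the level number travels with each entry, extra level 2 and level 3 lines under the unrecognized tag survive the round trip.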

I understand that, but I don’t believe that it’s the right way, unless I’m still misunderstanding a few things. And I think that, because my experience with GEDCOM comparison is quite bad. There are a few programs that work nice on small trees, but I haven’t found any that can deal with real trees, say of size 10,000 and larger, and allow you to focus on what’s really important. And I even know at least one, that was recommended on stack exchange, that can’t even load my tree within a reasonable time. And then my first question is, how can someone recommend such a thing?

The problem is, that most of these programs are quite dumb, assuming that the GEDCOM files that you work on have quite similar dialects, and that’s not the case with the files that I deal with, where one file comes from my own tree in Gramps, and the other from an on-line service, like Ancestry, My Heritage, or WikiTree, that doesn’t give a dime (replace with your favorite alternative four letter word) about standards.

My take would be different, and that is to run comparisons at the Gramps (XML) level, when most of the harmonization has already been done by the GEDCOM importer. Items will then still have different Gramps ID’s, and handles too, but most of the other differences between GEDCOM dialects can’t influence things then. And you can also rewrite the code that we already have in import & merge so that it can look for UID’s, or other ID’s that are more persistent than the standard Gramps/GEDCOM ID’s and handles. The latter sounds strange, because handles are also derived from uuid’s, but handles are designed to be unique, and persons with the same UID can co-exist in one database.
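A very rough sketch of that idea, with plain dictionaries standing in for Gramps person objects (the "uid", "name", "birth" and "death" keys are assumptions for illustration, not the Gramps data model). Since persons with the same UID can co-exist, the index maps each UID to a list rather than to a single person.

```python
from collections import defaultdict


def index_by_uid(people):
    """Map each UID to the list of persons carrying it."""
    index = defaultdict(list)
    for person in people:
        uid = person.get("uid")
        if uid:
            index[uid].append(person)
    return index


def build_worklist(tree_a, tree_b, fields=("name", "birth", "death")):
    """List (uid, field, value_a, value_b) for every differing field."""
    index_a = index_by_uid(tree_a)
    index_b = index_by_uid(tree_b)
    worklist = []
    # Only persons present in both trees (by UID) can be compared
    for uid in sorted(index_a.keys() & index_b.keys()):
        for pa in index_a[uid]:
            for pb in index_b[uid]:
                for field in fields:
                    if pa.get(field) != pb.get(field):
                        worklist.append(
                            (uid, field, pa.get(field), pb.get(field))
                        )
    return worklist
```

Persons without a UID are skipped here; they would still need the usual name and date matching.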

This requires some extra work on the GEDCOM importer, but that’s something that needs work anyway, to deal with Ancestry UID’s that are actually _UID’s, and the REFN examples that you gave in your other message.

There is another trick if the Gramps XML comparison hurdle is too big, and that is comparing GEDCOM files exported by Gramps, based on imports from different programs, because in that case, after correcting the importer, the Ancestry UID’s have already been transformed to _UID’s. I’m referring to these by GEDCOM tag name, meaning that a _UID is the standard set by PAF.

I had a closer look at the REFN issue, and it looks like our GEDCOM importer is based on how REFN tags are exported by PAF, where the optional TYPE is not included. It is a legal tag, in GEDCOM 5.5.1, so one can consider this to be a bug. And to keep things simple, I would suggest that when a TYPE is specified, it replaces the standard attribute type for REFN, which is REFN. Using the TYPE value as the attribute type is the easiest way to prevent clashes with REFN values entered by users.
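The suggestion could look roughly like this outside of Gramps. It is a standalone sketch, not the importer's actual code, and refn_attributes is a made-up name: a level 1 REFN yields an attribute of type REFN unless a level 2 TYPE follows it, in which case the TYPE value becomes the attribute type.

```python
def refn_attributes(record_lines):
    """Return (attribute_type, value) pairs for the REFNs of one record."""
    attrs = []
    current = None  # the REFN that a following level 2 TYPE belongs to
    for line in record_lines:
        parts = line.strip().split(" ", 2)
        if len(parts) < 2:
            continue
        level, tag = parts[0], parts[1]
        value = parts[2] if len(parts) > 2 else ""
        if level == "1" and tag == "REFN":
            current = ["REFN", value]
            attrs.append(current)
        elif level == "2" and tag == "TYPE" and current is not None:
            if value:
                current[0] = value  # TYPE replaces the default type
        elif level in ("0", "1"):
            current = None  # left the REFN structure
    return [tuple(attr) for attr in attrs]
```

For the WikiTree lines quoted earlier this gives wikitree.user_id, wikitree.page_id and wikitree.privacy as attribute types, while a PAF-style REFN without TYPE stays a plain REFN.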

In GEDCOM 5.5.1, you have RIN tags too, which are defined as

A unique record identification number assigned to the record by the source system. This number is intended to serve as a more sure means of identification of a record for reconciling differences in data between two interfacing systems.

Unlike REFN, these things are defined to be unique, and their goal is pretty much what you wrote in your earlier message.

Note that GEDCOM 7 has a new tag, named EXID, for external ID, for which the TYPE won’t be optional in GEDCOM 8, so they say. That might then be the best place to store those wikitree things.

How about creating separate GEDCOM importers/exporters for Ancestry, WikiTree, etc.? (And leaving the current one for “generic” GEDCOM imports/exports.)

And how about a new tool to do consistency checks in terms of how/where different pieces of data are stored in Gramps (UIDs, etc.) based on some new preferences settings?

Since the current 5.5.1 GEDCOM import, export & library are built-in plug-ins, it would make sense to adapt the “built-in” to “add-on” form as a common base. A single “funny accent” importer that recognizes dialects by Application name and adapts to them should be less of a maintenance hassle. (Particularly since several share certain oddities… such as European numeric date formats.) An exporter would be more of a challenge.
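Recognizing the dialect could start from the header, where GEDCOM 5.5.1 puts the system ID in the level 1 SOUR line under HEAD. A minimal sketch follows; source_system and pick_dialect are hypothetical names, and the keys in the lookup table are placeholders, since I have not checked what each service actually writes there.

```python
def source_system(lines):
    """Return the value of '1 SOUR' in the HEAD record, or None."""
    in_head = False
    for line in lines:
        # Drop a possible byte order mark before splitting
        parts = line.lstrip("\ufeff").strip().split(" ", 2)
        if len(parts) < 2:
            continue
        level, tag = parts[0], parts[1]
        if level == "0":
            if in_head:
                break  # HEAD is finished, no SOUR found
            in_head = tag == "HEAD"
        elif in_head and level == "1" and tag == "SOUR":
            return parts[2] if len(parts) > 2 else ""
    return None


# Placeholder table: the keys are NOT verified system IDs
DIALECTS = {
    "EXAMPLE_APP_A": "dialect_a",
    "EXAMPLE_APP_B": "dialect_b",
}


def pick_dialect(lines, table=DIALECTS):
    """Map the header's system ID to a dialect name, else 'generic'."""
    return table.get(source_system(lines), "generic")
```

Anything not in the table falls back to the generic 5.5.1 behaviour, so the current importer would stay the default.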

Currently, the only GEDCOM add-on is the Export GEDCOM Extensions (GED2) plug-in.

Sounds good. I have a few candidates, based on my experience with Ancestry and My Heritage. They’re all events that can be easily translated to attributes.

For Ancestry:

_FSID stores what was exported from Gramps as _FSFTID, the ‘official’ attribute for the FamilySearch Family Tree ID,
UID does the same for the _UID attribute, the uuid with checksum as standardized by PAF.

For My Heritage:

_UPD stores date and time for the last update of a person.

They’re not events, and shouldn’t have been imported as such in the first place, but if you have them, and like to keep them, it would be very nice to convert them to attributes.

That’s right. And for a lazy person like me, it’s easier to create a filter that corrects the GEDCOM file before import. I just created one in Python, with a little help from my chatty friends, that does most of the work, including converting long American month names to their official abbreviations. I need that, because the Gramps GEDCOM importer has no idea about the meaning of June, or September.
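For anyone who wants to try the same approach, here is a minimal sketch of such a pre-import filter. It is not the filter mentioned above, only an illustration under my own assumptions: it renames the two Ancestry tags discussed earlier (UID to _UID, _FSID to _FSFTID) and replaces long English month names on DATE lines. It does not reorder dates written as "September 4, 1917".

```python
import re

MONTHS = {
    "january": "JAN", "february": "FEB", "march": "MAR",
    "april": "APR", "may": "MAY", "june": "JUN",
    "july": "JUL", "august": "AUG", "september": "SEP",
    "october": "OCT", "november": "NOV", "december": "DEC",
}
# Tag renames taken from the discussion above
TAG_RENAMES = {"UID": "_UID", "_FSID": "_FSFTID"}

MONTH_RE = re.compile(r"\b(" + "|".join(MONTHS) + r")\b", re.IGNORECASE)
# level, optional @XREF@, tag, rest of line
LINE_RE = re.compile(r"^(\s*\d+ )(@[^@]+@ )?(\S+)(.*)$")


def fix_line(line):
    """Return one GEDCOM line (without newline) with dialect fixes applied."""
    match = LINE_RE.match(line)
    if match is None:
        return line
    prefix, xref, tag, rest = match.groups()
    tag = TAG_RENAMES.get(tag, tag)
    if tag == "DATE":
        rest = MONTH_RE.sub(lambda m: MONTHS[m.group(1).lower()], rest)
    return prefix + (xref or "") + tag + rest


def filter_gedcom(lines):
    """Apply fix_line to every line, preserving line endings."""
    fixed = []
    for line in lines:
        body = line.rstrip("\r\n")
        fixed.append(fix_line(body) + line[len(body):])
    return fixed
```

Running the file through filter_gedcom before import keeps the importer untouched, at the cost of having to repeat the step for every new export.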

Note that the European numeric dates that we discussed earlier were user made.

While updating the “See Also” section of the GEDCOM wiki page, I noticed a list of custom GEDCOM tags where Gramps currently has crosswalk definitions:

There appear to be 4 modules related to GEDCOM import and export:

There are Custom Event Tags listed at lines 735-754 in the libgedcom.py file.

CUSTOMEVENTTAGS = {
    "_CIRC": _("Circumcision"),
    "_COML": _("Common Law Marriage"),
    "_DEST": _("Destination"),
    "_DNA": _("DNA"),
    "_DCAUSE": _("Cause of Death"),
    "_EMPLOY": _("Employment"),
    "_EXCM": _("Excommunication"),
    "_EYC": _("Eye Color"),
    "_FUN": _("Funeral"),
    "_HEIG": _("Height"),
    "_INIT": _("Initiatory (LDS)"),
    "_MILTID": _("Military ID"),
    "_MISN": _("Mission (LDS)"),
    "_NAMS": _("Namesake"),
    "_ORDI": _("Ordinance"),
    "_ORIG": _("Origin"),
    "_SEPR": _("Separation"),  # Applies to Families
    "_WEIG": _("Weight"),
}

And other tokens at lines 317-366.

TOKENS = {
    "_ADPN": TOKEN__ADPN,
    "_AKA": TOKEN__AKA,
    "_AKAN": TOKEN__AKA,
    "_ALIA": TOKEN_ALIA,
    "_ANCES_ORDRE": TOKEN_IGNORE,
    "_APID": TOKEN__APID,  # Ancestry.com database and page id
    "_CAT": TOKEN_IGNORE,
    "_CHUR": TOKEN_IGNORE,
    "_COMM": TOKEN__COMM,
    "_DATE": TOKEN__DATE,
    "_DATE2": TOKEN_IGNORE,
    "_DETAIL": TOKEN_IGNORE,
    "_EMAIL": TOKEN_EMAIL,
    "_E-MAIL": TOKEN_EMAIL,
    "_FREL": TOKEN__FREL,
    "_FSFTID": TOKEN__FSFTID,
    "_GODP": TOKEN__GODP,
    "_ITALIC": TOKEN_IGNORE,
    "_JUST": TOKEN__JUST,  # FTM Citation Quality Justification
    "_LEVEL": TOKEN_IGNORE,
    "_LINK": TOKEN__LINK,
    "_LKD": TOKEN__LKD,
    "_LOC": TOKEN__LOC,
    "_MAR": TOKEN__MAR,
    "_MARN": TOKEN__MARN,
    "_MARNM": TOKEN__MARNM,
    "_MASTER": TOKEN_IGNORE,
    "_MEDI": TOKEN_MEDI,
    "_MREL": TOKEN__MREL,
    "_NAME": TOKEN__NAME,
    "_PAREN": TOKEN_IGNORE,
    "_PHOTO": TOKEN__PHOTO,
    "_PLACE": TOKEN_IGNORE,
    "_PREF": TOKEN__PRIMARY,
    "_PRIM": TOKEN__PRIM,
    "_PRIMARY": TOKEN__PRIMARY,
    "_PRIV": TOKEN__PRIV,
    "_PUBLISHER": TOKEN_IGNORE,
    "_RUFNAME": TOKEN__CALLNAME,
    "_SCBK": TOKEN_IGNORE,
    "_SCHEMA": TOKEN__SCHEMA,
    "_SSHOW": TOKEN_IGNORE,
    "_STAT": TOKEN__STAT,
    "_TEXT": TOKEN__TEXT,
    "_TODO": TOKEN__TODO,
    "_TYPE": TOKEN_TYPE,
    "_UID": TOKEN__UID,
    "_URL": TOKEN_WWW,
    "_WITN": TOKEN__WITN,
    "_WTN": TOKEN__WTN,

And a few others scattered in odd places:

  • EventType.DEGREE: "_DEG",
  • EventType.ELECTED: "_ELEC", # FTM custom tag
  • EventType.MED_INFO: "_MDCL",
  • EventType.MILITARY_SERV: "_MILT",
  • sattr.set_type("_APID")