TreeMerge - porting an experimental GEDCOM matching tool

Question: How would you define an exact match?

Hints:

  1. In PAF, automatic merges are possible for persons that have the same UID and for whom all other data agrees,
  2. Other programs have similar features, which I can’t always test without paying,
  3. We support matching on handle in our diff and merge tool, but handles are not as persistent as UIDs, and can be changed on import, for example when you import a backup into the database it was made from.

My idea for “certain match” is to use SVM categorisation. I want to be able to merge trees from different places/persons, which I believe excludes using UIDs.

SVM is a machine-learning tool that you train to recognize and group objects (feature vectors) into categories. In this case the feature vectors are comparisons between persons, and the categories are ‘match’ and ‘no match’.
A feature vector consists of various aspects like name similarity, event similarity, etc. (more details on GitHub).
SVM can give a probability that a match is an exact match, and a ‘certain match’ would be one where the SVM-calculated probability of being an exact match is above 90 to 95 %.
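
As a rough illustration of the idea above, here is a minimal sketch of building a feature vector for a pair of persons and turning a probability into a category. It is not the TreeMerge code: the function names, the chosen features, the neutral value 0.5 for unknown dates, the 0.95 and 0.5 thresholds, and the extra ‘possible match’ category are all assumptions made for the example, and the probability itself is assumed to come from an already trained SVM.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity in [0, 1]; 1.0 means identical after case folding."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def year_similarity(y1, y2, max_gap=10):
    """1.0 for the same year, falling linearly to 0.0 at max_gap years apart.

    Returns 0.5 (an assumed neutral value) when either year is unknown.
    """
    if y1 is None or y2 is None:
        return 0.5
    return max(0.0, 1.0 - abs(y1 - y2) / max_gap)


def feature_vector(p1, p2):
    """Compare two persons (plain dicts here) aspect by aspect."""
    return [
        name_similarity(p1["name"], p2["name"]),
        year_similarity(p1.get("birth"), p2.get("birth")),
        year_similarity(p1.get("death"), p2.get("death")),
    ]


def classify(probability, certain=0.95, possible=0.5):
    """Map a match probability (from a trained classifier) to a category."""
    if probability >= certain:
        return "certain match"
    if probability >= possible:
        return "possible match"
    return "no match"
```

A trained classifier would take the output of `feature_vector` as input and return the probability that is passed to `classify`.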

I believe that this is better than requiring that all data match exactly. On the other hand, there is always the possibility that SVM is wrong and you get a false match - which must be handled in some way.

In the machine learning aspect, I wonder if a learned optimization might not be possible?

If we synchronize data from the same sources repeatedly, it is likely that the input streams from successive downloads will match each other exactly more often than they match what is finally stored after a download is merged.

Perhaps some sort of audit trail custom attribute would help it learn not to do the same work twice?

Gramps stores a “Merged Gramps ID” as a primitive audit trail. But that’s not very useful since the IDs are so malleable. (I don’t know of any feature that has been created to leverage this attribute. Perhaps it is just a convenience feature?) The Merged Gramps ID doesn’t identify the modify date, handle, or the originating Tree. A Gramps object XML starts with the object category, handle, change epoch date value, & ID. That’s a bit more actionable.

(The Gramps XML format does not store a unique tree handle either. Gramps creates one for the database folder. Perhaps an internal audit trail chunk should be created that stores this, along with when the database changes and the IDs of import files?)

Since attributes are a type & value 2-tuple, maybe the Merged Gramps ID could store a CSV array as a value?
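
A minimal sketch of what storing several IDs in one attribute value could look like, using only Python's `csv` module. The helper names are hypothetical and this does not touch the Gramps API; it only shows how the value string of the (type, value) pair might be encoded and decoded.

```python
import csv
import io


def encode_ids(ids):
    """Encode a list of IDs as one CSV line, suitable as an attribute value."""
    buf = io.StringIO()
    csv.writer(buf).writerow(ids)
    return buf.getvalue().rstrip("\r\n")


def decode_ids(value):
    """Decode an attribute value back into a list of IDs."""
    if not value:
        return []
    return next(csv.reader([value]))


def add_merged_id(value, new_id):
    """Return the attribute value with new_id appended, unless already present."""
    ids = decode_ids(value)
    if new_id not in ids:
        ids.append(new_id)
    return encode_ids(ids)
```

For example, `add_merged_id("I0001", "I0042")` gives `"I0001,I0042"`, and adding `"I0001"` again leaves the value unchanged, which is the kind of check a merge could use to see whether an ID was merged before.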

Then merges could become smart enough to see if the data has been merged before and handle it appropriately?

Two notes, each containing a JSON copy of one of the objects as it was before the merge?

And some extra information, like a merge timestamp.
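
A minimal sketch of what the text of such a note could look like, assuming the object is available as a plain dict. The function name and the field names `merged_at`, `reason` and `before` are made up for the example; nothing here is an existing Gramps feature.

```python
import json
from datetime import datetime, timezone


def merge_audit_note(before, reason, when=None):
    """Build the text of a note recording one object as it was before a merge."""
    when = when or datetime.now(timezone.utc)
    record = {
        "merged_at": when.isoformat(timespec="seconds"),
        "reason": reason,
        "before": before,  # snapshot of the object prior to the merge
    }
    return json.dumps(record, sort_keys=True, ensure_ascii=False)
```

Calling this once for each of the two merged objects would give the two notes, and the timestamp makes it possible to find the last backup taken before the merge.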

I do that manually: not the JSON, but I add a note with a timestamp and the reason why I merged these records, so that I can retrieve the Gramps backup file from before the merge if needed. If Gramps could do some parts of this job automatically, that would be great.

OK, I understand. And as far as I’m concerned, there are at least two scenarios:

  1. When I upload my own tree to a site like Ancestry, it will have UIDs, because they were supplied by the program that I used before Gramps, which is PAF, and because I added some code to add UIDs to all persons that I created in Gramps.

Now, when I add information on Ancestry, like by confirming hints, only a small percentage of my data is changed, meaning that when I download a GEDCOM 99 % of the persons are not changed, and when I assume that added persons have unique UIDs, I can be sure that all persons can be matched on UID, and any person that has identical data can be merged, or maybe even ignored, if the merge is with another database, instead of being done inside my tree. It’s only when there is conflicting data, that I want to make a decision, and maybe when one person has more events than the other. I may also want automatic merges of extra attributes, like the IDs used by Ancestry, and FamilySearch.

If 99 % of the persons are unchanged, it means that I don’t want to be bothered about them, which also means that for all of those, I demand that the algorithm sees all of those as an exact match. And that is a requirement, because there are persons that only have a small vector to compare on, like only a name, without a known birth or death date, for whom I know their position in my tree.

When only 90 % of these are detected as exact matches, it means that I have to decide for 10 % of the persons, even though I know (and can test with RootsMagic) that only 1 % of the persons have a relevant change. In that case, 90 % of my merge actions are a waste of time, which means that I will not use the tool, but rely on the report that I get from RootsMagic.

  2. When I receive a tree from a friend or relative, it will normally not have matching UIDs, unless that person received some data from me earlier. And in this case, I have no idea about the number of people that match. In the case of a relative, however, there is quite a big chance that his/her tree has a cluster of persons that match, like a few generations of common ancestors. This means that, IMO, SVM works best when it can also process relations, and maybe even that the matching can be seeded with IDs of persons for whom I know that they match, like that common ancestor. In that case, for all other persons, their path to that common person can be part of the vector.

Would such an approach be possible? It can make the algorithm much faster, and I know examples on GitHub, where the algorithm may also tell where a new person connects to an existing one.
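
To make the seeding idea a bit more concrete, here is a minimal sketch, not taken from TreeMerge or any of the GitHub projects mentioned. It assumes each tree is a plain dict mapping a person ID to a `(father, mother)` tuple, and that one pair of persons (the seed, for example the common ancestor) is already known to match. Each descendant of the seed gets a path of `F`/`M` steps up to the seed, and persons with the same path in both trees become candidates.

```python
from collections import deque


def paths_to_seed(parents, seed):
    """Map each descendant of seed to its path of 'F'/'M' steps up to the seed.

    parents maps person id -> (father id or None, mother id or None).
    With pedigree collapse, the breadth-first search keeps the shortest path.
    """
    # Invert child -> parents into parent -> [(child, step)].
    children = {}
    for child, (father, mother) in parents.items():
        if father is not None:
            children.setdefault(father, []).append((child, "F"))
        if mother is not None:
            children.setdefault(mother, []).append((child, "M"))
    paths = {seed: ""}
    queue = deque([seed])
    while queue:
        person = queue.popleft()
        for child, step in children.get(person, []):
            if child not in paths:
                paths[child] = step + paths[person]
                queue.append(child)
    return paths


def candidates_by_path(paths_a, paths_b):
    """For each person in tree A, list persons in tree B with the same path."""
    by_path = {}
    for person, path in paths_b.items():
        by_path.setdefault(path, []).append(person)
    return {a: by_path.get(path, []) for a, path in paths_a.items()}
```

Siblings share the same path, so this only narrows down the candidates; the path (or a similarity derived from it) would be one more element in the feature vector, not a decision by itself.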

For use case 1, I prefer a match with an external database, like in the Import and Merge tool:

https://gramps-project.org/wiki/index.php/Addon:Import_Merge_Tool
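
For what it's worth, the UID part of use case 1 can be sketched in a few lines. This is not how the Import and Merge tool works internally; it assumes both trees have been reduced to dicts that map a UID to a dict of person data, and the function name is made up.

```python
def partition_by_uid(mine, theirs):
    """Split the incoming persons into unchanged, conflicting and new UIDs.

    Both arguments map UID -> dict of person data (an assumed, simplified model).
    """
    unchanged, conflicting, new = [], [], []
    for uid, person in theirs.items():
        if uid not in mine:
            new.append(uid)           # person added on the other side
        elif mine[uid] == person:
            unchanged.append(uid)     # identical data: merge or ignore silently
        else:
            conflicting.append(uid)   # differing data: needs a decision
    return unchanged, conflicting, new
```

Only the `conflicting` list would need my attention. A real comparison would have to be smarter than `==`, for example by treating extra attributes such as Ancestry or FamilySearch IDs as additions rather than conflicts.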

Hi,
This is a slightly different use case than the one I’ve been thinking about. It’s an interesting idea worth pursuing. However, I guess you will also need to record and store manual decisions about non-matches, for example for persons that have similar data but are not the same person.

Hi,
Your scenario 1 is most likely solved better using a tool like the Import and Merge tool.

Scenario 2 suits treemerge better and I think that using family relations is the way forward.

I agree. Scenario 1 could very well be done with a modified Import and Merge tool. And to make it work, we must first agree on setting UIDs in Gramps. Long ago I made an official patch to make sure that they are imported, but I use a hacked tool to add them to my database, which is still Gramps 3.4, with a flat location model. And if we add automatic UID generation, we must also think about where they are used: for individuals only, or for families too.

For scenario 2, I would also like an option to use that tool with your algorithm, if I can use that to create a summary that tells me how many people have a ‘certain’ match, so that I can get an idea about the amount of overlap.
