TreeMerge - porting an experimental GEDCOM matching tool

andersardo · January 27, 2022, 11:36am

Hi,
I’ve been working on this “Merging data from two databases” problem on and off. There is an early prototype (unfortunately only with a Swedish UI) available at
https://rgd.dis.se/
There you can load two or more GEDCOM files and match them automatically; uncertain matches are marked as ‘manual’, and you have to decide yourself whether each one is a match.
After all manual matches are processed you can merge the two trees and export the merged tree as a new GEDCOM file.

The system has been used to merge trees with several tens of thousands of persons, so it works fairly well.
Matching is based on a kind of AI (machine learning using SVM).

I would love to have something similar as a Gramps plugin and have started to see if I could do it (with a simpler version of the matching), but I soon got stuck because my UI skills (especially Gtk etc.) and my knowledge of Gramps internals are not very good. I made some progress, but it’s not ready for use yet.

I’m still playing with this Gramps plugin from time to time but it’s slow work …

emyoulation · January 27, 2022, 3:33pm

Hi Anders… welcome!

Noticed the following in the DIS project roadmap :

Have you considered sharing an alpha version via a GitHub repository? I expect we would have volunteers to help with the GUI & Gramps internals. Whether that was ‘feedback’ or ‘coding help’ would be up to you.

ennoborg · January 27, 2022, 9:12pm

Hi Anders,

Nice! My Swedish goes no further than IKEA product names, but I discovered that when I try to imagine how the words sound, I hear enough similarities with my own language (Dutch), English, and German to get an idea of the workflow: upload a file and process it by following steps 1-3, repeat the same procedure for another file, move to step 4 for automatic matching, and then move to step 8 to download a merged GEDCOM. Is that right?

One question for now: Can the code be run in a Linux shell, without a web service?

Thanks,

Enno

andersardo · January 28, 2022, 6:11am

Hi

Hi Anders… welcome!

Thanks

Noticed the following in the DIS project roadmap :

Unfortunately, the DIS project has stopped.
But I have been playing with the ideas on and off for the last few years.

Have you considered sharing an alpha version via a GitHub repository? I expect we would have volunteers to help with the GUI & Gramps internals. Whether that was ‘feedback’ or ‘coding help’ would be up to you.

I will clean up the code a little (it’s full of commented-out experiments) and put it in my GitHub repository, if that’s OK?

And I’d be very happy to work in a collaborative effort to turn this into something useful.

andersardo · January 28, 2022, 6:31am

Hi

…

Nice! My Swedish goes no further than IKEA product names, but I discovered that when I try to imagine how the words sound, I hear enough similarities with my own language (Dutch), English, and German to get an idea of the workflow: upload a file and process it by following steps 1-3, repeat the same procedure for another file, move to step 4 for automatic matching, and then move to step 8 to download a merged GEDCOM. Is that right?

Almost right

upload your Gedcom files by selecting a local file with "Browse" and then pressing "Starta bearbetning" in steps 1-3
choose 2 uploaded Gedcoms and run "Matcha!" in step 4
choose 2 matched Gedcoms in step 5 and resolve all 'maybe' matches
merge those 2 Gedcoms in step 7
download the merged Gedcom in step 8

One question for now: Can the code be run in a Linux shell, without a web service?

Not really - but there is a built-in web server in the code. I have continued to do smaller updates and fixes to the code after DIS stopped the project, for people who are using the service in various projects.

I could look into making that version available if someone is interested.

On the other hand, I see this code as a dead end. It was developed as a proof of concept and contains a lot of experimentation, fixes, convoluted code, etc.

I would prefer to work on making a proper Gramps plugin using the good parts and experiences from the DIS project.

andersardo · January 28, 2022, 10:25am

Hi,

…

Have you considered sharing an alpha version via a GitHub repository? I expect we would have volunteers to help with the GUI & Gramps internals. Whether that was ‘feedback’ or ‘coding help’ would be up to you.

I will clean up the code a little (it’s full of commented-out
experiments) and put it in my GitHub repository, if that’s OK?
And I’d be very happy to work in a collaborative effort to turn this into
something useful.

The code, as it is, is now available at

I’m not sure it is possible to run it right now.

There is some experimentation with ways to do detailed matching
between two persons, including SVM-based matching (not integrated yet;
it also needs training examples in order to generate a good
decision model) and matching that takes ancestors into account.

I am sure it needs a lot of work to be useful!

ennoborg · January 28, 2022, 4:04pm

Not here. It complains about a missing index:

2022-01-28 16:59:00.806: ERROR: grampsapp.py: line 157: Unhandled exception
Traceback (most recent call last):
  File "/home/enno/.gramps/gramps52/plugins/Treemerge/treemerge.py", line 204, in do_match
    matcher.do_find_matches()
  File "/home/enno/.gramps/gramps52/plugins/Treemerge/match.py", line 99, in do_find_matches
    self.setup_data_structures()
  File "/home/enno/.gramps/gramps52/plugins/Treemerge/match.py", line 146, in setup_data_structures
    self.ftdb = fulltextDatabase(clean=True) # remove old and generate new ft database
  File "/home/enno/.gramps/gramps52/plugins/Treemerge/ftDatabase.py", line 16, in __init__
    os.mkdir(directory)
FileNotFoundError: [Errno 2] No such file or directory: '/home/anders/.gramps/gramps52/plugins/Treemerge/ftindex'

Can I create an empty file with that name to get started? It’s a typical thing that you won’t discover, because you probably have that file from previous runs.

andersardo · January 28, 2022, 6:38pm

Hi,
Well, yes and no.
The problem is that I’m using a hardcoded path for that directory in the file ftDatabase.py, line 14:

directory = '/home/anders/.gramps/gramps52/plugins/Treemerge/ftindex' #FIX!

so that should be changed to a relative location, probably something like

directory = os.path.abspath(os.path.dirname(__file__))
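
A minimal sketch of the relative-path idea above, assuming the index should live in an `ftindex` subfolder next to the module. The helper names `index_directory` and `ensure_directory` are hypothetical and not part of the plugin.

```python
import os


def index_directory(module_file, name="ftindex"):
    """Return the absolute path of an index folder located beside module_file."""
    base = os.path.abspath(os.path.dirname(module_file))
    return os.path.join(base, name)


def ensure_directory(path):
    """Create the folder if it is missing and return its path."""
    # exist_ok=True avoids an error when the folder is left over from an earlier run
    os.makedirs(path, exist_ok=True)
    return path


# Inside ftDatabase.py this would be called with __file__ (assumed usage):
# directory = ensure_directory(index_directory(__file__))
```

Because the path is derived from the module's own location, it no longer depends on the user name in the home directory.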

ennoborg · January 29, 2022, 11:38am

OK, I hacked that with my own name, and leave it up to you to decide whether it is a good idea to store data in a plugin folder. I don’t know whether we have guidelines for that.

I will also put an error message on GitHub, so that you can look at that. I get it, when I run the matching with a database that holds two copies of my tree, which are just slightly different.

I will also run another match on the site, using the instructions that you gave.

Note that my personal wish looks much like the original question: I would like a match like the one I had in PAF, where the 99 % that hasn’t changed is merged automatically, so that I can concentrate on the 1 %. That is similar to his wish to merge persons automatically when they are the same by name and DOB, where one has extra residence events and accompanying sources as gathered on Ancestry, and the other may have extra notes and sources added inside Gramps. Such extra info gathered on either site doesn’t normally cause conflicts, and can hence be merged automatically.

And with that done, what’s left are the persons for whom a vital date or a name has changed on either side, and the user has to decide which is right.
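
The rule described above could look roughly like this in Python. It is only an illustration under simplifying assumptions: a person is reduced to a name, a birth date and a set of extra items, and the `Person` class and both function names are hypothetical, not Gramps or plugin code.

```python
from dataclasses import dataclass, field


@dataclass
class Person:
    name: str
    birth_date: str
    # Extra items such as notes, sources or residence events (simplified to strings)
    extras: set = field(default_factory=set)


def classify_pair(a, b):
    """Return 'auto' when name and birth date agree, otherwise 'manual'."""
    if a.name == b.name and a.birth_date == b.birth_date:
        return "auto"
    return "manual"


def merge_extras(a, b):
    """Combine the extra items of two records that were classified as 'auto'."""
    return Person(a.name, a.birth_date, a.extras | b.extras)
```

Pairs classified as 'manual' are the ones where a name or vital date differs, so the user still decides which side is right.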

emyoulation · January 29, 2022, 3:45pm

The usual thing would be to use either:

the OS temp directory (so that normal housekeeping will eventually dispose of it), or
the Database folder defined in Preferences. (However, that should probably use the same seeded foldername generator as Gramps. We’ll need to find a developer reference to that for you.)
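
For the first option, a minimal sketch using Python's standard `tempfile` module. The prefix and the function names are assumptions for illustration only.

```python
import os
import shutil
import tempfile


def make_temp_index_dir(prefix="treemerge_ftindex_"):
    """Create a fresh, uniquely named folder under the OS temp directory."""
    return tempfile.mkdtemp(prefix=prefix)


def remove_index_dir(path):
    """Delete the folder and its contents; ignore it if already gone."""
    shutil.rmtree(path, ignore_errors=True)
```

`tempfile.mkdtemp` does not clean up by itself, so the plugin would either remove the folder when it is done or leave that to the operating system's housekeeping.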

andersardo · January 29, 2022, 4:05pm

Hi,

…

I will also put an error message on GitHub, so that you can look at that. I get it, when I run the matching with a database that holds two copies of my tree, which are just slightly different.

Commented on GitHub.

I will also run another match on the site, using the instructions that you gave.

If you need any help please contact me directly.

Note that my personal wish looks much like the original question: I would like a match like the one I had in PAF, where the 99 % that hasn’t changed is merged automatically, so that I can concentrate on the 1 %. That is similar to his wish to merge persons automatically when they are the same by name and DOB, where one has extra residence events and accompanying sources as gathered on Ancestry, and the other may have extra notes and sources added inside Gramps. Such extra info gathered on either site doesn’t normally cause conflicts, and can hence be merged automatically.

Please note that automatic merging is not implemented in the plugin;
it just runs the “standard” Gramps merge persons. So far, at least:
it’s on my TODO list.

ennoborg · February 2, 2022, 7:04pm

I know, and I also know that loads of people would love to have something like that. I used it in PAF long ago, where it was based on UIDs, and I saw the research report that you wrote in 2015.

grocanar · February 9, 2022, 1:48pm

Hi
That plugin seems very promising, and I would really like to help in any way I can.
It could be a cool replacement for find duplicate people.

I have just installed it, but I must have missed something:
I don’t see a way to choose the two databases whose people I want to compare.

ennoborg · February 9, 2022, 3:16pm

That’s right. The current version works just like the standard tool to find duplicates, meaning that you have to import the other database into the main one. And given that it’s still experimental, and it does not do automatic merges of identical people and families, you should really test this on a new database, and not on your working tree.

So, if you want to help to test the algorithm, what you need to do is to create a new database, and then import the two that you want to test the tool on.

grocanar · February 9, 2022, 3:47pm

Hi
Well, as I would like to test it as a kind of replacement for ‘find duplicate people’, I don’t care about the two databases.
But I’m still puzzled:
I know I have duplicates in the database I opened, but nothing happened; the list of people remains empty.
How can I force the generation of the text database?
Is this done during import?

grocanar · February 9, 2022, 3:56pm

Forget it, I should have clicked on match.

andersardo · February 9, 2022, 4:49pm

Hi,

I’ve just switched from Gramps score-based matching to SVM-based
classification matching (SVM is a machine learning technique for
classification).

I’ve done a preliminary model using a few examples of good and bad
matches but it needs to be refined using more examples (I’m working on
it).

Next (big) items to work on are:

Add a possibility to select the algorithm (Gramps/score-based/SVM). I got stuck trying to understand how to use the tool.ToolOptions class together with Gtk.
Automatic merging.
Match two Gramps databases (instead of loading two Gedcom files into one database). Is this possible/feasible in current Gramps?

Great!
What are you most interested/knowledgeable in?

ennoborg · February 9, 2022, 8:08pm

Possible yes. We already have a diff and merge that works on person handles, and I’d love to see one that uses UIDs.

emyoulation · February 10, 2022, 7:47am

Gramps has Preferences options (in the General tab) to Tag or Source a batch of imported records.

Perhaps the list of Matches could have a switch to limit/filter results where at least one record has an Import Tag or a Source? (Can’t rely on modification Date filtering since Import preserves the epoch change date if the file provides that for each record.)

(Note: around line 1539, clipboard.py has a context menu option to create a Custom Filter related to the selected clipboard object. Although you cannot Clipboard a Tag, you can do so with a Source. Perhaps you could reference the clipboard module for examples of creating a custom filter?)

(Also Note: new Tags added default to Black and lowest Priority. But the Views will color code by the color of the highest priority Tag. Perhaps you could push a top priority & contrasting color to the Import tag and use that to highlight Imported records in the Compare list?)

emyoulation · February 10, 2022, 7:54am

github.com/andersardo/gramps_Treemerge/blob/main/README.md

# gramps_Treemerge
A gramps plugin to merge data from 2 trees

# DESIGN

  * Generate a text-representation of a person (possibly including parents) with names, dates, places
      and index that in a free-text database
  * Use a person text-representation as a query to the free-text database
  * Test the top X results more detailed for a possible match

The above design avoids the need to compare all persons to all other persons thus cutting the algorithm complexity from
n-squared to X * n.
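
The candidate-retrieval idea above can be sketched in a few lines of Python. This is only an illustration, assuming a simple in-memory inverted index in place of the free-text database; the function names and the ranking by number of shared tokens are hypothetical and not taken from the plugin.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Split a person's text representation into lower-case tokens."""
    return text.lower().split()


def build_index(records):
    """Map each token to the set of record ids whose text contains it."""
    index = defaultdict(set)
    for rec_id, text in records.items():
        for token in set(tokenize(text)):
            index[token].add(rec_id)
    return index


def top_candidates(index, records, rec_id, x=3):
    """Return up to x other record ids sharing the most tokens with rec_id."""
    counts = Counter()
    for token in set(tokenize(records[rec_id])):
        for other in index.get(token, ()):
            if other != rec_id:
                counts[other] += 1
    return [other for other, _ in counts.most_common(x)]
```

Only the returned candidates are then compared in detail, so each person is tested against at most X others instead of against everyone.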

Matches can be grouped in 3 categories 'certain match', 'maybe', 'certain nomatch' where only 'maybe'
needs to be inspected manually.
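
The three categories could be derived from a similarity score with two cut-offs. A minimal sketch, assuming a score between 0 and 1; the function name and the threshold values are made up for illustration and are not the plugin's.

```python
def categorize(score, low=0.3, high=0.8):
    """Map a similarity score to one of the three match categories."""
    if score >= high:
        return "certain match"
    if score <= low:
        return "certain nomatch"
    # Everything between the two cut-offs is left for manual inspection
    return "maybe"
```

Moving the two thresholds closer together reduces the number of 'maybe' pairs at the cost of more wrong automatic decisions.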

The implementation borrows a lot from GraphView and Gramps 'Find Possible Duplicate People'.

## TODO/IDEAS

This file has been truncated.
