Hi,
I’ve been working on this “Merging data from two databases” problem on and off. There is an early prototype (unfortunately only with a Swedish UI) available at https://rgd.dis.se/
There you can load 2 or more Gedcom files and match them automatically. Uncertain matches are marked as ‘manual’, and for those you have to decide yourself whether it’s a match or not.
After all manual matches are processed you can merge the 2 trees and export the merged tree as a new Gedcom-file.
The system has been used to merge trees with several tens of thousands of persons, so it works fairly well.
Matching is based on a kind of AI (machine learning using SVM).
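To illustrate the split between automatic and ‘manual’ matches described above, here is a minimal sketch. It is not the SVM model that the site uses: it swaps in a plain similarity score, and the record fields, the weights and the two thresholds are all assumptions made up for the example.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds, chosen only for illustration.
AUTO_MATCH = 0.90
AUTO_REJECT = 0.60


def similarity(a, b):
    """Return a 0..1 similarity between two person records (plain dicts)."""
    name = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    # Birth years count fully when equal and not at all when 10+ years apart.
    year = max(0.0, 1.0 - abs(a["birth_year"] - b["birth_year"]) / 10.0)
    return 0.7 * name + 0.3 * year


def classify(a, b):
    """Sort a candidate pair into 'match', 'manual' or 'no match'."""
    score = similarity(a, b)
    if score >= AUTO_MATCH:
        return "match"
    if score < AUTO_REJECT:
        return "no match"
    return "manual"  # uncertain: left for the user to decide
```

Only the pairs that land in the middle band would be shown to the user for a decision.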
I would love to have something similar as a Gramps plugin and have started to see if I could do it (with a simpler version of the matching), but I soon got stuck because my UI skills (especially Gtk etc.) and my knowledge of Gramps internals are not very good. I got part of the way, but it’s not ready for use yet.
I’m still playing with this Gramps plugin from time to time but it’s slow work …
Noticed the following in the DIS project roadmap :
Have you considered sharing an alpha version via a GitHub repository? I expect we would have volunteers to help with the GUI & Gramps internals. Whether that was ‘feedback’ or ‘coding help’ would be up to you.
Nice! My Swedish goes no further than IKEA product names, but I discovered that when I try to imagine how words sound, I hear enough similarities with my own language (Dutch), English, and German, to get an idea of the workflow, which is uploading a file, and process it, following the steps in paragraph 1-3, and then following the same procedure for another file, before moving to 4, for automatic matching, and then move to 8 to download a merged GEDCOM. Is that right?
One question for now: Can the code be run in a Linux shell, without a web service?
Noticed the following in the DIS project roadmap :
Unfortunately the DIS project has stopped
But I have been playing with the ideas on and off for the last few years.
Have you considered sharing an alpha version via a GitHub repository? I expect we would have volunteers to help with the GUI & Gramps internals. Whether that was ‘feedback’ or ‘coding help’ would be up to you.
I will clean up the code a little (it’s full of commented-out experiments) and put it in my GitHub repository, if that’s OK?
And I’d be very happy to work in a collaborative effort to turn this into something useful.
Nice! My Swedish goes no further than IKEA product names, but I discovered that when I try to imagine how words sound, I hear enough similarities with my own language (Dutch), English, and German, to get an idea of the workflow, which is uploading a file, and process it, following the steps in paragraph 1-3, and then following the same procedure for another file, before moving to 4, for automatic matching, and then move to 8 to download a merged GEDCOM. Is that right?
Almost right
upload your Gedcom files by selecting a local file with "Browse" and then press "Starta bearbetning" in steps 1-3
choose 2 uploaded Gedcoms and run "Matcha!" in step 4
choose 2 matched Gedcoms in step 5 and resolve all uncertain ("maybe") matches
merge those 2 Gedcoms in step 7
download the merged Gedcom in step 8
One question for now: Can the code be run in a Linux shell, without a web service?
Not really - but there is a built-in web server in the code. I have continued to do smaller updates and fixes to the code after DIS stopped the project, for people who are using the service in various projects.
I could look into making that version available if someone is interested.
On the other hand, I see this code as a dead end: it was developed as a proof of concept and contains a lot of experimentation, fixes, convoluted code, etc.
I would prefer to work on making a proper Gramps plugin using the good parts and experiences from the DIS project.
Have you considered sharing an alpha version via a GitHub repository? I expect we would have volunteers to help with the GUI & Gramps internals. Whether that was ‘feedback’ or ‘coding help’ would be up to you.
I will clean up the code a little (it’s full of commented-out
experiments) and put it in my GitHub repository, if that’s OK?
And I’d be very happy to work in a collaborative effort to turn this into
something useful.
The code, as it is, is now available at
I’m not sure it is possible to run it right now
There is some experimentation with ways to do detailed matching
between 2 persons, including SVM-based matching (not integrated yet;
it also needs training examples in order to generate a good
decision model) and matching that takes ancestors into account.
2022-01-28 16:59:00.806: ERROR: grampsapp.py: line 157: Unhandled exception
Traceback (most recent call last):
  File "/home/enno/.gramps/gramps52/plugins/Treemerge/treemerge.py", line 204, in do_match
    matcher.do_find_matches()
  File "/home/enno/.gramps/gramps52/plugins/Treemerge/match.py", line 99, in do_find_matches
    self.setup_data_structures()
  File "/home/enno/.gramps/gramps52/plugins/Treemerge/match.py", line 146, in setup_data_structures
    self.ftdb = fulltextDatabase(clean=True) # remove old and generate new ft database
  File "/home/enno/.gramps/gramps52/plugins/Treemerge/ftDatabase.py", line 16, in __init__
    os.mkdir(directory)
FileNotFoundError: [Errno 2] Bestand of map bestaat niet: '/home/anders/.gramps/gramps52/plugins/Treemerge/ftindex'
Can I create an empty file with that name to get started? It’s a typical thing that you won’t discover, because you probably have that file from previous runs.
Hi,
Well yes and no:
The problem is that I’m using a hardcoded path for that directory in the file
ftDatabase.py line 14
directory = '/home/anders/.gramps/gramps52/plugins/Treemerge/ftindex' #FIX!
so that should be changed to a relative location, probably something like
directory = os.path.abspath(os.path.dirname(__file__))
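A slightly fuller sketch of that idea, assuming the index should live in an `ftindex` subfolder next to the module. The helper name `index_directory` is made up for the example and is not part of the plugin. Using `os.makedirs` with `exist_ok=True` instead of `os.mkdir` also avoids the `FileNotFoundError` from the traceback when a parent folder is missing.

```python
import os


def index_directory(module_file, name="ftindex"):
    """Return an index directory next to module_file, creating it if needed."""
    base = os.path.abspath(os.path.dirname(module_file))
    directory = os.path.join(base, name)
    # Creates missing parent folders and does not fail if the folder exists.
    os.makedirs(directory, exist_ok=True)
    return directory


# Inside ftDatabase.py this would be called as: index_directory(__file__)
```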
OK, I hacked that with my own name, and leave it up to you to decide whether it is a good idea to store data in a plugin folder. I don’t know whether we have guidelines for that.
I will also put an error message on GitHub, so that you can look at that. I get it, when I run the matching with a database that holds two copies of my tree, which are just slightly different.
I will also run another match on the site, using the instructions that you gave.
Note that my personal wish looks much like the original question, meaning that I also like to have a match like I had in PAF, meaning that I like the 99 % that hasn’t changed to be merged automatically, and concentrate on the 1 %. That is similar to his wish to merge persons automatically when they are the same by name and DOB, and one has extra residence events and accompanying sources as gathered on Ancestry, and another may have extra notes and sources added inside Gramps. Such kinds of extra info gathered on either site don’t normally cause conflicts, and can hence be merged automatically.
And with that done, what’s left are the persons for whom a vital date or a name has changed on either side, and the user has to decide which is right.
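A minimal sketch of that wish, using plain dictionaries rather than Gramps objects. The field names and both helper functions are assumptions for illustration only; the plugin does not work this way today. Two records are merged automatically only when name and birth date are identical, and the extra events, notes and sources from either side are combined.

```python
def can_auto_merge(a, b):
    """Records qualify when name and birth date are identical."""
    return a["name"] == b["name"] and a["birth"] == b["birth"]


def auto_merge(a, b):
    """Combine the extra information of two records that agree on key facts."""
    if not can_auto_merge(a, b):
        raise ValueError("name or birth date differs: needs a manual decision")
    merged = {"name": a["name"], "birth": a["birth"]}
    for field in ("events", "notes", "sources"):
        combined = []
        # Keep the original order and drop exact duplicates.
        for item in a.get(field, []) + b.get(field, []):
            if item not in combined:
                combined.append(item)
        merged[field] = combined
    return merged
```

Pairs that raise `ValueError` here are the ones that would be left for the user to resolve.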
the OS temp directory (so that normal housekeeping will eventually dispose of it), or
the Database folder defined in Preferences. (However, that should probably use the same seeded foldername generator as Gramps. We’ll need to find a developer reference to that for you.)
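The first of these two options could look roughly like this. It is only a sketch: the folder name `treemerge_ftindex` and the helper name are assumptions, and the second option is left out because the Gramps folder-name generator still has to be looked up.

```python
import os
import tempfile


def temp_index_directory(name="treemerge_ftindex"):
    """Return an index directory under the OS temp directory, creating it if needed."""
    directory = os.path.join(tempfile.gettempdir(), name)
    os.makedirs(directory, exist_ok=True)
    return directory
```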
I will also put an error message on GitHub, so that you can look at that. I get it, when I run the matching with a database that holds two copies of my tree, which are just slightly different.
Commented on GitHub
I will also run another match on the site, using the instructions that you gave.
If you need any help please contact me directly.
Note that my personal wish looks much like the original question, meaning that I also like to have a match like I had in PAF, meaning that I like the 99 % that hasn’t changed to be merged automatically, and concentrate on the 1 %. That is similar to his wish to merge persons automatically when they are the same by name and DOB, and one has extra residence events and accompanying sources as gathered on Ancestry, and another may have extra notes and sources added inside Gramps. Such kinds of extra info gathered on either site don’t normally cause conflicts, and can hence be merged automatically.
Please note that automatic merging is not implemented in the plugin,
it just runs the “standard” Gramps merge persons. So far at least;
it’s on my TODO list.
I know, and I also know that loads of people would love to have something like that. I used it in PAF long ago, where it was based on UIDs, and I saw the research report that you wrote in 2015.
That’s right. The current version works just like the standard tool to find duplicates, meaning that you have to import the other database into the main one. And given that it’s still experimental, and it does not do automatic merges of identical people and families, you should really test this on a new database, and not on your working tree.
So, if you want to help to test the algorithm, what you need to do is to create a new database, and then import the two that you want to test the tool on.
Hi
Well, as I would like to test it as a kind of replacement for ‘Find duplicate people’, I don’t care about the two databases.
But I’m still puzzled:
I know I have duplicates in the database I open, but nothing happens; the list of people remains empty.
How can I force the generation of the text database?
Is this done at import?
I’ve just switched from Gramps score-based matching to SVM-based
classification matching (SVM is a machine learning technique for
classification).
I’ve done a preliminary model using a few examples of good and bad
matches but it needs to be refined using more examples (I’m working on
it).
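For readers unfamiliar with the approach, here is a rough sketch of what classification-based matching looks like once a linear model has been trained: each pair of persons is turned into a feature vector, and the sign of a weighted sum decides the class. The features, weights and bias below are made up for illustration; they are not the plugin’s model, and a real SVM would learn them from labelled good and bad matches.

```python
from difflib import SequenceMatcher


def features(a, b):
    """Turn a pair of person records (plain dicts) into a feature vector."""
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    year_gap = abs(a["birth_year"] - b["birth_year"])
    same_place = 1.0 if a["birth_place"] == b["birth_place"] else 0.0
    return [name_sim, year_gap, same_place]


# Made-up weights and bias; a trained model would supply these.
WEIGHTS = [4.0, -0.5, 1.0]
BIAS = -3.0


def decision_value(x):
    """Linear decision function: positive means 'match'."""
    return sum(w * v for w, v in zip(WEIGHTS, x)) + BIAS


def is_match(a, b):
    return decision_value(features(a, b)) > 0
```

Values close to zero are the uncertain cases, which is where more training examples help most.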
Next (big) items to work on are:
Add a possibility to select algorithm (Gramps/score-based/SVM)
I got stuck trying to understand how to use the tool.ToolOptions
class together with Gtk.
Automatic merging
Match two Gramps databases (instead of loading two Gedcom-files in
one database).
Is that possible/feasible in current Gramps?
Great!
What are you most interested/knowledgeable in?
Gramps has Preferences options (in the General tab) to Tag or Source a batch of imported records.
Perhaps the list of Matches could have a switch to limit/filter results where at least one record has an Import Tag or a Source? (Can’t rely on modification Date filtering since Import preserves the epoch change date if the file provides that for each record.)
(Note: around line 1539, clipboard.py has a context menu option to create a Custom Filter related to the selected clipboard object. Although you cannot Clipboard a Tag, you can do so with a Source. Perhaps you could reference the clipboard module for examples of creating a custom filter?)
(Also Note: new Tags added default to Black and lowest Priority. But the Views will color code by the color of the highest priority Tag. Perhaps you could push a top priority & contrasting color to the Import tag and use that to highlight Imported records in the Compare list?)
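The filter switch suggested above could be sketched as follows. This uses plain dictionaries and tag names instead of Gramps objects and tag handles, so the data layout and both function names are assumptions for illustration only.

```python
def involves_import(match, import_tag):
    """True when at least one person in the pair carries the import tag."""
    first, second = match
    return import_tag in first["tags"] or import_tag in second["tags"]


def filter_matches(matches, import_tag):
    """Keep only the candidate pairs that involve an imported record."""
    return [m for m in matches if involves_import(m, import_tag)]
```

In the plugin itself the same test would have to be done against the tags stored on each person record.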