OK, I understand. And as far as I’m concerned, there are at least two scenarios:
- When I upload my own tree to a site like Ancestry, it will have UIDs, because they were supplied by the program that I used before Gramps, which is PAF, and because I added some code to add UIDs to all persons that I created in Gramps.
Now, when I add information on Ancestry, for example by confirming hints, only a small percentage of my data changes. That means that when I download a GEDCOM, 99 % of the persons are unchanged, and if I assume that added persons get unique UIDs, I can be sure that all persons can be matched on UID. Any person with identical data can then be merged, or maybe even ignored if the merge is with another database instead of being done inside my tree. It’s only when there is conflicting data that I want to make a decision, and maybe when one person has more events than the other. I may also want automatic merges of extra attributes, like the IDs used by Ancestry and FamilySearch.
If 99 % of the persons are unchanged, I don’t want to be bothered about them, which means that I demand that the algorithm sees all of them as exact matches. And that is a requirement, because there are persons that only have a small vector to compare on, like only a name, without a known birth or death date, for whom I know their position in my tree.
When only 90 % of these are detected as exact matches, it means that I have to decide for 10 % of the persons, even though I know (and can test with RootsMagic) that only 1 % of the persons have a relevant change. In that case, 90 % of my merge actions are a waste of time, which means that I will not use the tool, but rely on the report that I get from RootsMagic.
- When I receive a tree from a friend or relative, it will normally not have matching UIDs, unless that person received some data from me earlier. In this case, I have no idea how many people match. In the case of a relative, however, there is quite a big chance that his/her tree has a cluster of persons that match, like a few generations of common ancestors. This means that, IMO, SVM works best when it can also process relations, and maybe even that the matching can be seeded with IDs of persons for whom I know that they match, like that common ancestor. In that case, for all other persons, their path to that common person can be part of the vector.
Would such an approach be possible? It can make the algorithm much faster, and I know examples on GitHub, where the algorithm may also tell where a new person connects to an existing one.
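To make the seeding idea from the second scenario a bit more concrete, here is a rough sketch in Python. It is only an illustration under my own assumptions, not Gramps code: the tree layout (a dict with `"F"`, `"M"` and `"C"` links), the function names `paths_from_seed` and `candidates_by_path`, and all IDs are made up. It walks both trees from the known common person and proposes candidates that sit on the same relationship path.

```python
from collections import deque


def paths_from_seed(tree, seed):
    """Breadth-first walk from the seed; returns {person_id: path}.

    `tree` maps a person id to {"F": father_id, "M": mother_id, "C": [child ids]}
    (a hypothetical layout). A path is a string of steps, e.g. "FM" is the
    father's mother of the seed.
    """
    paths = {seed: ""}
    queue = deque([seed])
    while queue:
        current = queue.popleft()
        links = tree.get(current, {})
        steps = [("F", links.get("F")), ("M", links.get("M"))]
        steps += [("C", child) for child in links.get("C", [])]
        for step, other in steps:
            if other is not None and other not in paths:
                paths[other] = paths[current] + step
                queue.append(other)
    return paths


def candidates_by_path(tree_a, seed_a, tree_b, seed_b):
    """For each person in tree A, list the persons in tree B on the same path."""
    by_path_b = {}
    for pid, path in paths_from_seed(tree_b, seed_b).items():
        by_path_b.setdefault(path, []).append(pid)
    paths_a = paths_from_seed(tree_a, seed_a)
    return {pid: by_path_b.get(path, []) for pid, path in paths_a.items()}


# My tree: I2 is the common ancestor, I4 his father, I1 his child, I3 the child's mother.
mine = {
    "I1": {"F": "I2", "M": "I3"},
    "I2": {"F": "I4", "C": ["I1"]},
    "I3": {"C": ["I1"]},
    "I4": {"C": ["I2"]},
}
# The relative's tree: X9 is the same ancestor, with two children and a father.
theirs = {
    "X9": {"F": "X8", "C": ["X7", "X6"]},
    "X8": {"C": ["X9"]},
    "X7": {"F": "X9"},
    "X6": {"F": "X9"},
}
candidates = candidates_by_path(mine, "I2", theirs, "X9")
```

Ancestor steps are unique, so `I4` gets exactly one candidate. Child steps are not, so `I1` gets two candidates, and that is where the name and date comparison would still have to decide. A person with no candidate on the same path, like `I3` here, is probably not in the other tree at all.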
For use case 1, I prefer a match with an external database, like in the Import and Merge tool:
https://gramps-project.org/wiki/index.php/Addon:Import_Merge_Tool
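For use case 1, the behaviour that I have in mind can be sketched like this in Python. Again, this is a minimal illustration under my own assumptions and not how the Import and Merge tool works: the `Person` class, its fields, the four category names and the sample data are all hypothetical. The point is that a person with the same UID and identical data never reaches me, whatever the size of its vector.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    uid: str
    name: str
    birth: str = ""
    death: str = ""
    attributes: tuple = ()  # hypothetical, e.g. (("Ancestry ID", "A-42"),)


def classify_by_uid(mine, theirs):
    """Sort incoming persons into identical, auto_merge, conflict and new."""
    mine_by_uid = {p.uid: p for p in mine}
    result = {"identical": [], "auto_merge": [], "conflict": [], "new": []}
    for p in theirs:
        old = mine_by_uid.get(p.uid)
        if old is None:
            # UID not in my tree: a person that was added elsewhere.
            result["new"].append(p.uid)
        elif (old.name, old.birth, old.death) != (p.name, p.birth, p.death):
            # Same UID, different core data: this needs my decision.
            result["conflict"].append(p.uid)
        elif set(p.attributes) - set(old.attributes):
            # Only extra attributes, like external IDs: merge without asking.
            result["auto_merge"].append(p.uid)
        else:
            # Exact match, even when there is only a name to compare on.
            result["identical"].append(p.uid)
    return result


mine = [
    Person("U1", "Jan Jansen", "1850"),
    Person("U2", "Piet Jansen"),
    Person("U3", "Anna de Vries", "1880", "1950"),
]
theirs = [
    Person("U1", "Jan Jansen", "1850"),
    Person("U2", "Piet Jansen", attributes=(("Ancestry ID", "A-42"),)),
    Person("U3", "Anna de Vries", "1881", "1950"),
    Person("U4", "Kees Jansen"),
]
result = classify_by_uid(mine, theirs)
```

With this sample data, only `U3` ends up in the `conflict` list. `U2` has nothing but a name, and still counts as matched because the UID and the name agree.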