Find duplicates feature is too slow

I ran the feature and it has now been running for about 2 hours. I cannot stop it; I just get the message “Please dont force closing…”. So Gramps is blocked.
And I have only 9,700 people. I turned on soundex.

How many combinations are being compared? Each against each? That would be about 9700 × 9700 = 94,090,000 combinations. Does the script skip comparisons between men and women? Does it recalculate the soundex for each of those 94,090,000 comparisons? I think each per-person calculation should be done only once.
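
As a rough sanity check (assuming matching is symmetric, so each pair only needs to be checked once; this is not necessarily what the tool actually does):

```python
n = 9700
ordered = n * n               # 94,090,000 comparisons if A-B and B-A are both checked
unordered = n * (n - 1) // 2  # 47,040,150 unique pairs if each pair is checked once
print(ordered, unordered)
```
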
So the main question is: is this feature’s script optimized? Maybe it is fine as it is and impossible to make faster. Could you please check? Or maybe somebody already knows why this happens. Thanks a lot.

It is slow, but 2 hours is crazy. It takes a few minutes on my tree, which is about 50% larger than yours.

For quick results, I suggest that you export a GEDCOM and import that into RootsMagic Essentials. It’s free, and finds duplicates fast.

https://rootsmagic.com/Try/RootsMagic/

Additional thoughts and questions on this topic:

  1. Are people compared who clearly lived in different generations, or who have different exact birth or death dates? Perhaps it doesn’t make sense to compare such individuals?
  2. Maybe it makes sense to add filters? For example, I could filter all residents of a single locality by tag, so I wouldn’t need to compare every record in my database.
  3. Are hash tables used to store the calculations for each person? (A sketch of this idea follows the list.)
  4. Perhaps it makes sense to calculate such tables gradually in the background when Gramps is not actively being used?
  5. Is it possible to run the calculations in parallel in multiple threads? I have 16 cores, but only one is being used.
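
On point 3, a minimal sketch of what such a precomputed table could look like. The `soundex` helper and the `sex`/`surname` attributes here are stand-ins, not the actual Gramps API:

```python
from collections import defaultdict

def build_buckets(people, soundex):
    """Compute each person's key exactly once and bucket people by it,
    so a later pass only compares people inside the same bucket."""
    buckets = defaultdict(list)
    for person in people:
        key = (person.sex, soundex(person.surname))  # one calculation per person
        buckets[key].append(person)
    return buckets
```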

I can’t answer these either, and I thought that you ran Windows. But even though I see Linux, I still recommend that you try RootsMagic to see how fast it can be. It runs quite well in Wine, up to version 7 (of Wine).

Got it. Thank you for the advice, I will try it. At least I have now received the calculated result: hundreds or thousands of rows with match ratings. I checked several matches with the highest rating and several with the lowest, and none of them are anywhere close to being duplicates. That may be acceptable, but now it is really interesting to try the same thing in other software. One thing I am sure of now: if another app can do this fast, it means this Gramps feature has great potential to be much faster.

This is possible using multithreading. I have used multithreading in my programs before, and it is actually pretty intuitive to implement in Python. I’ll add it to my personal “development ideas” list, and if it hasn’t been implemented by then, I will eventually get around to it.
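
One caveat: for CPU-bound work like this, CPython’s GIL means that real speed-ups usually require multiple processes rather than threads. A minimal sketch under that assumption, with a hypothetical picklable `compare_pair()` scoring function:

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import combinations

def score_all_pairs(people, compare_pair, workers=16):
    """Fan the pair comparisons out over a pool of worker processes."""
    pairs = list(combinations(people, 2))  # each unordered pair exactly once
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # compare_pair receives one (a, b) tuple per call
        scores = pool.map(compare_pair, pairs, chunksize=1000)
    return list(zip(pairs, scores))
```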

That’s right. I have a reference tree with more than 600,000 persons, and even in that, RootsMagic can find duplicates in a few minutes. And that’s not the only place where Gramps is slow, because I have the same problem with deep connections in Gramps, which needs hours to find such things in that tree.

Some say that Gramps is slower because it’s based on an interpreter, and RootsMagic uses compiled code, but I don’t believe that’s the real cause, because when it comes to deep connections, Geneanet can find several in a second, and their software, GeneWeb, runs in an interpreter too.

Multithreading can sometimes help, but even if you could speed up Gramps by a factor of 16, using all the cores that you and I have, it would still be much slower than I’m willing to accept.

This is why my conclusion is that the real problem is in the algorithm. And I know that, because a site like Geni can calculate deep connections in under a minute, for a tree that has 189,445,077 persons.

Another question: Do you have many names that look similar? Or many persons without dates?

Currently, about 25% of my people have no dates. And yes, I think I have a lot of similar names (surnames). Additionally, I don’t use alternative names yet; all my surnames and middle names are listed as “preferred name” surnames. Maybe that approach could cause this behaviour.

Maybe so, and if the name on your screen is typical, it also suggests that your names are much longer, in the sense of the number of words. It might very well be that, on average, they’re twice as long as the names that I have.

In my tree, many people have either a family name or a patronymic name, which means that their surnames are most often one word, except in cases where there are prefixes like ‘van’ or ‘van der’. Some of my noble relatives have more, but they’re a minority. Most people also have only one or two given names, which leads to the conclusion that in my database, the average name length is 3 or 4 words.

Another difference is that, in my tree, there are not many inflections, because many of those disappeared after 1812, when patronymic names became redundant.

The last thing that I’m thinking about is a difference in hardware, but since you have 16 cores, your hardware is probably pretty powerful anyway. I have an HP Envy with an AMD Ryzen 7, 16 GB RAM, and Linux Mint and Windows 11 running on an Intel SSD that also stores the Gramps data.

What about yours?

I tested it on:
Ubuntu 22.04.2 LTS
CPU: 8-core Intel Core i9-9900KF
Memory: 4 × 8 GiB DDR4 DIMM, 2667 MHz (0.4 ns)
Storage: Samsung SSD 980 500GB

I also tried RootsMagic on macOS:
MacBook Pro
Processor: 2.8 GHz Intel Core i7
Memory: 16 GB

And the calculation speed varies from 1 second to 45+ minutes, depending on the initial settings. But RootsMagic has at least 2 advantages:

  1. I can cancel the comparison
  2. Right after cancelling, I receive a list of the results found so far

I think this is something we could make better in the Gramps find duplicates tool.
One more thing: I think this feature needs more settings, to produce results that better match user expectations. Like these (a rough sketch follows the list):

  1. a setting for how to compare people when both dates are complete. For example, in my case, people with different complete birth or death dates are 99.99% certain to be different people.
  2. a setting for how to compare surnames. In my case, only women can change their surnames, so if two men have different surnames they are also 99.99% certain not to be duplicates.
  3. analyse parents: different parents mean that the people being compared are not duplicates. This is easy to say, but not so easy to implement.
  4. birth or death places. In my case, all places are taken from documents, so I can use places when comparing for duplicates.
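
A rough sketch of what pre-filters like 1 and 2 could look like; the `birth_date`, `death_date`, `sex`, and `surname` attributes are hypothetical stand-ins, not the actual Gramps API:

```python
def may_be_duplicates(a, b):
    """Cheap rejection tests, run before any expensive name comparison."""
    # 1. Two different complete dates almost certainly mean different people.
    if a.birth_date and b.birth_date and a.birth_date != b.birth_date:
        return False
    if a.death_date and b.death_date and a.death_date != b.death_date:
        return False
    # 2. Where only women change surnames, two men with different
    #    surnames are almost certainly not duplicates.
    if a.sex == b.sex == "M" and a.surname != b.surname:
        return False
    return True  # only now run the full comparison
```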

We could brainstorm and think about how to make this tool more exact and maybe faster.

There is only one Gramps feature (that I am aware of) that has a progress bar with a functional Cancel. (That’s the Test run button in the FilterParams addon.) All the rest of the progress bars say the operation is too important to be interrupted. More bailout points sure would be nice, though.

And being able to return partial results sounds interesting … but also suggests lots of complications.

I see. But do you agree that first-time users don’t expect to wait several hours for results, and would probably prefer to stop it and continue working with Gramps? But they have only 3 options:

  1. wait
  2. force-stop Gramps
  3. reboot the PC

This is only my opinion, I don’t insist :wink:.

In the first pass we group people by sex and surname. The soundex is calculated once per person, if selected. Only the main surname parts of the primary name are used - prefixes and connectors are ignored.

Then we loop through each person again and only match them to people in the same sex/surname group. It looks like we may be matching twice though - person A is matched against person B and then later person B is matched against person A.

Using LRU caches and changing the order of the second loop would significantly improve performance.
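
In outline, those two fixes could look roughly like this; `soundex()` and `compare()` are stand-ins for the real Gramps internals:

```python
from collections import defaultdict
from functools import lru_cache
from itertools import combinations

@lru_cache(maxsize=None)
def cached_soundex(surname):
    # Many people share a surname, so caching by the string means
    # each distinct surname is encoded only once.
    return soundex(surname)

def find_duplicates(people, compare):
    # First pass: group people by sex and surname soundex.
    groups = defaultdict(list)
    for person in people:
        groups[(person.sex, cached_soundex(person.surname))].append(person)

    # Second pass: combinations() yields (A, B) but never (B, A),
    # so each pair is scored once instead of twice.
    matches = []
    for group in groups.values():
        for a, b in combinations(group, 2):
            matches.append((a, b, compare(a, b)))
    return matches
```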

I see no problem in adding a “Cancel” button and allowing the user to exit the search loop early. As we just add matches to a list, we could still display the results so far.
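
A rough shape for that; the `is_cancelled` callback stands in for whatever signal the UI provides:

```python
def search(pairs, compare, is_cancelled):
    """Stop early when the user cancels, keeping the matches found so far."""
    matches = []
    for a, b in pairs:
        if is_cancelled():   # cheap bailout point, checked once per pair
            break
        score = compare(a, b)
        if score > 0:
            matches.append((a, b, score))
    return matches           # partial results can still be displayed
```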

In looking through the code, I also noticed a problem with place names. The place displayer is not used; instead, the contents of the title field are used, and that field is likely to be empty.

Would performance also be improved by looking for duplicates of the people matched by a Custom Filter?

For instance, after importing a twig from another source, look for duplicates of those imports in the main tree.

Or after merging a specific duplicate, search the merged person’s Ancestors/Descendants to zip up the Family duplicates (preferably in increasing degrees of separation order).

Sometimes a distant cousin provides their pedigree to a common Ancestor. So it would be helpful to quickly stitch duplicates in that data into the existing Tree. (This is so Gramps can be used to coordinate research in attacking our common brick walls.)

@Nick-Hall thank you for the research!

Wow! Looks like it can save a lot of time.

It sounds great!

How does the focus on the main surname parts work out for Russian (or Ukrainian) names as shown by @Urchello? I can imagine that it doesn’t work too well for those, and maybe for names from other cultures too.

I’m assuming that these name parts are also the ones in the index. Is that right?

If so, it might partly explain why RootsMagic is so much faster, since it has all name parts in a table with an index, and without the need to extract pickled data. And if that’s a valid reason, adding more names to the Gramps indexes may speed up things quite a bit too.

Yes, the merging duplicates tool is slow, but I find it only mildly irritating as it is not a frequent action. More irritating is the fact that it does not merge identical events and notes of a merged person, i.e. you end up with two birth and two death events and a duplicated note for a merged person. The tool should at least add a tag or similar to the duplicates, so one can easily clean up.

That’s right, and I know other programs that do this, like RootsMagic, at least for events.

Another thing is that, when events are not exactly the same, I often run into situations where I know that the original is right, in the sense that I checked that particular birth and have a source that is attached either to the event or to the person. This is often the case when I import data from FamilySearch, but it may also happen after importing a GEDCOM received from a relative. In such cases, it often happens that the duplicates are persons that I already have, for which I tend to keep my own events and discard the imported ones. In that case, the duplicate persons just work as connections to the new persons, which I will often accept as-is, at least during the merging process.