A new (probably temporary) user on Facebook needs to move a 161,279-person file from PAF to Gramps via GEDCOM. The import took 17 minutes. They wondered if Gramps would be this slow everywhere.
After thinking about it, it seemed like the only feature likely to take longer would (understandably) be the Narrated Website Report.
Are there any other features that are likely to take minutes to run? (A backup of a 50,000-person tree takes less than a minute on my Linux box.)
Working on the same user’s problem, I discovered that the Add/Remove Tag tool using the Ancestors of <person> filter is unbearably slow when tagging 48,819 people. (It was at 30 minutes and about halfway done when I wrote this; it ultimately took 75 minutes.)
The next step will be to use the same tool to remove that tag from people with a parent… which should leave just the earliest known ancestor in each direct line. (Update: that took 95 minutes.)
Once again, I would like to draw the community’s attention to the growing demand from users for improved application performance. People today are not willing to spend a lot of time waiting, as they did decades ago when PCs were just beginning to appear in homes; they want to work efficiently. I have raised the issue of query processing speed in GRAMPS repeatedly. I understand databases, and I know that by changing approaches, significantly better performance can be achieved. In particular, I want to emphasize once again that genealogical researchers’ databases are now scaling rapidly, being merged, and becoming quite large. I already see users on this forum mentioning databases with 400,000 people. How will they work with GRAMPS? I sincerely hope that the developers’ focus will at least partially shift toward improving performance.
I think we can improve that substantially. I have some ideas. Where is the best place to discuss such ideas these days? The mailing list, here, or on GitHub?
Preliminary discussion here seems to be more “findable” when blue-skying, judging by the histories of similar issues in other features. When the plan hits its 1st commit, discussion should shift to GitHub. You’re right, there are a lot of developers who do not monitor Discourse.
(We’ll put a Feature Request in MantisBT … so it gets in the Roadmap and Release Notes… and link all discussions.)
I was wondering if this is similar to the problem corrected for deleting large quantities of objects… where signals were making it try to refresh the GUI too much?
See Faster Multiple Person Delete by prculley · Pull Request #997 · gramps-project/gramps · GitHub
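If it is, the usual remedy is to batch the work and keep signals quiet until the end. A minimal sketch of that pattern, assuming db is an open database and the handles and tag were collected beforehand (my sketch, not the actual code from that PR):

from gramps.gen.db import DbTxn

# Sketch only: suppress per-object notifications during the bulk edit,
# then rebuild the views once at the end.
db.disable_signals()
with DbTxn("Tag many people", db, batch=True) as trans:
    for handle in person_handles:          # person_handles: assumed input
        person = db.get_person_from_handle(handle)
        person.add_tag(tag_handle)         # tag_handle: assumed input
        db.commit_person(person, trans)
db.enable_signals()
db.request_rebuild()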
I guess as we move forward, there are general solutions across all of Gramps, and particular solutions for specific issues. For this one, let’s get specific:
@emyoulation, can you describe the process you took to apply the tag to the filter? I want to make sure I can replicate the situation.
For me, the most annoying is the Deep Connections Gramplet, which needs multiple minutes for searches that take only 1 or 2 seconds on Geneanet.
The user’s problem can be handled with Gramps, and probably better than with PAF, although PAF’s advanced filters are quite good, and more interactive than ours. The problem is that selecting people who are ancestrally related there does not work the same way as it does in Gramps, and it selects far more people than one might need.
As far as I’m concerned, there’s not much need for tagging, and I only use tags for things like marking people as family who are actually in-laws of in-laws. The user’s problem, selecting people who share an ancestor with him, is quite easy to solve, even if he wants to include spouses, which is normally what you want. It may take a while, but since it is normally a one-time operation, that’s not so bad.
I have some big trees, one with over 600,000 people, and Gramps is slow. And TBH, I would never recommend Gramps to anyone with over 100,000 people, because they will definitely run into problems with the Deep Connections Gramplet, and maybe a few other tools.
Starting point: testing a workflow for a new user who has a 161,000-person tree grown (in PAF) over 19 years and imported as a GEDCOM. The tree has a LOT of intertwined trees and research entered for exclusionary purposes. He wants to trim it to the descendants (with spouses) of all ancestors of one person.
So I began with a copy of my 50,057-person tree with myself (Gramps ID HJS-4203-4) as the Home Person. I thought I’d try tagging the related (48,819) people.
The tree has the following pedigree gramplet report:
Generation 1 has 1 of 1 individual (100.00% complete)
Generation 2 has 2 of 2 individuals (100.00% complete)
Generation 3 has 4 of 4 individuals (100.00% complete)
Generation 4 has 8 of 8 individuals (100.00% complete)
Generation 5 has 16 of 16 individuals (100.00% complete)
Generation 6 has 32 of 32 individuals (100.00% complete)
Generation 7 has 55 of 64 individuals (85.94% complete)
Generation 8 has 87 of 128 individuals (67.97% complete)
Generation 9 has 112 of 256 individuals (43.75% complete)
Generation 10 has 168 of 512 individuals (32.81% complete)
Generation 11 has 205 of 1024 individuals (20.02% complete)
Generation 12 has 177 of 2048 individuals (8.64% complete)
Generation 13 has 152 of 4096 individuals (3.71% complete)
Generation 14 has 108 of 8192 individuals (1.32% complete)
Generation 15 has 83 of 16384 individuals (0.51% complete)
Generation 16 has 63 of 32768 individuals (0.19% complete)
Generation 17 has 45 of 65536 individuals (0.07% complete)
Generation 18 has 41 of 131072 individuals (0.03% complete)
Generation 19 has 10 of 262144 individuals (0.00% complete)
Generation 20 has 9 of 524288 individuals (0.00% complete)
Generation 21 has 8 of 1048576 individuals (0.00% complete)
Generation 22 has 7 of 2097152 individuals (0.00% complete)
Generation 23 has 8 of 4194304 individuals (0.00% complete)
Generation 24 has 4 of 8388608 individuals (0.00% complete)
I created 1 tag and 3 filters (one rule each) in custom_filters.xml:
<filters>
  <object type="Person">
    <filter name="Common Ancestry" function="and" comment="Common ancestry with Home Person">
      <rule class="HasCommonAncestorWith" use_regex="False" use_case="False">
        <arg value="HJS-4203-4"/>
      </rule>
    </filter>
    <filter name="HasParent" function="and" invert="1">
      <rule class="MissingParent" use_regex="False" use_case="False"/>
    </filter>
    <filter name="Related" function="and" comment="related to Home Person">
      <rule class="IsRelatedWith" use_regex="False" use_case="False">
        <arg value="HJS-4203-4"/>
      </rule>
    </filter>
  </object>
</filters>
Add Tagging of 48,819 people with the Related filter took 75 minutes.
Remove Tagging of 31,240 people with the HasParent filter took 95 minutes.
PS: I realize that the experiments were not the right way to approach the objective. The Common Ancestry filter is closer, although I would still need a 2nd pass to include spouses of direct descendants.
Although for a genealogy tool, it would be more reasonable to export “Families” matching HasCommonAncestorWith rather than Person objects. But there is no Family rule to filter for common ancestry, no Family filter for export, and no rule to select all family members (father, mother, offspring).
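For what it’s worth, a Family rule like that could probably be written as a small add-on. A rough sketch, with the Rule API details from memory and FamilyHasCommonAncestorWith being an invented name, not an existing rule:

from gramps.gen.filters.rules import Rule

class FamilyHasCommonAncestorWith(Rule):
    """Hypothetical rule: match families where the father or mother
    shares an ancestor with the designated person (sketch only)."""
    labels = ["ID:"]
    name = "Families with common ancestry"
    description = "Matches families whose partners share an ancestor"
    category = "Ancestral filters"

    def prepare(self, db, user):
        # A real rule would precompute the set of qualifying person
        # handles here, e.g. by reusing HasCommonAncestorWith.
        self.matching_handles = set()

    def apply(self, db, family):
        for handle in (family.get_father_handle(),
                       family.get_mother_handle()):
            if handle and handle in self.matching_handles:
                return True
        return False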
How are you tagging?
I routinely run a filter to find those related to each of my grandparents (their branch).
For one branch, the filter found 129,554 relatives out of 240,369 total people in 141.84 seconds. Then tagging that list took 5.05 minutes.
Gramps 5.2.3-2 on Win10.
I used the Add/Remove Tag tool with a new tag and the IsRelatedWith rule.
Are you tagging from the filtered view instead?
<filter name="Related" function="and" comment="related to Home Person">
  <rule class="IsRelatedWith" use_regex="False" use_case="False">
    <arg value="HJS-4203-4"/>
  </rule>
</filter>
I tagged the selected list in the People view.
Does View tagging work when there are more than 1,000 records in the View?
Gtk had a hard limit of 1,000 records for tables. Some features (like SuperTool by @kku) have made special accommodations to get past that limit.
Tagging in the view seems to be taking a while too. (It is on track for about 40 minutes with 8,996 persons in a filtered view: spouses of people with common ancestors.) And an interesting distribution of resources.
I just tested this with my Karel de Grote tree, and it’s quite easy and reasonably fast. It takes two steps:
- Create a filter that selects all persons who share ancestors with me,
- Add another that selects the output of the 1st filter, and the spouses of the people selected by that same filter, where at least one rule must apply.
I normally negate that, so that I can select all people who are not related in this manner, but for speed that does not really matter; the positive one can be used to export the true relatives and their spouses to a new tree, keeping the original.
On this tree, this filter needs less than 5 minutes in 5.1.7, running on an HP Envy with an AMD Ryzen 7, 16 GB RAM, and a 500 GB SSD.
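For reference, the pair of filters described above would look roughly like this in custom_filters.xml (a sketch: I0001 is a placeholder ID, and function="or" expresses the “at least one rule must apply” setting):

<filter name="Shares ancestors" function="and">
  <rule class="HasCommonAncestorWith">
    <arg value="I0001"/>
  </rule>
</filter>
<filter name="Relatives and spouses" function="or">
  <rule class="MatchesFilter">
    <arg value="Shares ancestors"/>
  </rule>
  <rule class="IsSpouseOfFilterMatch">
    <arg value="Shares ancestors"/>
  </rule>
</filter>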
P.S. With these filters in place, I see no need for tagging.
I see some low-hanging fruit where we could start to make some progress. Looking over the code (it’s been a while!) I see some things that haven’t changed. Consider this (a version of which appears in something like 20 views):
def get_handle_from_gramps_id(self, gid):
    """
    Return the handle of the person having the given Gramps ID.
    """
    obj = self.dbstate.db.get_person_from_gramps_id(gid)
    if obj:
        return obj.get_handle()
    else:
        return None
This is sooo expensive for what it does (unless there is a caching db proxy):
- First it finds the row in the database
- Then it converts it to a Gramps primary object
- Then it calls a function to get the handle (an attribute on the object)
- And then it throws the object away, returning just the handle
This is all done for a reason: to make an abstraction layer that will work on a variety of different backends. But there is another way: if the database supports a better method, let’s use that, and only fall back to this otherwise.
Imagine replacing all of the above in the database layer, with something specific to that DB (here DBAPI):
"""Select handle from person where gramps_id = ?"""
This is just one small example, but if we can do the same for each query one at a time, then we can incrementally make Gramps faster and faster. BSDDB users can do the same for their layer.
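To make that concrete: a minimal sketch of what the DBAPI-side method could look like, with the generic object-building version kept as the fallback for other backends. The method name here is my invention; self.dbapi is the connection wrapper that layer already uses.

def get_person_handle_from_gramps_id(self, gramps_id):
    """
    Return just the handle, skipping Person object construction.

    Sketch only: assumes the DBAPI SQLite schema, where the person
    table has handle and gramps_id columns.
    """
    self.dbapi.execute(
        "SELECT handle FROM person WHERE gramps_id = ?", [gramps_id]
    )
    row = self.dbapi.fetchone()
    return row[0] if row else None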
I was using Tagging to do some manual sanity checks (just spot checks): colorizing records in unfiltered list views, using Chart views that leverage Tags, and filtering in Fan Chart views.
OK, I get it. I mentioned it because, in a previous message, it looked like you were suggesting that tags are a part of the process.
I just ran the test above, and the direct SQL approach is almost 4x faster than operating on Python objects. I think this difference would be even bigger on a larger database. (Tested on 2,575 people, looking up IDs 200,000 times.) Let me know if you want the test script.
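The shape of the test is roughly this (a sketch, not the exact script; it assumes a Gramps DBAPI database file whose person table has handle and gramps_id columns):

import random
import sqlite3
import time

conn = sqlite3.connect("sqlite.db")  # path to a Gramps DBAPI database file
ids = [row[0] for row in conn.execute("SELECT gramps_id FROM person")]
sample = random.choices(ids, k=200_000)

start = time.perf_counter()
for gid in sample:
    conn.execute(
        "SELECT handle FROM person WHERE gramps_id = ?", (gid,)
    ).fetchone()
print("direct SQL:", time.perf_counter() - start, "seconds")

# The object-based path does this per lookup instead, fetching and
# unpickling the whole serialized Person just to read its handle:
#     db.get_person_from_gramps_id(gid).get_handle()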
This is especially relevant when filters are used. Filters are built in Python, not as SQL queries, so Python does a huge amount of work with objects, iterating over them many times. I’m not sure that such complicated, deeply nested filters can be built as SQL queries, but it would be really fast.
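As a toy illustration of the gap (a sketch only: iterate_people is a made-up stand-in for the filter machinery, and the surname column assumes the DBAPI backend’s secondary columns):

import sqlite3

conn = sqlite3.connect("sqlite.db")  # a Gramps DBAPI database file

# Python-side rule: build a full Person object for every row just to
# compare one field (iterate_people is a hypothetical helper).
matches = [
    handle
    for handle, person in iterate_people(db)
    if person.get_primary_name().get_surname() == "Smith"
]

# SQL push-down: the database scans one column; no objects are built.
matches = [
    row[0]
    for row in conn.execute(
        "SELECT handle FROM person WHERE surname = ?", ("Smith",)
    )
]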
To follow up on the performance of the Deep Connections Gramplet, which has annoyed me for years: I did some analysis of the code, which has a few errors that cause it to process the same person multiple times, but after correcting those, it’s still slow.
And now that I know the algorithm itself is sound, I’m beginning to think that the real problem is in the database, which has no quick way to retrieve a person’s relatives. That’s because there’s no table for that, as there is in PAF or RootsMagic, so for each person you need to retrieve the blob data from the database, unpickle it, and process the Python object to find the handles of associations, which often aren’t there. So, for most relatives, you need to find the family handles, retrieve the family objects, and unpickle those too. And that’s much, much slower than with a database that stores all such relations in a separate table, like the other programs do.
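To sketch what such a table would buy us (a hypothetical schema, not anything Gramps has today): finding a person’s children becomes one indexed query instead of a chain of blob fetches and unpickles.

import sqlite3

# Hypothetical relations table, like the ones PAF and RootsMagic keep.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE relation (parent_handle TEXT, child_handle TEXT)")
conn.execute("CREATE INDEX idx_parent ON relation (parent_handle)")

# One indexed lookup returns all children of a person ("H1234" is a
# placeholder handle).
children = [
    row[0]
    for row in conn.execute(
        "SELECT child_handle FROM relation WHERE parent_handle = ?",
        ("H1234",),
    )
]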