That’s a good one, which I missed, because when I moved the results window to the left, the dialog still seemed to be blocking access to Gramps, because it likes to stay on top.
But when I shift that to the left too, I can indeed switch views.
I’ve analysed the relatively poor performance of the Verification tool.
My discovery: the internal cache usage in verify.py is far from optimal.
This is because the internal cache was cleared after each person and family while looping through them to apply the rules. I then moved the cache cleanup to the end of the verification run, and the caches performed much better in terms of hits:
The misses are basically the number of entities in my database. I then removed all the custom cache logic and migrated the verification tool to use CacheProxyDb. As expected, that gave no additional performance, but it reduces code duplication a bit. Since I need all the data anyway, I figured cache preloading could be a way to speed things up once more. Unfortunately, CacheProxyDb has no built-in method to preload any of the entities, so I ended up restoring the custom cache logic in verify.py and preloading all the caches before running the rules:
```python
# Preload all three caches in one pass before any rule runs
for person in self.db.iter_people():
    _person_cache[person.get_handle()] = person
for family in self.db.iter_families():
    _family_cache[family.get_handle()] = family
for event in self.db.iter_events():
    _event_cache[event.get_handle()] = event
```
That gave me an additional performance boost.
I went down from ~16 seconds to ~8 seconds before preloading, and finally ended up at ~6 seconds, which is roughly a 2.7× (267%) speedup.
Edit: I found 4 rules that access the DB directly to fetch events, bypassing the cache. After fixing those, I’m down to ~5 seconds.
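For rules that still need an event the preload pass may have missed, one way to keep every lookup on the cache path is a single helper that falls back to the DB and remembers the result. This is only a sketch of the idea; `_event_cache` and the `get_event_from_handle` method name are assumptions, not the exact verify.py identifiers:

```python
# Sketch: route all event lookups through the preloaded cache so no rule
# bypasses it. Names here are illustrative, not the real verify.py API.
_event_cache = {}

def get_event(db, handle):
    """Return the event for *handle*, preferring the preloaded cache."""
    if handle is None:
        return None
    event = _event_cache.get(handle)
    if event is None:
        # Cache miss: fall back to a direct DB read and remember the result.
        event = db.get_event_from_handle(handle)
        _event_cache[handle] = event
    return event
```

With a helper like this, a rule cannot accidentally take the slow path; the only direct DB read happens once per handle.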
I was looking for a DB function that returns all persons, families, or events at once (as a list or dictionary), ideally reading them in one big bulk read to boost performance further, but unfortunately there seems to be no such function in the Gramps DB API.
@Nick-Hall What is your opinion about that subject?
Loading every person, family and event into memory is obviously going to make the tool run faster.
Our FAQ has a section What are the Minimum Specs to run Gramps? which lists the minimum memory requirement as 512MB or lower. This perhaps needs updating, but we should be aware that some people may be running Gramps on lower end spec machines.
The LRU used in the proxy cache has 2^17-1 (about 130,000) entries. Your cache is bigger if you don’t clear it between groups of rules.
Perhaps we can change the order in which the rules are evaluated in order to maximise cache hits? A tree walk would be worth considering.
I also notice that there are 4 rules which don’t use the cache when accessing families and events.
Yeah, I discovered those 4 rules by using pyinstrument too; fixing them gains one more second. A lot of trouble is also caused by get_date_from_event_type, because it gets called so many times. Preloading the event handles for every person and their primary baptism, christening and burial events saves one more second (with my tree). So I finally went from 16 to 4 seconds, a 4× speedup.
I checked the memory usage (via the CLI) and, surprisingly, it’s not using much more memory overall when clearing the cache only once at the end instead of after every person/family (the heap peak doesn’t vary much). Maybe because the persons/events/families are in memory already anyway? I’m not sure how Python’s memory model works here, e.g. whether only references are held or whether every person/event/family gets cloned memory-wise.
So it looks like the one-cache-for-all solution needs around 4 MiB more heap, which I’d consider negligible compared to the gain in speed.
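On the question of whether cached objects get cloned: Python containers store references, not copies, which is why keeping everything in one dict barely moves the heap peak. A quick self-contained demonstration (the `Person` class here is just a stand-in, not the Gramps one):

```python
# Demonstration that a dict stores references to objects, not clones:
# caching an object costs one pointer-sized dict entry, not a duplicate
# of the object itself.
class Person:
    def __init__(self, handle):
        self.handle = handle

p = Person("I0001")
cache = {p.handle: p}

# Same object identity: the cache holds a reference to the original.
assert cache["I0001"] is p

# Mutating through one name is visible through the other, proving
# there is only one object in memory.
cache["I0001"].handle = "I0002"
assert p.handle == "I0002"
```

So the extra cost of the big cache is essentially the dict overhead, matching the ~4 MiB observation above.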
original version:

run 1)
```
Memory usage summary: heap total: 303609330, heap peak: 25619203, stack peak: 75968
          total calls   total memory  failed calls
malloc|        747872      294707485  0
realloc|        52949         6979678  0  (nomove:17292, dec:3724, free:0)
calloc|         13793         1922167  0
free|          774849      283082604
```

run 2)
```
Memory usage summary: heap total: 299353188, heap peak: 24234781, stack peak: 75968
          total calls   total memory  failed calls
malloc|        747216      291094499  0
realloc|        52849         6506466  0  (nomove:18094, dec:3712, free:0)
calloc|         13578         1752223  0
free|          773978      278909754
```

enhanced cache version:

run 1)
```
Memory usage summary: heap total: 113511300, heap peak: 28103066, stack peak: 75968
          total calls   total memory  failed calls
malloc|         78995      104760563  0
realloc|        54229         6995618  0  (nomove:17760, dec:3716, free:0)
calloc|         13582         1755119  0
free|          105761       92960345
```

run 2)
```
Memory usage summary: heap total: 118310176, heap peak: 28105378, stack peak: 75968
          total calls   total memory  failed calls
malloc|         79746      108853215  0
realloc|        54331         7504210  0  (nomove:18459, dec:3727, free:0)
calloc|         13834         1952751  0
free|          106767       97776336
```
But you are right: a really big tree could start causing problems, because the cache has no upper limit in the current solution.
My tree has 14,671 people, 53,329 events and 6,018 families
I’ve switched the caches to LRUs now, sized at 100,000 entries each for persons and families and 200,000 for events. Performance is still good at 5 seconds, and it will fall back to direct DB accesses for very, very large trees.
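The bounded-cache behaviour described above can be sketched in a few lines. This is not the Gramps LRU implementation, just the same idea: a fixed upper bound, most-recently-used entries kept, and a `None` return on a miss so the caller falls back to a direct DB read:

```python
from collections import OrderedDict

class BoundedLRU:
    """Minimal LRU cache sketch with a fixed upper bound."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None                    # miss: caller falls back to the DB
        self._data.move_to_end(key)        # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict least recently used entry

# Bounds as described above: 100,000 persons/families, 200,000 events
person_cache = BoundedLRU(100_000)
event_cache = BoundedLRU(200_000)
```

For trees smaller than the bounds this behaves exactly like the unbounded dict; only on very large trees does eviction kick in and some lookups hit the DB again.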
What do you think @Nick-Hall ?
That’s all of my changes in here for now.
Why not have a variable rule that allows maximum cache memory usage on machines with large virtual memory configurations and minimum usage on small ones, using a psutil.virtual_memory() test somewhere?
And at worst, OSes use virtual memory; if a slight error is made, it will just swap a bit.
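A sizing rule like that might look as follows. The `psutil.virtual_memory().available` call is real psutil API, but the per-entry byte estimate, the budget fraction, and the floor/ceiling values below are made-up illustration numbers, not measured Gramps figures:

```python
# Sketch: scale the verification caches to the machine's free memory.
def cache_size_for(available_bytes, bytes_per_entry=1024,
                   budget_fraction=0.10, floor=10_000, ceiling=500_000):
    """Spend at most *budget_fraction* of available memory on the cache,
    clamped between a small floor and a generous ceiling."""
    entries = int(available_bytes * budget_fraction) // bytes_per_entry
    return max(floor, min(entries, ceiling))

# In the tool it might be wired up like this (requires psutil):
#   import psutil
#   size = cache_size_for(psutil.virtual_memory().available)
```

The clamp matters for the FAQ concern above: on a 512 MB machine the cache stays small, while a workstation with plenty of RAM gets the full-size cache.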
Just for information:
We also had a lot of sailors and fishermen, where the death date and registered burial date could be far apart and the funeral could be years later (think of whalers that were away 12–18 months or longer); many lost at sea didn’t have a burial, only a funeral when the ship or the message reached home.
Yes, some of them got buried in Brazil, or on one of the islands, or even on the land around the South Pole.
BUT, in my research on sailors in the Norwegian mercantile fleet, even into the 1920s, I have found sailors’ families holding funerals multiple years after the sailor’s death date. The same goes for war sailors who went missing: it could take a long time from missing to declared dead, and then again to a funeral, without a burial…
PS.
I am not talking about cases where the sailors were buried at sea.
I don’t know if this is something to account for in this tool; I just wanted to make a note of it.
@daleathan has marked a bunch of feature requests for the Verify tool as ‘related’ to a master feature request.
Advancements from this project might allow closing most or all of them:
Feature Request 0005114 : suggestions to improve Tools → Utilities → Verify the data
Nice feature request. I’m also thinking that saving verification result selections (“I checked it, please don’t show it to me any more”) is an important and missing feature I’d like to work on. But I’ve piled up so many changes already, while waiting for my very first pull request to be processed, that I’ll pause until all the current changes make their way into the codebase. It looks like a process that involves patience.
By the way, I did some memory profiling. The use case was: start Gramps, run Verify, exit (I wanted to check with the GUI instead of the CLI).
Here’s another note:
I just cloned your repo to a new folder named olli, so that I can do some quick testing here, and see what I can break. Some colleagues didn’t like that, but random testing often helps a lot to see things that other people don’t think about.
And since I can’t file an issue in your repo on GitHub, my first finding will be sent by PM.
You might want to consider cloning the built-in “Verify the Data” tool: verify.py, the verify.glade GUI design file, and its segment (lines 440-462) of the combined Tools plugin registration file; renaming them (perhaps as “Sanity Check”); and converting the built-in into an add-on.
This would free you from the limits of the Gramps release cycle and the beneficent tyranny of the pull request approval system. It would be under your control. An addon would also let you make more radical changes. (And well-tested portions could be spun out into the built-in “Verify the Data” tool for each Gramps enhancement release.)
If you chose to adapt it as a 5.2 Tool plugin, you could take advantage of the new audience (‘EXPERT’, ‘DEVELOPER’) and status (‘EXPERIMENTAL’, ‘BETA’) registrations too.
One of the other addon tools (Calculate Estimated Dates: GitHub, wiki) adds data where it flags itself as the source, so it can also clear that data for a subsequent recalculation. That tool also allows narrowing the scope of the process with a Custom Filter for People.
Something that makes the Import Merge addon tool (GitHub, wiki) less usable is that it creates a worklist that can be too massive to resolve in a single session, and there is no way to save that worklist to continue later.
Do you have any experience with “dynamic load balancing” or “task scheduling with load balancing” libraries like Celery? It seems like this might be an ideal candidate for getting the work off the core being used for GUI.
To be honest, forking the tool instead of enhancing the built-in one is not a good road to go down, in my opinion. Especially if the reason is that the procedure for getting things into the main code is too complex, slow-moving, or whatever. That’s not really encouraging for people who want to help and feels more like a drawback.
Basically: forking it to work around that issue just doesn’t feel like the right way to go.
I’m basically completely new to Python, so no, I don’t know Celery. But asynchronous processing is a known concept to me. Processing the verification rules in a multithreaded fashion to speed things up was my first attempt, but it came to a quick stop because database access is bound to a single thread. Nick also posted in another thread that Gramps was never designed with multithreading in mind, and adding it would mean a great overhaul. With my latest changes, where all the DB work is done before the actual data processing, I could try multithreading the rules again, but I’d like to see some code merged first.
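Since the latest changes preload everything before the rules run, rule evaluation no longer needs the DB and could, in principle, move onto a thread pool. The sketch below shows where that parallelism would slot in; the rule and report shapes are invented for illustration and are not the verify.py API, and with the GIL, pure-Python rules won’t scale linearly across threads anyway:

```python
from concurrent.futures import ThreadPoolExecutor

def run_rules(person, rules):
    """Apply every rule to one preloaded person; collect names of failures."""
    return [rule.__name__ for rule in rules if not rule(person)]

def verify_all(people, rules, workers=4):
    """Evaluate all rules over preloaded data on a thread pool.

    *people* is a dict of handle -> preloaded record, so no thread ever
    touches the (single-threaded) database layer.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda p: run_rules(p, rules), people.values()))
    return dict(zip(people.keys(), results))
```

The key design point is that the DB work and the CPU work are fully separated: the pool only ever sees plain in-memory objects.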
The main benefit for me of working on it is that the Verification tool is a built-in tool in Gramps, so every single Gramps user can directly benefit from it. If there is anything in my changes that prevents them from being merged, I’m happy to fix it. If it’s just bad timing because y’all are busy preparing the next release, that’s also fine, and I’ll happily wait until the dust settles. It’s all about communication.
In the end, I would rather start developing a new application that works with GEDCOM as input, which would also open the user base to those not using Gramps at all, before I fork verify.py or develop a new plugin for Gramps. I still feel far more comfortable in Java, which has paid my bills for over 20 years, than in Python, which I basically started learning together with writing this thread…
That’s much appreciated. One way to get better (the code and the person) is by getting to know what can be improved. I’m grateful for any report of bugs found, or advice on how things should have been done instead.
Enhancing the core code is the best way to go in this case. I can’t see why anyone would object to your changes.
You have obviously spent time investigating the trade-offs between performance and memory, and safeguarded against excessive memory usage by using LRU cache.
In this particular case I am willing to consider including this in v5.2 if it is ready. The changes are self-contained within a single plugin and are highly unlikely to affect any other code.
@Nick-Hall I removed the LRU cache and replaced it with a regular dict again, because the custom objects I created now reduce memory consumption in that cache by more than 140 MB (for my tree). Compared to the original “bad caching” solution currently in master, it only adds about 20 MB to the application’s memory footprint.
Do you want me to update the existing pull request, which currently only covers the TreeView (instead of ListView), or should we go step by step like this (my proposal):
It probably depends on how much effort your QA process for each pull request takes versus the ability to review the code changes easily.
OK, great! I just found another interesting result, no exception this time …
I checked some families that are listed as having events of type Unknown, and found that they had a marriage and a divorce, nothing more.
And further testing shows that even a family that has a single Marriage event can be marked as such, when their role in the event is Unknown. When I set their role to Primary, they’re OK.
Can you provide some test data by any chance? I have 6018 families in my tree with one having a marriage of role Unknown - and only that family gets listed for me. It should only list families who have at least one event with the role “Unknown”. Are you sure the families you get listed don’t fall into that category?
Maybe it’s a misunderstanding and the error message should be rephrased: it’s all about the role (Unknown, Primary, Family), not the event type (marriage, divorce, …).
Edit: I rephrased the error message from “Family/Person has events of type Unknown” to “Family/Person has events with role Unknown”. Thanks for pointing that out!