Interest in enhancing verify.py

That’s a good one, which I missed: when I moved the results window to the left, the dialog still seemed to block access to Gramps, because it likes to stay on top.

But when I shift that to the left too, I can indeed switch views.

I’ve analysed the relatively poor performance of the Verification tool.

My findings: the internal cache usage in verify.py is far from ideal:

  • Person cache statistics: 173,236 hits, 58,554 misses - 75% hit ratio
  • Family cache statistics: 1,437 hits, 25,460 misses - 5% hit ratio
  • Event cache statistics: 958,881 hits, 183,811 misses - 84% hit ratio

This is because the internal cache was cleared after each person and family while looping through them to apply the rules. I then moved the cache cleanup to the end of the verification tool, and the caches performed much better in terms of hits (a sketch of the change follows the numbers below):

  • Person cache statistics: 217,119 hits, 14,671 misses - 94% hit ratio
  • Family cache statistics: 20,879 hits, 6,018 misses - 78% hit ratio
  • Event cache statistics: 1,089,427 hits, 53,265 misses - 96% hit ratio
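A minimal sketch of the before/after pattern, not the actual verify.py code: the cache dict and the find_person/clear_cache helpers are simplified stand-ins, and apply_rules stands in for the per-person rule evaluation. Only the position of the cleanup call changes.

    # Simplified stand-ins for the module-level cache and helpers in verify.py.
    _person_cache = {}

    def find_person(db, handle):
        # Return a cached person, reading from the DB only on a cache miss.
        person = _person_cache.get(handle)
        if person is None:
            person = db.get_person_from_handle(handle)
            _person_cache[handle] = person
        return person

    def clear_cache():
        _person_cache.clear()

    def run_rules_old(db, apply_rules):
        # Old behaviour (roughly): the cache is wiped after every person,
        # so later rules touching the same objects re-read them from the DB.
        for handle in db.iter_person_handles():
            apply_rules(find_person(db, handle))
            clear_cache()

    def run_rules_new(db, apply_rules):
        # New behaviour: keep the cache warm for the whole run and
        # clear it only once at the very end of the tool.
        for handle in db.iter_person_handles():
            apply_rules(find_person(db, handle))
        clear_cache()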

The misses are basically the number of entities I have in my database. I further removed all the custom cache logic and migrated the verification tool to use CacheProxyDb. As expected, that gave no additional performance gain but reduced code duplication a bit. I figured that since I need all the data anyway, cache preloading could be a way to speed things up once more. Unfortunately CacheProxyDb has no built-in method to preload any of the entities, so I ended up putting the custom cache logic back into verify.py and started to preload all the caches before running the rules:

        # Preload all three module-level caches in a single pass each,
        # before any rules run, so the rules never hit the database
        # for individual objects.
        for person in self.db.iter_people():
            _person_cache[person.get_handle()] = person
        for family in self.db.iter_families():
            _family_cache[family.get_handle()] = family
        for event in self.db.iter_events():
            _event_cache[event.get_handle()] = event

That gave me an additional performance boost.

I went down from ~16 seconds to ~8 seconds without preloading, and finally ended up at ~6 seconds with preloading, which is roughly a 2.7× speedup.
Edit: I found 4 rules that access the DB directly to get events, bypassing the cache. After fixing those I ended up at ~5 seconds :smiley:

I was looking for a DB function that would return all persons, families or events at once (as a list or dictionary), ideally reading them in one big bulk read to boost performance once more, but unfortunately there seems to be no such function available in the Gramps DB API.

@Nick-Hall What is your opinion about that subject?

Loading every person, family and event into memory is obviously going to make the tool run faster.

Our FAQ has a section “What are the Minimum Specs to run Gramps?” which lists the minimum memory requirement as 512MB or lower. This perhaps needs updating, but we should be aware that some people may be running Gramps on lower-spec machines.

The LRU used in the proxy cache has 2^17-1 (about 130,000) entries. Your cache is bigger if you don’t clear it between groups of rules.

Perhaps we can change the order in which the rules are evaluated in order to maximise cache hits? A tree walk would be worth considering.

I also notice that there are 4 rules which don’t use the cache when accessing families and events.

Yeah - I discovered those 4 rules by using pyinstrument too; fixing them gains one more second. A lot of trouble is also caused by get_date_from_event_type because it gets called many times. Preloading the event handles for every person and their primary baptism, christening and burial events saves yet another second (with my tree). So I finally went from 16 to 4 seconds, which is a 4× speedup :slight_smile:
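Something along these lines is the shape of that preload, using standard Gramps API calls (get_event_ref_list, get_role, get_type, get_date_object); the helper name and exact structure here are illustrative, not necessarily what the actual change looks like:

    from gramps.gen.lib import EventType, EventRoleType

    _BIRTH_LIKE = (EventType.BAPTISM, EventType.CHRISTEN, EventType.BURIAL)

    def preload_event_dates(db):
        # Build {person_handle: {event_type: Date}} in one pass over the
        # database, so the rules never look up the same event twice.
        dates = {}
        for person in db.iter_people():
            per_person = {}
            for event_ref in person.get_event_ref_list():
                if event_ref.get_role() != EventRoleType.PRIMARY:
                    continue
                event = db.get_event_from_handle(event_ref.ref)
                etype = event.get_type()
                for wanted in _BIRTH_LIKE:
                    if etype == wanted:
                        per_person[wanted] = event.get_date_object()
                        break
            dates[person.get_handle()] = per_person
        return dates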

I checked the memory usage (using the CLI) and, surprisingly, it does not use that much more memory overall when comparing clearing the cache after every person/family with clearing it only once at the end (the heap peak does not vary much). Maybe because the persons/events/families are in memory already anyway? I’m not sure how Python’s memory model works here, e.g. whether only references are held or whether every person/event/family gets cloned memory-wise.

So it looks like the one-cache-for-all solution needs around 4 MiB more heap, which I would consider negligible compared to the speed gain.
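For what it’s worth, Python containers store references, not copies, so putting an object that is already alive into a cache dict only costs the dict entry, not a second copy of the object. A quick way to convince yourself:

    import sys

    class Person:
        pass

    person = Person()
    cache = {"handle-1": person}

    # The dict entry is just another reference to the same object, not a clone.
    print(cache["handle-1"] is person)   # True
    # Only the dict structure itself takes extra space; Person is not duplicated.
    print(sys.getsizeof(cache))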

original version:

run 1)
Memory usage summary: heap total: 303609330, heap peak: 25619203, stack peak: 75968
         total calls   total memory   failed calls
 malloc|     747872      294707485              0
realloc|      52949        6979678              0  (nomove:17292, dec:3724, free:0)
 calloc|      13793        1922167              0
   free|     774849      283082604
run 2)
Memory usage summary: heap total: 299353188, heap peak: 24234781, stack peak: 75968
         total calls   total memory   failed calls
 malloc|     747216      291094499              0
realloc|      52849        6506466              0  (nomove:18094, dec:3712, free:0)
 calloc|      13578        1752223              0
   free|     773978      278909754

enhanced cache version:

run 1)
Memory usage summary: heap total: 113511300, heap peak: 28103066, stack peak: 75968
         total calls   total memory   failed calls
 malloc|      78995      104760563              0
realloc|      54229        6995618              0  (nomove:17760, dec:3716, free:0)
 calloc|      13582        1755119              0
   free|     105761       92960345
run 2)
Memory usage summary: heap total: 118310176, heap peak: 28105378, stack peak: 75968
         total calls   total memory   failed calls
 malloc|      79746      108853215              0
realloc|      54331        7504210              0  (nomove:18459, dec:3727, free:0)
 calloc|      13834        1952751              0
   free|     106767       97776336

But you are right: an immensely big tree could start causing problems because the cache has no upper limit in the current solution.

My tree has 14,671 people, 53,329 events and 6,018 families

I’ve now switched the caches to LRU, sized at 100,000 persons and families each and 200,000 events. Performance is still good at 5 seconds, and for very, very large trees it will fall back to direct DB accesses (sketched below).
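Just to illustrate the behaviour with something self-contained, the same sizing can be expressed with the standard library’s lru_cache; the actual branch presumably uses Gramps’ own LRU class, so treat this purely as a sketch of the fall-back semantics:

    from functools import lru_cache

    def make_cached_getters(db):
        # Bounded caches that fall back to the DB automatically: once a cache
        # is full, the least-recently-used entries are evicted and the next
        # access for them simply reads from the database again.
        @lru_cache(maxsize=100_000)
        def get_person(handle):
            return db.get_person_from_handle(handle)

        @lru_cache(maxsize=100_000)
        def get_family(handle):
            return db.get_family_from_handle(handle)

        @lru_cache(maxsize=200_000)
        def get_event(handle):
            return db.get_event_from_handle(handle)

        return get_person, get_family, get_event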
What do you think @Nick-Hall ?

That’s all of my changes in here now:

  • more rules
  • fixed rules (direct db access)
  • Tree View (pull request already open)
  • performance (cache preloading)

Why not have a variable rule that allows maximum cache memory usage on configurations with large virtual memory and a minimum on small ones, using a psutil.virtual_memory() test somewhere? (A rough sketch follows below.)

And at worst, OSes use virtual memory; if the estimate is slightly off, the system will just swap a bit.
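A rough sketch of what such a test could look like; the thresholds and cache sizes below are made-up example numbers, not a recommendation:

    import psutil

    def pick_cache_sizes():
        # Scale the verify.py cache limits to the memory actually available.
        available = psutil.virtual_memory().available
        if available > 4 * 1024 ** 3:        # more than ~4 GiB available
            return {"person": 200_000, "family": 200_000, "event": 400_000}
        if available > 1 * 1024 ** 3:        # more than ~1 GiB available
            return {"person": 100_000, "family": 100_000, "event": 200_000}
        # Low-memory machines: keep the caches small and rely on the DB.
        return {"person": 10_000, "family": 10_000, "event": 20_000}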

Just for information:

  • In Scandinavia it could be months between the death, the burial and the funeral ceremony, because we had ambulating priests and harsh winters; still, we can have exact dates for all three events.

We also had a lot of sailors and fishermen where the death date and the registered burial date could be far apart, and the funeral could be years later (think of whalers that were away 12-18 months or longer). Many lost at sea didn’t have a burial, only a funeral when the ship or the message reached home.
Yes, some of them got buried in Brazil or on one of the islands, and even on the land around the South Pole.

BUT, in my research on sailors in the Norwegian merchant fleet, even in the 1920s, I have found sailors’ families holding funerals multiple years after the sailor’s death date. The same goes for war sailors that went missing: it could take a long time from missing to declared dead, and then again until a funeral, without any burial…

PS.
I am not talking about cases where the sailors were buried at sea.


I don’t know if this is something to take into account in this tool; I just wanted to make a note of it.

@daleathan has marked a bunch of feature requests for the Verify tool as ‘related’ to a master feature request.

Advancements from this project might allow closing most or all of them:

Feature Request 0005114 : suggestions to improve Tools → Utilities → Verify the data

Nice feature request. I’m also thinking that saving verification result selections, as in “I checked it, please don’t show it to me any more”, is an important missing feature I’d like to work on. But I have piled up so many changes for now, while waiting for the very first pull request to be processed, that I’ll pause until all the current changes make their way into the codebase. Looks like it’s a process involving patience :wink:


By the way, I did some memory profiling. The use case was: start Gramps, run verify, exit (I wanted to check with the GUI instead of the CLI).

  • original verify.py (no tree, no preloading…): 121 MiB and 10 seconds runtime
  • verify.py with tree and preloading everything (person, events, family): 261 MiB and 6 seconds runtime
  • my newest verify.py version with private Person and Family objects that only contain the data verify.py needs to do its job. Only those objects are cached now; no events are cached at all. This gave another nice performance increase, and memory consumption with it is 148 MiB and 4 seconds runtime (the idea is sketched below).
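Roughly the idea, sketched with invented field names (the real private classes in the branch will differ): instead of caching full Person objects, cache small objects holding only what the rules read.

    from dataclasses import dataclass

    @dataclass
    class _SlimPerson:
        # Only what the rules actually need; everything else stays in the DB.
        handle: str
        gender: int
        birth_date: object            # a Gramps Date or None
        death_date: object
        family_handles: tuple
        parent_family_handles: tuple

    def slim_person(person, birth_date=None, death_date=None):
        # Build the reduced copy once; the full Person can then be dropped.
        return _SlimPerson(
            handle=person.get_handle(),
            gender=person.get_gender(),
            birth_date=birth_date,
            death_date=death_date,
            family_handles=tuple(person.get_family_handle_list()),
            parent_family_handles=tuple(person.get_parent_family_handle_list()),
        )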

Here’s another note:

I just cloned your repo into a new folder named olli, so that I can do some quick testing here and see what I can break. Some colleagues didn’t like that, but random testing often helps a lot to find things that other people don’t think about.

And since I can’t file an issue in your repo on GitHub, my first finding will come by PM.


You might want to consider cloning the built-in “Verify the Data” tool’s verify.py, its verify.glade GUI design file and its segment (lines 440-462) of the combined Tools plugin registration file, renaming them (perhaps as “Sanity Check”), and converting the built-in into an add-on.

This would free you from the limits of the Gramps release cycle and the beneficent tyranny of the Pull Request approval system. It would be under your control. An addon would also let you make more radical changes. (And well-tested portions could be spun out into the built-in “Verify the Data” tool for each Gramps enhancement release.)

If you choose to adapt it as a 5.2 Tool plugin, you could take advantage of the new audience (‘EXPERT’, ‘DEVELOPER’) and status (‘EXPERIMENTAL’, ‘BETA’) registrations too.

One of the other addon tools (Calculate Estimated Dates: GitHub, wiki) adds data and flags itself as the source, so it can also clear that data for a subsequent recalculation. That tool also allows narrowing the scope of the process with a Custom Filter for People.

Something that makes the Import Merge (addon tool: GitHub, Wiki) less usable is that it creates a worklist that can be too massive to resolve in a single session, but there is no way to save that worklist to continue later.

Do you have any experience with “dynamic load balancing” or “task scheduling with load balancing” libraries like Celery? It seems like this might be an ideal candidate for getting the work off the core being used for the GUI.

To be honest, forking the tool instead of enhancing the built-in one is not a good road to take in my opinion, especially if the reason is that the procedure for getting things into the main code is too complex/slow-moving/whatever. That’s not really encouraging for people who want to help and feels more like a drawback.
Basically: forking it to work around that issue just doesn’t feel like the right way to go.


I’m basically completely new to Python, so no, I don’t know Celery. But asynchronous processing is a known concept to me. Processing the verification rules in a multithreaded fashion to speed things up was actually my first attempt, but it came to a quick stop because database access is bound to just one thread. Nick also posted in another thread that the Gramps design never had multithreading in mind and that it would mean a major overhaul. With my latest changes in my tree, where all the DB work is done before the actual data processing, I could try multithreading the rules again, but I’d like to see some code move first.
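If all the DB reads really do happen up front, a threaded attempt could roughly take the shape below (names are illustrative, and the rules are assumed to be plain callables over the cached data). Worth noting that Python’s GIL limits what threads can gain for pure-Python rule code, so this is more a thought experiment than a guaranteed speed-up:

    from concurrent.futures import ThreadPoolExecutor

    def run_rules_parallel(cached_people, rules, max_workers=4):
        # Apply every rule (a plain callable here) to every preloaded person.
        # Assumes the rules only read in-memory data and never touch the DB
        # or shared mutable state.
        def check_person(person):
            findings = []
            for rule in rules:
                finding = rule(person)      # returns a report string or None
                if finding is not None:
                    findings.append(finding)
            return findings

        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            per_person = pool.map(check_person, cached_people)
        return [f for findings in per_person for f in findings]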

The main benefit for me of working on it is that the Verification Tool is a built-in tool in Gramps, so every single Gramps user could directly benefit from it. If there is anything in my changes that prevents them from being merged, I’m happy to fix it. If it’s just bad timing because you’re all busy preparing the next release, that’s also fine and I’ll happily wait until the dust settles. It’s just all about communication :slight_smile:

In the end I would rather start developing a new application working with GEDCOM as input, which would also open the user base to people not using Gramps at all, before I fork verify.py or develop a new plugin for Gramps. I still feel way more comfortable in Java, which has paid my bills for over 20 years, than in Python, which I basically started learning together with writing this thread…

That’s much appreciated. One way to get better (both the code and the person :smile:) is to learn what can be improved. I’m grateful for any message about found bugs, or advice on how things should have been done instead.

Enhancing the core code is the best way to go in this case. I can’t see why anyone would object to your changes.

You have obviously spent time investigating the trade-offs between performance and memory, and safeguarded against excessive memory usage by using an LRU cache.

In this particular case I am willing to consider including this in v5.2 if it is ready. The changes are self-contained within a single plugin and are highly unlikely to affect any other code.

@Nick-Hall I removed the LRU cache and replaced it with a regular dict again, because the custom objects I made now greatly reduce the memory consumption of that cache, by more than 140 MB (for my tree), and compared to the original “bad caching” solution currently in master they only add about 20 MB to the application’s memory footprint :slight_smile:

Do you want me to update the existing pull request, which currently only covers the TreeView (instead of ListView), or should we go step by step like this (my proposal):

  1. Pull request for “Tree View”
  2. Pull request for "New Rules" (there might be discussions about new options, or we just go with well-thought-out defaults (might be easier for users?))
  3. Pull request for “Performance enhancements” (there might be discussions about caching)

It probably depends on how much effort your QA process takes for each pull request versus how easily the code changes can be reviewed.

OK, great! I just found another interesting result, no exception this time …

I checked some families that are listed as having events of type Unknown, and found that they had a marriage and a divorce, nothing more.

And further testing shows that even a family that has a single Marriage event can be marked as such, when their role in the event is Unknown. When I set their role to Primary, they’re OK.

Can you provide some test data by any chance? I have 6,018 families in my tree, with one having a marriage with role Unknown, and only that family gets listed for me. It should only list families that have at least one event with the role "Unknown". Are you sure the families that get listed for you don’t fall into that category?

Maybe it’s a misunderstanding and the error message should be rephrased; it’s all about the role (Unknown, Primary, Family), not the event type (marriage, divorce, …).

Edit: I rephrased the error message from “Family/Person has events of type Unknown” to “Family/Person has events with role Unknown”. Thanks for pointing that out! :+1:
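In other words, the check is about the role on the event reference, not the event’s type. A simplified sketch using the standard Gramps API (the actual rule code is structured differently):

    from gramps.gen.lib import EventRoleType

    def has_unknown_role_event(family):
        # True if any event attached to the family is referenced with the
        # role "Unknown", regardless of what type of event it is.
        for event_ref in family.get_event_ref_list():
            if event_ref.get_role() == EventRoleType.UNKNOWN:
                return True
        return False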
