I’ve analysed the relatively poor performance of the Verification tool.
What I found is that the internal cache usage in verify.py is far from optimal:
- Person cache statistics: 173,236 hits, 58,554 misses - 75% hit ratio
- Family cache statistics: 1,437 hits, 25,460 misses - 5% hit ratio
- Event cache statistics: 958,881 hits, 183,811 misses - 84% hit ratio
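(The hit ratio here is simply hits / (hits + misses), e.g. 173,236 / (173,236 + 58,554) ≈ 75% for the person cache.)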
This is because the internal caches were cleared after each person and family while looping through them to apply the rules. I moved the cache cleanup to the end of the verification run and the hit ratios improved considerably (a sketch of the change follows the numbers below):
- Person cache statistics: 217,119 hits, 14,671 misses - 94% hit ratio
- Family cache statistics: 20,879 hits, 6,018 misses - 78% hit ratio
- Event cache statistics: 1,089,427 hits, 53,265 misses - 96% hit ratio
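A minimal sketch of the change, assuming the general structure of verify.py with module-level caches, a cached lookup helper and a clear_cache() function; the helper names and the way rules are invoked here are simplified and illustrative, not the exact code:

```python
# Illustrative module-level cache, mirroring the pattern in verify.py
_person_cache = {}

def find_person(db, handle):
    """Return a person, filling the cache on a miss."""
    if handle not in _person_cache:
        _person_cache[handle] = db.get_person_from_handle(handle)
    return _person_cache[handle]

def clear_cache():
    _person_cache.clear()

def apply_person_rules(db, rules):
    for handle in db.iter_person_handles():
        person = find_person(db, handle)
        for rule in rules:
            rule(db, person)
        # Before: clear_cache() was called here, inside the loop, so the
        # cache was thrown away after every person and the next person's
        # related lookups all missed and hit the database again.
    # After: clear the caches only once, when the whole run has finished.
    clear_cache()
```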
The misses are basically the number of entities in my database. I then removed all the custom cache logic and migrated the verification tool to use CacheProxyDb. As expected, that gave no additional performance, but it reduces code duplication a bit. Since I need all the data anyway, I figured cache preloading could be a way to speed things up once more. Unfortunately CacheProxyDb has no built-in method to preload any of the entities, so I ended up putting my own custom cache logic back into verify.py and preloading all the caches before running the rules:
```python
# Preload all entities once, before any rule runs, so every later
# lookup is a cache hit instead of an individual database read.
for person in self.db.iter_people():
    _person_cache[person.get_handle()] = person
for family in self.db.iter_families():
    _family_cache[family.get_handle()] = family
for event in self.db.iter_events():
    _event_cache[event.get_handle()] = event
```
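For reference, the CacheProxyDb variant mentioned above amounts to just wrapping the database handle; a minimal sketch, assuming the class is importable from gramps.gen.proxy:

```python
from gramps.gen.proxy import CacheProxyDb

def make_cached_db(db):
    # CacheProxyDb keeps internal handle->object dictionaries, so repeated
    # *_from_handle() lookups only hit the real database once per handle.
    # It offers no way to preload those dictionaries up front, though.
    return CacheProxyDb(db)
```

The trade-off of the explicit preloading above is memory: all persons, families and events stay in RAM for the whole run, which seems acceptable here since the tool touches every entity anyway.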
That gave me an additional performance boost.
Moving the cache cleanup alone brought the run down from ~16 seconds to ~8 seconds (still without preloading), and with preloading I ended up at ~6 seconds, i.e. roughly a 2.7× speedup overall.
Edit: I found 4 rules that accessed the database directly to fetch events, bypassing the cache. Fixing those brought the run down to ~5 seconds.
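The fix was simply routing those lookups through the cache instead of the database; roughly like this (the helper name is illustrative, mirroring the existing person/family caches):

```python
_event_cache = {}  # module-level cache, as in the preloading snippet above

def find_event(db, handle):
    """Cached event lookup (illustrative name, mirroring the other caches)."""
    if handle and handle not in _event_cache:
        _event_cache[handle] = db.get_event_from_handle(handle)
    return _event_cache.get(handle)

# In the affected rules, a direct (cache-bypassing) call such as
#     event = self.db.get_event_from_handle(handle)
# becomes
#     event = find_event(self.db, handle)
```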
I was also looking for a db function that returns all persons, families or events at once (as a list or dictionary), ideally reading them in one big bulk read to boost the performance once more, but unfortunately there seems to be no such function in the Gramps db API.
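To make the question concrete, what I have in mind is something along these lines (a purely hypothetical helper, not an existing Gramps API; today it can only be emulated via the iterators):

```python
def get_all_events(db):
    """Hypothetical bulk read: return {handle: Event} in one pass.

    Currently this can only be emulated with the iterator (one fetch per
    object); the hope is a backend-level method doing the same thing in a
    single query/cursor pass.
    """
    return {event.get_handle(): event for event in db.iter_events()}
```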
@Nick-Hall What is your opinion about that subject?