Interest in enhancing verify.py

I’ve analysed the relatively poor performance of the Verification tool.

What I found is that the internal cache usage in verify.py is far from optimal:

  • Person cache statistics: 173,236 hits, 58,554 misses - 75% hit ratio
  • Family cache statistics: 1,437 hits, 25,460 misses - 5% hit ratio
  • Event cache statistics: 958,881 hits, 183,811 misses - 84% hit ratio

This is because the internal cache was cleared after each person and family while looping through them to apply the rules. I then moved the cache cleanup to the end of the verification run, and the caches performed much better in terms of hits (a simplified sketch of that change follows the numbers below):

  • Person cache statistics: 217,119 hits, 14,671 misses - 94% hit ratio
  • Family cache statistics: 20,879 hits, 6,018 misses - 78% hit ratio
  • Event cache statistics: 1,089,427 hits, 53,265 misses - 95% hit ratio

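For illustration, the change is essentially the following (a simplified sketch; run_person_rules, run_family_rules and clear_cache are stand-ins for the actual rule loop and cache-reset code in verify.py):

        # Before: the caches were wiped inside the loop, so objects shared
        # between people (spouses, parents, their events) were re-read from
        # the database on every iteration.
        for person in self.db.iter_people():
            run_person_rules(person)   # placeholder for applying the person rules
            clear_cache()              # caches thrown away after every person

        # After: keep the caches alive for the whole run and clear them only
        # once, when the verification tool is finished.
        for person in self.db.iter_people():
            run_person_rules(person)
        for family in self.db.iter_families():
            run_family_rules(family)
        clear_cache()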
The miss counts are basically the number of entities I have in my database. I further removed all the custom cache logic and migrated the verification tool to use CacheProxyDb. As expected, that gave no additional performance gain, but it reduced code duplication a bit.

Since I need all the data anyway, I figured that cache preloading could be a way to speed things up once more. Unfortunately, CacheProxyDb has no built-in method to preload any of the entities, so I ended up putting back my own custom cache logic in verify.py and preloading all the caches before running the rules:

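        # Preload every person, family and event up front so the rules below
        # are served from the caches instead of one-by-one database reads.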
        for person in self.db.iter_people():
            _person_cache[person.get_handle()] = person
        for family in self.db.iter_families():
            _family_cache[family.get_handle()] = family
        for event in self.db.iter_events():
            _event_cache[event.get_handle()] = event

That gave me an additional performance boost.

The run time went down from ~16 seconds to ~8 seconds without any preloading, and with preloading it ended up at ~6 seconds, which is roughly a 2.7× speed-up overall.
Edit: I found 4 rules that were accessing the DB directly to get events, bypassing the cache. With those fixed, the run time is now down to ~5 seconds :smiley:
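
To show what that fix looks like: the rules were calling the database accessor directly, and routing them through the cached lookup is all that is needed. I am assuming a cached helper like find_event() that backs onto the _event_cache dict from the snippet above; only get_event_from_handle() is the real DB API call here:

        # Before: the rule read the event straight from the database,
        # bypassing _event_cache entirely.
        event = self.db.get_event_from_handle(event_handle)

        # After: go through the cached lookup, so a handle that has already
        # been seen is served from memory instead of hitting the database.
        event = find_event(self.db, event_handle)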

I was also looking for a DB function that would return all persons, families, or events at once (as a list or a dictionary), ideally reading them in one big bulk read to boost performance once more, but unfortunately there seems to be no such function available in the Gramps DB API.
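
To make the request concrete, this is roughly the kind of helper I have in mind. It is purely hypothetical and not part of the current DB API; today it can only be emulated on top of the existing iterators, which still fetch the records one by one:

        # Hypothetical bulk-read helper -- NOT in the current Gramps DB API.
        # A real implementation could fetch all records in a single bulk read
        # instead of going through the per-record iterator as done here.
        def get_all_people(db):
            """Return {handle: Person} for the whole database."""
            return {person.get_handle(): person for person in db.iter_people()}

        # Usage: preload the existing cache in one call.
        _person_cache.update(get_all_people(self.db))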

@Nick-Hall, what is your opinion on this subject?
