Making Gramps Filters Faster, and Then Superfast

I thought I’d describe the status of the work on JSON and on speeding up Gramps’ filters. None of these steps have been incorporated into the code yet; they are still being reviewed. The steps so far:

  1. Convert internal database format to JSON, an expressive format - under review
  2. Create a method for developers to write fast code without changing core code - under review
  3. Refactor the filter system - under review
  4. Optimize the filter system - under review (included in 3)
  5. Add “business logic” where appropriate - not started

I want to describe steps (3) and (4). First, some background: the filter system is very flexible and well designed, and is at least 18 years old. At its heart is simplicity. Here is some pseudo-code to show how it works. Every filter is composed of a set of rules:

results = []
for person in db.get_all_people():    # builds a full Person() object for every record
    if all(rule.apply(person) for rule in rules):
        results.append(person.handle)
return results

That’s the basics anyway. One of the most expensive parts is hidden in get_all_people(): turning Gramps’ raw JSON data from the database into a full Person() object is expensive, and it is better if we can avoid it. So, the main part of step (3) is to remove that conversion, using the new JSON data directly instead of objects, changing the code to something like:

results = []
for data in db.get_all_people_data():    # raw JSON data, no Person() created
    if all(rule.apply(data) for rule in rules):
        results.append(data["handle"])
return results

That seems like a very small change, but the original filter system called the above code recursively, so the Person() object was getting re-created over and over. With just this refactoring step, every filter will run anywhere from about 30% to 200% faster. (Some rules still need a Person() object, and they will create one, but only if they need it, and steps are taken so that it is never re-created once made.)
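
To make that last point concrete, here is a minimal sketch of the lazy, create-once idea. make_person() and LazyPerson are illustrative names standing in for whatever converts the raw JSON data into a full Person() object; this is not the actual Gramps API.

class LazyPerson:
    """Wrap raw JSON data; build the heavy Person() only on first use."""

    def __init__(self, data):
        self.data = data              # cheap: the raw dict from the database
        self._person = None           # heavy object, not built yet

    @property
    def person(self):
        if self._person is None:      # build at most once
            self._person = make_person(self.data)   # hypothetical converter
        return self._person

# A rule that can work from raw data uses wrapper.data and never pays the
# conversion cost; a rule that needs the full object uses wrapper.person
# and pays it only once per person.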

But we can do even better. Consider that we want to filter on all people tagged with the “ToDo” tag. There are only 5 people tagged “ToDo” but we have to go through all of the people to find them. Let’s say there are 10,000 people in the database.

Gramps has already addressed this scenario in the filters. Some rules, like isAncestorOf(), pre-compute the ancestors, and so the rule merely checks whether a person’s handle is in the pre-computed set of handles (called rule.map by convention). The big insight was to use these maps rather than going through the entire database.
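
As a rough sketch of that convention (simplified, not the exact Gramps rule interface), a rule can fill its map once in a prepare step and then make apply() a cheap membership test. db.get_handles_for_tag() below is a hypothetical helper.

class HasTagRule:
    """Simplified sketch of a rule that pre-computes rule.map."""

    def __init__(self, tag_name):
        self.tag_name = tag_name
        self.map = set()                   # pre-selected handles

    def prepare(self, db):
        # One pass over the small tag table, instead of per-person work.
        self.map = set(db.get_handles_for_tag(self.tag_name))

    def apply(self, data):
        return data["handle"] in self.map  # cheap set-membership test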

The optimized version now looks something like:

results = []
handles = get_maps(rules) # recursively searches filters/rules
for handle in handles: # only 5 handles in the HasTag example
    data = db.get_person_data_from_handle(handle)
    if all(rule.apply(data) for rule in rules): # check all rules
        results.append(data["handle"])
return results

If rules like HasTag() or isAncestorOf() have the appropriate map (pre-selected handles), then the loop goes from having to look at 10,000 people to looking at only 5. Superfast!

Step (4) is a bit more complex because, in Gramps, a rule can actually be another filter with its own rules, and so on; but at the very bottom is a set of rules. Also, some of these filters can be inverted: that is, the map is actually a set of handles that should not be considered. In addition, rules can be combined using “and”, “or”, or “only one”. So the optimization has to pick the right maps, otherwise it could produce incorrect results, and wrong is worse than slow! Lots of testing remains to be done to prove this code is correct.
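
To make the “pick the right maps” problem concrete, here is a hedged sketch (not the final implementation) of how candidate handle sets might be combined. It assumes an invert flag on inverted rules; since an inverted rule’s map lists handles to exclude, it cannot safely shrink the candidate set on its own. Note that the main loop still applies every rule to each candidate, which is what keeps the “and” case correct even when some rules have no map.

def candidate_handles(rules, logical_op, all_handles):
    """Sketch: choose the smallest safe set of handles to scan."""
    maps = [rule.map for rule in rules
            if getattr(rule, "map", None) and not getattr(rule, "invert", False)]
    if not maps:
        return all_handles                    # no usable map: full scan
    if logical_op == "and":
        # Every rule must match, so intersecting the available maps is safe;
        # the main loop still applies every rule to each candidate.
        return set.intersection(*maps)
    if logical_op == "or":
        # A union is only safe if *every* rule contributed a map; otherwise
        # an unmapped rule could match someone outside the union.
        if len(maps) == len(rules):
            return set.union(*maps)
        return all_handles
    # "only one" (exactly one rule matches) is trickier: fall back for now.
    return all_handles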

There is also a caveat: creating maps takes memory, so we have to be careful about how and when we create them. It is a trade-off of memory for speed. It may not be appropriate for all uses, so we may want a checkbox to turn it on and off.

Finally, why did the above code go for 18 years without being revised? The code is fairly complicated, but it worked. I suspect that most developers don’t have trees with 10,000 or more people in them, so the old system was fast enough. Also, it wasn’t clear what the problem was. Step (3) fixes the implementation, and step (4) does something new.

I hope that this was useful. Do you have any questions?


Thank you for looking at these internals of Gramps.
I hope there is someone who could speed up the way screens are loaded when they are opened. One that bugs me every day is the loading of all the sources/citations when the window is opened.

Case: new event is created, new citation added.

Problem: When you select the sources tab, your entire source/citation database table is loaded. For me that takes 2 seconds, and it is getting longer as more are added.

Fix: As a minimum, only the sources that will fill the screen should be loaded, with more loaded as the screen is scrolled. If a search is used, search the table in the DB using your new methods and return the results. If the user selects a source in the window, they can expand it and the screen will populate with the required citation rows.
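
For illustration only, a rough sketch of the kind of paged loading meant here; get_source_handles() and get_source_data() are hypothetical helpers, and the sketch assumes the handle list is already in display order:

PAGE_SIZE = 50   # roughly one screenful of rows

def load_source_page(db, page=0):
    """Fetch only enough source rows to fill the visible screen."""
    handles = db.get_source_handles()        # lightweight handle list, assumed sorted
    start = page * PAGE_SIZE
    return [db.get_source_data(handle)
            for handle in handles[start:start + PAGE_SIZE]]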

@Davesellers, I feel your pain. In fact, this was the first thing I looked at earlier this year to try to speed up Gramps. But this is almost impossible to fix with the current architecture. Why? Because the full objects are needed (in many cases) in order to sort them. You can’t load just a page unless the items have already been sorted. New implementations of Gramps (like gramps-web) can re-think this basic operation (get page of sorted data) and operate as you describe.

Not to say that there aren’t gains to be made, but they are unlikely to be fast like the ones for filters.

Kari created an experimental Recent Items library addon. It adds a small (6 item) table at the top of the object selector dialog.

This addon does not reduce the re-populate latency for the dialog. But it does dramatically reduce the navigation time when repeatedly selecting the same Citation or Place during data entry.


I don’t think this would help me much. I use the clipboard as much as possible. However, I haven’t found a way to use it when adding citations because they are so diverse.
Instead of all the sources and citations being loaded when the window opens, why not load just the sources (for me this is about 200)? These can be sorted quickly. Then, if you expand a source, those rows are loaded and sorted. A search should only return a few rows and can be sorted quickly. This needs to be broken down into smaller chunks as DBs grow in size.

I guess I should say that in my professional life I was used to working with SQL data and indexed tables, and Gramps is not there yet. :weary:

It isn’t exactly that. A lot of code has been written in order to be able to sort on complicated criteria. Take names, for example. How do you sort a name? There are many issues, like whether you have prefixes, suffixes, and name preference types. Does “von Trapp” appear under T or under V? Many people care about such details. The only way to do that in SQL (or another backend) is to re-do that logic in SQL.
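
As a toy illustration of the kind of logic that would have to be re-done in SQL (the field names are made up, not the real Gramps name model):

def name_sort_key(name, prefix_as_primary=False):
    """Sort key for a name dict like {"prefix": "von", "surname": "Trapp", "first": "Georg"}."""
    if prefix_as_primary:
        primary = f'{name["prefix"]} {name["surname"]}'.strip()   # "von Trapp" files under V
    else:
        primary = name["surname"]                                 # "von Trapp" files under T
    return (primary.lower(), name["first"].lower())

people = [{"prefix": "von", "surname": "Trapp", "first": "Georg"},
          {"prefix": "", "surname": "Smith", "first": "Ann"}]
people.sort(key=name_sort_key)    # Smith sorts before Trapp with the default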

Anyway, this thread is about filter optimization.
