How to make gramps-web fast

@DavidMStraub, and other gramps-web developers, I’ve been looking at what gramps-web can do differently from gramps-desktop in terms of speed. Specifically, the /get endpoints currently have to fetch all of the primary objects (Person, for example), then filter and order them, and only then select a page’s worth to render. Very expensive!

But the prospect of having JSON in the database makes a big difference (if we are willing to make some small sacrifices).

To make a big difference, we want to select only a page of data from the DB. But we can’t do that until we have the data in sorted order. Can we get all of the data for a page in sorted order without going to Python objects? Yes!

The Person data is probably the hardest case, but it is possible. The birth and death representations make this a little tricky. Here is a pure SQL selection query that is sortable and can fetch just a page of data:

SELECT json_data->>"$.gramps_id" as "Gramps ID",
       json_data->>"$.primary_name.surname_list[0].surname" as Surname,
       json_data->>"$.primary_name.first_name" as "Given Name",
       birth_date->>"$.dateval" as "Birth Date",
       death_date->>"$.dateval" as "Death Date",
       json_data->>"$.change" as "Last changed"
   FROM (SELECT person.json_data,
                birth_event_ref,
		birth_json_data->>"$.date" as birth_date,
		death_event_ref,
		death_json_data->>"$.date" as death_date
	   FROM (SELECT person.json_data,
			birth.value as birth_event_ref,
			death.value as death_event_ref
		    FROM person,
		         json_each(person.json_data, "$.event_ref_list") as birth,
		         json_each(person.json_data, "$.event_ref_list") as death
		       WHERE person.birth_ref_index = birth.key and person.death_ref_index = death.key) as person
           LEFT JOIN (SELECT handle,
	                json_data as birth_json_data
		      FROM event) as birth_event
		ON birth_event.handle = birth_event_ref->>"$.ref"
           LEFT JOIN (SELECT handle,
	                json_data as death_json_data
		      FROM event) as death_event
		ON death_event.handle = death_event_ref->>"$.ref"
   )
;

My SQL is a little rusty, but I think that is correct, and the data looks correct.

(Once we have the data selected, then we can do more expensive operations on the objects, such as getting the full name, date formats, etc.)
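
For illustration, here is a minimal Python sketch of the paging step. It is hedged: plain sqlite3 only, the query is a simplified stand-in for the full SELECT above (the event joins are omitted), and the ->> operator needs a reasonably recent SQLite.

import sqlite3

# Simplified stand-in for the full query above; the real PAGE_QUERY
# would be that SELECT with ORDER BY / LIMIT / OFFSET appended.
PAGE_QUERY = """
SELECT json_data->>'$.gramps_id',
       json_data->>'$.primary_name.surname_list[0].surname' AS surname,
       json_data->>'$.primary_name.first_name'
  FROM person
 ORDER BY surname
 LIMIT ? OFFSET ?
"""

def fetch_person_page(db_path, page, page_size=20):
    """Return one sorted page of person rows without building objects."""
    with sqlite3.connect(db_path) as con:
        return con.execute(PAGE_QUERY, (page_size, page * page_size)).fetchall()

Only the returned page (20 rows, not the whole table) then needs the expensive per-object formatting.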

What are the compromises? Well, we’ll have to do some additional work to get the ordering of surnames to exactly match the possibilities in gramps-desktop. I think SQLite’s default string ordering works as expected for most data, and we can add support for more complex surnames and locale-aware sorting if needed, as sketched below.
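
If the built-in ordering turns out to be insufficient for some locales, Python’s sqlite3 module lets us register a custom collation. A minimal sketch, assuming we sort with the server’s locale via locale.strxfrm; the collation name gramps_locale is made up:

import locale
import sqlite3

locale.setlocale(locale.LC_COLLATE, "")  # use the server's locale

def locale_collate(a, b):
    """Locale-aware comparison, usable as COLLATE gramps_locale."""
    ka, kb = locale.strxfrm(a), locale.strxfrm(b)
    return (ka > kb) - (ka < kb)

con = sqlite3.connect("gramps.db")
con.create_collation("gramps_locale", locale_collate)
# Then: ... ORDER BY surname COLLATE gramps_locale LIMIT ? OFFSET ?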

With this approach, we should be able to handle databases with millions of rows.

Another SQLite feature that we may want to investigate is Indexes On Expressions. Using this it should be possible to directly index a JSON element.
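
For example, something like this sketch should let SQLite use an index for sorting or filtering on the surname element (hedged: requires the JSON1 extension; the index name is made up):

import sqlite3

con = sqlite3.connect("gramps.db")
# Index the surname JSON element directly; queries using the same
# expression in WHERE or ORDER BY can then use this index.
con.execute(
    "CREATE INDEX IF NOT EXISTS idx_person_surname ON person "
    "(json_extract(json_data, '$.primary_name.surname_list[0].surname'))"
)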

Absolutely! That will be an easy addition once we are done with the JSON representation.

Hi @dsblank, very good point. Actually, even now we’re not leveraging any of the auxiliary columns (which contain all of the “first-level” properties) for searching or filtering.

Let me use this thread to share some learnings & thoughts on this question (how to make Gramps Web fast).

Private proxy database

For users that are not allowed to view private records, a private proxy database is used, but it turned out to be incredibly slow, since the iter_people and related methods did one individual SELECT for each object (see Performance issues with PrivateProxyDb · Issue #417 · gramps-project/gramps-web-api · GitHub). I solved this by using the auxiliary column private, but this can’t be directly backported to Gramps as-is (see the discussion in Speed up iter handles in private proxy db by DavidMStraub · Pull Request #418 · gramps-project/gramps-web-api · GitHub).
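
A minimal sketch of the idea, assuming the person table’s auxiliary private column holds 0/1:

import sqlite3

def iter_public_person_handles(con):
    """One SELECT for all non-private people, instead of one per person."""
    for (handle,) in con.execute("SELECT handle FROM person WHERE private = 0"):
        yield handle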

Searching and filtering in-memory

Searching and filtering all objects (as is necessary for the paged list views) was also way too slow, because the code in Gramps relied on iter_..._handles and then did an individual SELECT on each object. I solved that by loading all objects into memory and sorting there; see Sorting and date filtering is too slow · Issue #433 · gramps-project/gramps-web-api · GitHub.
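
In essence, the fix amounts to something like this sketch: one full-table SELECT plus an in-memory sort, instead of N individual SELECTs.

import json
import sqlite3

def surname_key(person):
    """Primary surname, or empty string for people without one."""
    surnames = person["primary_name"]["surname_list"]
    return surnames[0]["surname"] if surnames else ""

def sorted_people_in_memory(con):
    """Load all person JSON in one query, then sort in Python."""
    people = [json.loads(row[0])
              for row in con.execute("SELECT json_data FROM person")]
    people.sort(key=surname_key)
    return people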

General considerations: SQL vs memory

There are some cases (like the private proxy example or some of your JSON examples) where more use of SQL or SQL-JSON could speed up queries.

But for complex queries on individual (or few) object types - like e.g. relationship calculations, which are still way too slow - I have come to the conviction that it is simpler to just load everything into memory. Gramps databases with tens of thousands of people are typically only a couple of MB in size. And for complex queries (such as relationships) it’s quite likely that the database would do a full table scan anyway.

That’s why the next step for speedup that I was personally thinking about is to monkey-patch the database class with something like a cache_people method, which would fetch the whole people table (like iter_people does) and delegate the get_person_by_... methods to this cache (somewhat similar to Gramps’ CacheProxyDb). Then, on endpoints where we know lots of objects need to be fetched (list endpoints, relationship calculator, …), the appropriate methods could be called, while on others (e.g. getting a single object) we wouldn’t have to.
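
A rough sketch of how I read that idea; cache_people is hypothetical, while iter_people and get_person_from_handle are existing database methods:

def cache_people(db):
    """Fetch the whole people table once, then serve handle lookups
    from the cache (similar in spirit to Gramps' CacheProxyDb)."""
    cache = {person.handle: person for person in db.iter_people()}
    db.get_person_from_handle = lambda handle: cache[handle]
    return db

# List and relationship endpoints would call cache_people(db) up front;
# single-object endpoints keep the normal one-SELECT behavior.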

Backend caching

Right now, only thumbnails are cached, but none of the API calls (OK, one exception: the face detection endpoint). Caching them would speed up many things, in particular the first loading of the interface, since it always makes the same calls.

One of the reasons this is not implemented yet is that cache invalidation is a bit tricky, because right now a Gramps database doesn’t know when it was last modified. This would change if @cdhorn’s tree metadata proposal (Tree metadata tool and related changes by cdhorn · Pull Request #1403 · gramps-project/gramps · GitHub) is merged, but right now, I think the only way to know whether a database changed with respect to the last time we looked at it is to get the total number of objects and the most recently changed object timestamp for each of the primary object types.
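
That check could look roughly like this sketch, assuming each primary-object table has the auxiliary change column:

import sqlite3

TABLES = ("person", "family", "event", "place", "source",
          "citation", "repository", "media", "note", "tag")

def db_fingerprint(con):
    """Cheap proxy for 'did anything change since we last looked?'."""
    return tuple(
        con.execute(f"SELECT COUNT(*), MAX(change) FROM {table}").fetchone()
        for table in TABLES
    )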

Frontend caching

This is not about the performance of Gramps Web API itself, but using ETags, we could allow the frontend to cache API calls as well and thus be faster, in particular on slow internet connections (Implement ETags for most or all operations in the webapi · Issue #213 · gramps-project/gramps-web-api · GitHub).
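
With Flask, a conditional response could look roughly like this sketch; current_etag and get_people are hypothetical placeholders (current_etag could, for example, be derived from the fingerprint above):

from flask import Flask, jsonify, request

app = Flask(__name__)

def current_etag():
    """Hypothetical: e.g. a hash of the database fingerprint."""
    return "some-version-token"

def get_people():
    """Hypothetical handler body returning the list payload."""
    return []

@app.route("/api/people/")
def people_endpoint():
    response = jsonify(get_people())
    response.set_etag(current_etag())
    # make_conditional() answers 304 Not Modified when the client's
    # If-None-Match header matches the ETag.
    return response.make_conditional(request)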

Thanks @DavidMStraub for the update. I’ve got some PRs to look at.

At a high-level, I see that we can break this down into two parts:

  1. The object lists (people, events, families, etc.). That includes sorting (collation issues), filtering (we need a new SQL-based query language), and paging. This all seems doable and independent of the rest.
  2. All of the other “actions” (reports, imports, exports, etc.). I agree with much of what you wrote above. But perhaps we can think about generating them differently. What if creating these were done in a job queue, with the results appearing in a list (see the sketch after this list)? That way you could continue working while the action is generated in the background.
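
Something like this minimal sketch, using a plain worker thread for illustration (a real deployment would more likely use a proper task queue):

import queue
import threading
import uuid

jobs = queue.Queue()   # (job_id, action) pairs
results = {}           # job_id -> finished report/export

def worker():
    while True:
        job_id, action = jobs.get()
        results[job_id] = action()  # generate in the background
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

def submit(action):
    """Queue a long-running action and return a job id to poll."""
    job_id = str(uuid.uuid4())
    jobs.put((job_id, action))
    return job_id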