Collaborate on Optimizing a new Custom Rule

How to find People with events at a specific Place is a recurring question on the Gramps support forums. The usual suggestion is a 3-stage Custom Filter method.

It is slow. And it gets slower as the scope is expanded or constrained: to optionally include Persons related through Family events, to include Places enclosed by the specific Place, or to consider only birth (including birth-like) or death (including death-like) Events.

Perhaps we could collaborate on a faster add-on rule. (One that might be a candidate for inclusion in the Person, Relationship & Family filter gramplets.)

  • Persons with events matching the <event filter>
    Description: Matches persons who have events that match a certain event filter based on MatchesEventFilterBase
  • Events of places matching the <place filter>
    Description: Matches events that occurred at places that match the specified place filter name based on event.get_place_handle()
  • Places enclosed by another place
    Description: Matches a place enclosed by a particular place based on located_in(db, place.handle, self.handle)

Dec 2023 - There is a simple, 2-rule, single-stage filter for this search. It looks in the Place [more precisely, the Place Title] for a String in both Personal and Family events.
(In the example.gramps tree, USA states are entered with the 2-letter abbreviation. The Place Title search ignores Alternative names. So to keep places like Florence, Italy out of a search for FL [Florida], you only have to include the comma. Search the Place Title in the Places category to see what minimum string pattern is needed to disambiguate.)

How about a new filter rule: all persons with events of [event type] and [place], plus a checkbox “all places enclosed by [place]”?


I think that would be good but…

Could it run through some evolutionary steps through the Python Gramplet and/or SuperTool?

Maybe start by consolidating the 3 stages of rules into a single module, just to see the performance difference between a cascade of 3 separate modules and a single one.
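A rough sketch of what "three stages in one module" could look like. It runs on toy dictionaries, not a real Gramps database: `ENCLOSED_BY`, `EVENT_PLACE`, `PERSON_EVENTS`, `is_within` and `people_with_events_at` are all names invented for illustration, and every handle is made up. A real rule would work on Gramps objects and use the calls named in the rule descriptions (`located_in`, `event.get_place_handle()`).

```python
# Toy data standing in for a Gramps tree (all handles are made up).
ENCLOSED_BY = {          # place -> the place that directly encloses it
    "P_street": "P_village",
    "P_village": "P_county",
    "P_county": None,
    "P_other": None,
}
EVENT_PLACE = {          # event -> place where it occurred
    "E1": "P_street",
    "E2": "P_village",
    "E3": "P_other",
}
PERSON_EVENTS = {        # person -> events the person takes part in
    "I1": ["E1"],
    "I2": ["E3"],
    "I3": ["E2", "E3"],
}


def is_within(place, target):
    """True if place is target or is enclosed by it at any depth."""
    while place is not None:
        if place == target:
            return True
        place = ENCLOSED_BY.get(place)
    return False


def people_with_events_at(target, include_enclosed=True):
    """All three stages in one function, with no intermediate filters."""
    # Stage 1 (places): resolve the matching places once, up front.
    if include_enclosed:
        places = {p for p in ENCLOSED_BY if is_within(p, target)}
    else:
        places = {target}
    # Stage 2 (events): keep only events at those places.
    events = {e for e, p in EVENT_PLACE.items() if p in places}
    # Stage 3 (people): anyone with at least one matching event.
    return sorted(
        person for person, evs in PERSON_EVENTS.items()
        if events.intersection(evs)
    )
```

The point of the sketch is that the place set and event set are each built once and then tested by set membership, instead of one filter calling the next for every candidate record. Whether that is where the cascade actually loses its time would need measuring.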

It would be helpful to begin generating some documentation on developing code to find records. Then contrasting the performance for alternative methods of doing the same things. (Such as using traditional Database queries vs. leveraging the handles stored as references & backlinks in Gramps objects.)

This code experiment could be applied as a Custom Filter Rule, a Report, a QuickView report and a Gramplet. Seeing the differences could jumpstart development of small add-ons.

The exploration may seem to be a waste of effort when the problem is fairly well-defined for creating an add-on rule.

But if you look at just the lowest stage, there is also a Place Gramplet which shows all the Places the active Place record encloses.

The Encloses Gramplet (wikilocations.py on GitHub) pops up this information almost instantly. But running a single-stage Custom Filter (or using the Place Filter Gramplet with just the ‘Encloses’ option) is significantly slower. Why?

Also, if we explored making a Quick View of the Encloses functionality (with the gramplet’s recursively expanding list or a simple alphabetical list) then that opens an opportunity for amplifying what can be done with using Places on the clipboard.

This is the type of problem that could easily be resolved with a network graph that shows all Gramps objects as nodes and all relations between Gramps objects as edges, and then uses a “show nearest neighbor” algorithm with 2 hops.

It’s not for nothing I kept saying network graphs would be perfect for genealogy research.

Maybe you can use networkx and write a script to use in the Python Gramplet or SuperTool…

In looking at this problem and at how several other tools approach it, I wonder if optimization might call for a tiered attack?

The References.py quickview module can assume a single object and a flat list of objects referencing it. So it can just grab a list of the Referencing objects and display them in a ‘QuickTable’.

If our problem did NOT include Enclosed Places, the process would become considerably simpler. The list of unique people can be built by iteratively adding the references of Events referencing the Place. (Although it’d take an extra pass to break the Family objects into individual People too.) Maybe there’s already a callable function for that?
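To make the single-Place case concrete, here is a small sketch that walks references backwards from a Place to its Events and then to the People and Families using those Events. The `BACKLINKS` and `FAMILY_MEMBERS` dictionaries and the `people_at_place` function are hypothetical stand-ins with made-up handles. In Gramps itself the reverse lookups would come from the database (its `find_backlink_handles` method), which this toy index only imitates.

```python
# Hypothetical backlink index: object -> objects that reference it.
BACKLINKS = {
    "P1": ["E1", "E2"],   # place <- events that occurred there
    "E1": ["I1"],         # personal event <- person
    "E2": ["F1"],         # family event <- family
}
# family -> its parents and children (made-up handles)
FAMILY_MEMBERS = {"F1": ["I1", "I2", "I3"]}


def people_at_place(place):
    """Collect unique people from the events that reference one place."""
    people = set()
    for event in BACKLINKS.get(place, []):
        for ref in BACKLINKS.get(event, []):
            if ref in FAMILY_MEMBERS:
                # Extra pass: break the family into individual people.
                people.update(FAMILY_MEMBERS[ref])
            else:
                people.add(ref)
    return sorted(people)
```

Using a set takes care of the "unique people" part: `I1` shows up through both a personal and a family event but is only listed once.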

(Ran across a Date range intersection routine in the locations.py so that it could deliver the appropriate Place name for an era. Wonder if there is a similar date routine elsewhere for delivering the appropriate Personal Name?)

That should be efficient enough for a single Place or a small number of places. But at what point does a database query become the better choice? Maybe when you start qualifying Roles, date ranges, or Types for Events?

Once the “Encloses” aspect is introduced, the hierarchical list of Places might contain too many Places, or collect too many referencing Events, for this References-style collation to be efficient.

Would it help to start at a higher level? It seems that multi-stage filters in general, not just those related to enclosing/enclosed places, have the potential for being “slow” (meaning “not as close to instantaneous as we’d like them to be”).

How could a database query’s “JOIN” and “WHERE” clauses access the necessary details in the pickled BLOB columns?

I don’t know enough about how filters work, but maybe there could be some caching of results (if there isn’t already) that could be available for reuse until they need to be refreshed due to changes in the data that they reference.
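Just to illustrate the caching idea, and not claiming anything about how Gramps filters are actually implemented: a minimal sketch where results are kept per filter name and thrown away when a version counter says the data has changed. `FilterCache` and its methods are hypothetical names.

```python
class FilterCache:
    """Reuse filter results until the underlying data changes."""

    def __init__(self):
        self.version = 0      # bumped on every data change
        self._results = {}    # filter name -> (version, result)
        self.misses = 0       # how often a filter really had to run

    def data_changed(self):
        """Call from whatever signals an edit to the tree."""
        self.version += 1

    def get(self, name, compute):
        """Return the cached result, or run compute() and store it."""
        cached = self._results.get(name)
        if cached is not None and cached[0] == self.version:
            return cached[1]
        self.misses += 1
        result = compute()
        self._results[name] = (self.version, result)
        return result
```

A single global version counter is the bluntest possible invalidation: any edit anywhere discards everything. A finer-grained scheme would track which object types each filter depends on, at the cost of more bookkeeping.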

Even single-stage filters using a single preset field in the Filter Gramplet are dramatically slower than some QuickViews filtering on the same criteria. So the hope is to discover where the tradeoffs intersect.

If the following is accurate, there might be a workaround to pickling. (Wonder if using the workaround means being more aware of which DB backend is in use?):

From “Using database API”:
To be compatible with BSDDB, DB-API stores Gramps data in an identical manner (pickled tuples). However, to allow for fast access, DB-API also stores “flat” data (such as strings and integers) in secondary SQL fields. These are indexed so that data can be selected without having to traverse, unpickle, initialize objects, and compare properties.
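A toy version of that idea, using only Python's standard `sqlite3` and `pickle` modules. The table and column names (`event`, `place`, `blob_data`) and the tuple layout are made up for this sketch and are not the actual Gramps schema. It only shows the difference between selecting on an indexed flat column and unpickling every row to compare a property.

```python
import pickle
import sqlite3

con = sqlite3.connect(":memory:")
# One pickled blob per record, plus a "flat" secondary column.
con.execute(
    "CREATE TABLE event (handle TEXT PRIMARY KEY, place TEXT, blob_data BLOB)"
)
con.execute("CREATE INDEX event_place ON event (place)")

rows = [("E1", "P1", "Birth"), ("E2", "P2", "Death"), ("E3", "P1", "Burial")]
for handle, place, description in rows:
    blob = pickle.dumps((handle, place, description))
    con.execute("INSERT INTO event VALUES (?, ?, ?)", (handle, place, blob))


def events_at(place):
    """Fast path: let SQLite use the indexed secondary column."""
    cur = con.execute(
        "SELECT handle FROM event WHERE place = ? ORDER BY handle", (place,)
    )
    return [row[0] for row in cur]


def events_at_by_unpickling(place):
    """Slow path: read every row and unpickle it before comparing."""
    found = []
    cur = con.execute("SELECT handle, blob_data FROM event ORDER BY handle")
    for handle, blob in cur:
        if pickle.loads(blob)[1] == place:
            found.append(handle)
    return found
```

Both functions return the same handles. The second one has to touch and unpickle every record, which is the cost the secondary fields are meant to avoid.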

Maybe dear developers are interested to learn something about the technical structure of the Isotammi service.

We export the data from Gramps as a Gramps XML file and import it into Neo4j. It is a NoSQL graph database engine with nodes and relations, just as StoltHD dreams of.

The Gramps and Neo4j data models are very close relatives. Isotammi is written mostly in Python and its source may be studied freely on GitHub. We also have ongoing efforts on functionality to update Gramps from the Isotammi side online.

The website likes visitors!
https://isotammi.net

Best Regards,
Pekka Valta


Even though a graph database as backend storage would be a great feature for Gramps, in my comment I was thinking more about using network graphs as a research tool.

Even the MongoDB backend would be a great feature if it were being updated.


But as a research tool, network graphs have an even stronger capacity when it comes to finding “hidden” relations, like the problem this post describes.

You don’t need the graphical view of the network, even though it is great to visualize most “problems” and the “answers” of a research question.
It is also possible to just use the resulting dataset in a table, or in a multi-table Gramps View.

It is also possible to add graph functionality to SQLite. It will work similarly to the node and edge tables in Gephi, Cytoscape or other network graph tools that can show the data in a table view.


I don’t know if it is possible to run these tools in SuperTool or the Python Shell Gramplet. But with pandas and networkx (or any other graph library), this type of problem should be easily resolved by building a graph dataset and running a graph algorithm on it, e.g., a nearest-neighbor search with 2-3 hops. It would also be possible to build sub-graphs with even more hops if needed, e.g., to traverse a full place hierarchy.
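For what it is worth, the hop-limited neighbor idea can be tried without installing anything. The sketch below is a plain-Python stand-in for what a graph library would do (networkx has cutoff-limited traversals such as `single_source_shortest_path_length`). The edge list, the handles and the function names are made up for illustration.

```python
from collections import deque

# Made-up edges: person-event, event-place, place-enclosing place.
EDGES = [
    ("I1", "E1"), ("E1", "P_street"), ("P_street", "P_village"),
    ("I2", "E2"), ("E2", "P_village"),
    ("I3", "E3"), ("E3", "P_other"),
]


def build_graph(edges):
    """Undirected adjacency sets from a list of (a, b) edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def within_hops(graph, start, max_hops):
    """Map each node reachable within max_hops to its hop distance."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand past the hop limit
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def people_near(graph, place, max_hops):
    """People (handles starting with 'I') within max_hops of a place."""
    return sorted(n for n in within_hops(graph, place, max_hops)
                  if n.startswith("I"))
```

With this data, 2 hops from `P_village` only reaches the person whose event is directly at the village. The person whose event is at the enclosed street needs a 3rd hop, so a deep place hierarchy pushes the hop count up accordingly.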

At the end, the data could be presented both as a visual network graph, and as table views.


But my comment was just a tip about how it might be done, if it is possible to use pandas and e.g., networkx via SuperTool or the Python Shell Gramplet.

Personally, I have stopped using Gramps for any research; it is just a storage of already confirmed data for me. I find the near unlimited possibilities of plain text markdown notes, with tools like Foam and other extensions for VSC, or Obsidian with its large number of plugins, a lot easier to use when it comes to research and finding relations across objects/subjects in my research. With those tools it is possible to find relations in both structured and unstructured text.


I did not know that you used Neo4j and imported from the XML file. Even though I am not a developer, I will look at that. It really interests me, so thank you for that information :slight_smile:

As a data point: my own family research is 4000 people, 13000 Events, 2500 places.

My machine is running Ubuntu, Intel(R) Core™2 Duo CPU T5470 @ 1.60GHz, 4Gb RAM

Since I chose to have places in a hierarchy all the way down to house number (number->street->district->town->county->country), I picked a village with 15 places.
This leads to 96 Events, which leads to 69 people (to save keying I share Residence Events)

Rough timings (fresh Gramps start each time)
Places: 3
Events: 6.8
People: 5.0

(No, I don’t understand why the Events filter is slower than the People filter which literally USES the Event filter)


Which filters are you using?

(from my custom_filters.xml)

People filter

    <filter name="People @ Place" function="and">
      <rule class="MatchesEventFilter" use_regex="False">
        <arg value="@myPlace"/>
      </rule>
    </filter>

Event Filter

    <filter name="@myPlace" function="and">
      <rule class="MatchesPlaceFilter" use_regex="False">
        <arg value="myPlace"/>
      </rule>
    </filter>

Place Filter

    <filter name="myPlace" function="or">
      <rule class="IsEnclosedBy" use_regex="False">
        <arg value="P0255"/>
        <arg value="1"/>
      </rule>
    </filter>

This might be related. The builtin HasEventBase rule might be part of the slowdown. There might be a workaround that is slower.

While experimenting with a new feature for the Filter Gramplets (which scrapes the parameters in the Filter Gramplet into equivalent rules for a new Custom Filter, allowing Report and Export filters to be created more efficiently), Kari diagnosed a bug suspected to be in his add-on experiment and found that it was in built-in Gramps filter support instead. He bypassed the problem with an alternative add-on custom filter Rule.

Maybe someone else had to do workarounds too? But those workarounds are a bit slower?

Sun 23 Oct. 2022 at 4:09 Emyoulation wrote:
Hi Kari,

Think I found a bug.

I tried making a Filter via the Filter2 gramplet in Events category.

The match was using Regular Expressions to match any character for Place … a single decimal point. When the Define Filter dialog appears, the Rule List is Events matching parameters Places :=“.” and the Option changed to “Return values that do not match”

The Custom Filter is added to the list. But the GUI shows no Rules when the filter is edited.

Sunday, October 23, 2022 at 12:30:00 PM CDT, Kari Kujansuu wrote:

Yes, that is a bug. Actually it seems that event filters [In the experimental Filter2 Gramplet] mostly don’t work at all. Thanks for letting me know.

This was actually quite interesting. The Event sidebar filter has an entry for participants. There is a builtin rule (HasEventBase) that implements the “participants” rule - but that rule is not supported by the custom_filters.xml file and the regular Filter Editor!

A simple fix was just to replace the original rule with a new custom rule that is a copy of HasEventBase. There was also a similar problem with Sources.

It doesn’t matter in this instance what makes (some) Event filters slow.

My people filter literally uses the Event filter, and yet my people filter runs faster than my Event filter.


I was very interested but immediately became lost because of the language barrier. And for some reason, my browser’s translation feature was not working.

The new Filter+ add-on is a replacement for the various built-in Filter gramplets.

Among its features is a timer showing how long a set of filters takes to get results.

When Gramps was making the transition to SQLite, we decided to use “pickled blobs” for the hierarchical data for a couple of reasons. First, the goal was to make as small a step as possible away from the previous database default (BSDDB). Also, pickled blobs are small, and it was thought that it would be good to keep the database small.

There are some serious downsides to keeping pickled blobs. The most important is that the pickle format changes with different versions of Python. Pickled objects are not backwards compatible. This is bad for data that we want to keep for long periods of time, like genealogical data. Pickled blobs can also pose security issues.

Fast forward to 2024. Storing JSON data in SQLite is now in common use, and SQLite has adapted to have some very nice query properties.

  1. Current: JSON Functions And Operators
  2. In development: The SQLite JSONB Format

How to query pickled blobs? You can’t. But if we changed the pickled blobs to JSON, you could. And we would gain some nice future-proofing and security as well.
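As a small illustration of what querying JSON inside SQLite can look like, here is a sketch using Python's standard `sqlite3` module and SQLite's `json_extract` function. It assumes the SQLite library bundled with your Python includes the JSON functions, which is common but not guaranteed on older builds. The table name, column name and JSON layout are invented for the example and are not a proposed Gramps schema.

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE event (handle TEXT PRIMARY KEY, json_data TEXT)")

# Made-up event records stored as JSON text instead of pickled blobs.
events = [
    {"handle": "E1", "type": "Birth", "place": "P1"},
    {"handle": "E2", "type": "Death", "place": "P1"},
    {"handle": "E3", "type": "Birth", "place": "P2"},
]
con.executemany(
    "INSERT INTO event VALUES (?, ?)",
    [(e["handle"], json.dumps(e)) for e in events],
)


def events_at(place, event_type=None):
    """Filter inside SQLite on fields stored within the JSON text."""
    sql = "SELECT handle FROM event WHERE json_extract(json_data, '$.place') = ?"
    params = [place]
    if event_type is not None:
        sql += " AND json_extract(json_data, '$.type') = ?"
        params.append(event_type)
    sql += " ORDER BY handle"
    return [row[0] for row in con.execute(sql, params)]
```

The WHERE clause reaches into the stored record directly, with no Python-side loading of each row. That is exactly what cannot be done with a pickled blob.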

How to make Gramps faster with filters is a different question. It may be time to revisit that issue :slight_smile:
