Threading (or multiprocessing), performance and refactoring

Hello!
I am starting to look at some refactoring and performance issues in an old third-party addon. It was written some years ago and tested on old hardware with few resources. So, the idea was to provide (yet) another alternative interface to the Relationships Calculator module, with output via a hack of the standard OpenDocument Spreadsheet exporter.


Performance sounded good with the example.gramps data (less than 5 seconds for calculation and display) using default values (generation depth).

As filter rule performance has been improved in Gramps 6.x (and before migrating from 5.2 to 6.0), I started to look at ways of improving this addon tool. Since it re-uses the Relationships module many times, which can be expensive (memory, calculation), maybe the first step should be a quick refactoring. Also, the more the number of columns increases, the more time calculation and display take. So, maybe something around this too.

My first exploration is to look at the threading and multiprocessing modules. That makes sense for more modern configurations and ever larger datasets. It is the first time I use them. The basic test was to move some potentially expensive sections into their own threads.
Well, the total processing time increased from 5 seconds to 12 seconds with a simple database and default values (~2000 individuals)! I did not test with large databases. So, is this normal, or did I introduce mistakes with the threads? Here is a link to these first changes (tests, experimentations):
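One general factor worth ruling out first: CPython's GIL means pure-Python, CPU-bound work does not run in parallel across threads, so thread creation and synchronization overhead alone can make things slower. A small self-contained sketch (not addon code) to check this on your machine:

```python
import threading

def cpu_work(n):
    """CPU-bound busy loop, a stand-in for relationship calculations."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def run_serial(n, workers):
    """Run the work sequentially in the main thread."""
    return [cpu_work(n) for _ in range(workers)]

def run_threaded(n, workers):
    """Run the same work in one thread per worker."""
    results = [None] * workers

    def task(idx):
        results[idx] = cpu_work(n)

    threads = [threading.Thread(target=task, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Same results either way; under the GIL the threaded version is
# usually no faster (often slower) for pure-Python CPU work.
assert run_serial(100_000, 4) == run_threaded(100_000, 4)
```

Wrapping both calls with `time.perf_counter()` typically shows the threaded version gains nothing here; threads mainly help I/O-bound work.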

Something looks wrong in my thread usage and implementation.

-            filter.add_rule(related)
-            self.progress.set_pass(_('Please wait, filtering...'))
-            filtered_list = filter.apply(self.dbstate.db, plist)
+            t_filter = Thread(target=self.t_filter_rules(related, plist))
+            t_filter.start()
+            t_filter.join()

+def t_filter_rules(self, related, plist):
+        """
+        """
+        self.filter.add_rule(related)
+        self.progress.set_pass(_('Please wait, filtering...'))
+        self.filtered_list = self.filter.apply(self.dbstate.db, plist)
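For what it's worth, the diff above contains a classic pitfall: `target=self.t_filter_rules(related, plist)` calls the method immediately in the main thread and passes its return value (`None`) as the `Thread` target, which is exactly what later raises `TypeError: 'NoneType' object is not callable`. The callable and its arguments must be passed separately. A minimal sketch of the corrected pattern, using a simplified stand-in class rather than the addon's real code:

```python
from threading import Thread

class Demo:
    """Minimal stand-in for the addon class, just to show the pattern."""

    def __init__(self):
        self.filtered_list = None

    def t_filter_rules(self, related, plist):
        # Placeholder for filter.add_rule(...) + filter.apply(...).
        self.filtered_list = [p for p in plist if p != related]

    def run(self, related, plist):
        # Pass the callable itself plus its args; do NOT call it here.
        t = Thread(target=self.t_filter_rules, args=(related, plist))
        t.start()
        t.join()
        return self.filtered_list

print(Demo().run("x", ["a", "x", "b"]))  # ['a', 'b']
```

Note that `start()` immediately followed by `join()` still blocks until the worker finishes, so even the corrected form gains no concurrency over a direct call.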

Should I rather try a thread pool? Or should I rather improve the monitoring[1]? It looks like iteration and filtering may be different in Gramps 6.0, but I first need to improve it on 5.2.x, to understand the approaches and modifications.

[1] $ gramps -d "relation_tab"

Best regards,
Jérôme

Maybe I should also add a close event to the threads?

It seems that I should not use return inside the thread section dedicated to rank calculation?

def t_rank(self, dist, max_level):
    self.rank = dist[0][0]
    if self.rank == -1 or self.rank > max_level: # not related and ignored people            
        return

This test might go inside the loop (with continue) to avoid extra calculations (for not related and ignored people).
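The idea can be sketched like this (loop shape and data layout assumed from the snippet above, not the addon's actual code):

```python
def rank_people(distances, max_level):
    """Collect ranks, skipping unrelated and too-distant people in the loop."""
    ranked = []
    for dist in distances:
        rank = dist[0][0]
        if rank == -1 or rank > max_level:
            continue  # not related, or beyond the generation depth: skip early
        ranked.append(rank)
    return ranked

print(rank_people([[(-1,)], [(2,)], [(7,)], [(3,)]], 5))  # [2, 3]
```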

There are better ways of handling threads than just starting 3. But in general, you’re going to run into issues with incompatibilities in accessing the database from different threads. I worked on trying to have a generic interface for parallel processing a couple of weeks ago, and I don’t think this is a good solution in general. Not to say that we can’t do parallel processing, but we have to do it in an abstract manner so that each database backend has options.

In any event, using Gramps 6.0.5 is going to be faster than any parallelism because of the Optimizer. You’ll want to move to it to get ready for 6.1.

Do you mean a Pool, or something high-level like the concurrent.futures module?
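For reference, `concurrent.futures.ThreadPoolExecutor` is the high-level standard-library layer over a thread pool; a generic sketch of the API (deliberately not touching the Gramps database, given the thread-safety caveats already mentioned, and with a hypothetical `compute_rank` stand-in):

```python
from concurrent.futures import ThreadPoolExecutor

def compute_rank(person_id):
    # Hypothetical stand-in for a per-person calculation.
    return person_id * 2

person_ids = [1, 2, 3, 4]
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() keeps input order and joins all workers when the block exits.
    ranks = list(pool.map(compute_rank, person_ids))
print(ranks)  # [2, 4, 6, 8]
```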

The primary idea was to move some basic calculations into functions and their related threads. I made a mistake in one of the functions: the test for limiting entries inside the loop was broken, generating extra calculations and increasing processing time. So, testing my primary idea around threading should now be complete. After looking at some samples, I just understood that dealing with lock() and playing with the location of join() might not give the expected result. The thread around the filter can be ignored anyway, as it is outside my iteration loop.

I suppose that I hit some of these incompatibilities…

Exception in thread Thread-56:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 1182, in run
    self.function(*self.args, **self.kwargs)
TypeError: 'NoneType' object is not callable
Exception in thread Thread-58:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 1182, in run
    self.function(*self.args, **self.kwargs)
TypeError: 'NoneType' object is not callable

  File "/usr/lib/python3/dist-packages/gramps/plugins/db/dbapi/dbapi.py", line 1002, in _get_raw_data
    self.dbapi.execute(sql, [handle])
  File "/usr/lib/python3/dist-packages/gramps/plugins/db/dbapi/sqlite.py", line 136, in execute
    self.__cursor.execute(*args, **kwargs)
sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 140130013738816 and this is thread id 140129322649344.
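This error comes from the sqlite3 module itself: by default (`check_same_thread=True`) a connection may only be used by the thread that created it. A standalone sketch of the rule, outside Gramps:

```python
import sqlite3
import threading

errors = []

# A connection created in the main thread...
conn = sqlite3.connect(":memory:")

def use_foreign_connection():
    try:
        conn.execute("SELECT 1")  # ...used from another thread
    except sqlite3.ProgrammingError as exc:
        errors.append(str(exc))  # same kind of error as in the traceback

def use_own_connection():
    # Each thread opening its own connection is fine for plain sqlite3.
    local = sqlite3.connect(":memory:")
    local.execute("SELECT 1")
    local.close()

for func in (use_foreign_connection, use_own_connection):
    t = threading.Thread(target=func)
    t.start()
    t.join()

print(len(errors))  # 1: only the shared-connection access failed
```

Opening a connection per thread works for plain sqlite3, but since Gramps owns its connection, an addon cannot apply that workaround without backend support.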


Is this a common limitation (Python, other applications, etc.), or one specific to the Gramps database backends?

About the Gramps 6 installation: I need a large OS upgrade first…

Well, after some refactoring, I looked at additional (or advanced) features like family network centrality, or shared subtree size… It sounds very good!

(sorry, I did not force the English locale for the screenshot)

You can see two additional columns. That's something that I could not add alone in one day (maybe in one week!)…

I was behind the instructions given to the "copilot", and the spirit and ideas are still present. :wink:

Maybe I can polish the RelID (ID Rel) map design. Anyway, by re-using most core modules from Gramps, we could quickly go very far in relationship analysis. I will not make a PR. This analysis tool has some custom behaviors, like asking to select a folder for the save-to-.ods action. Sure, after refactoring, the code is more pythonic and clean, but some sections are very experimental or still pending (DNA stuff?):

import hashlib

name = name_displayer.display(person)
# pseudo privacy; sample for DNA stuff and mapping
no_name = hashlib.sha384(name.encode() + handle.encode()).hexdigest()
_LOG.info(no_name)  # own internal password via handle

The model should be very close to the TreeView, but I cannot maintain a new View (e.g., a Relation Views Category) or make deep interactions with filter rules (e.g., as in some Graphical Views).

Note: during experimentation, I also got a ProgressMeter window via the CLI…

$ gramps -O 'example' -a tool -p name=relationtab -d "relation_tab"

I made a draft Pull Request against gramps60.

Copyright (C) 2000-2006 Donald N. Allingham

It was the plugin environment (and global gramps application).

Copyright (C) 2008 Brian G. Matherly

It was around the GUI stuff, maybe the hack for the folder selector (gtk2 to gtk3).

Copyright (C) 2010 Jakim Friant

It was the filter rules handling into the tool.

Copyright (C) 2012 Doug Blank

Might be related to the TreeView model, the ODS file format support or the Tools options.

I kept a way to generate issues around threading, via a dedicated function! So, it returns errors (on the console) without crashing the tool. The addon is not listed (include in listing = False) and there is an additional file (import number; numbers already exists as a built-in Python module).

@romjerome I am not sure what you are doing. At one point you have:

1. thread = Thread(target=self.long_running_task, args=(default_person, person,))
2. thread.start()
3. dist = self.relationship.get_relationship_distance_new(
       self.dbstate.db, default_person, person, only_birth=True)

where line 1 calls long_running_task that does this:

        dist = self.relationship.get_relationship_distance_new(
            self.dbstate.db, default_person, person, only_birth=True)

But you call that again on line 3. I don’t think this code is doing what you think it is doing. And this doesn’t look like proper thread management. But even if it were done properly, Gramps makes no guarantees that you can handle database actions in a thread.
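For the record, the usual shape is: do the expensive call once in the worker, hand the result back (via an instance attribute or a queue), and read it after `join()`, rather than repeating the call in the main thread. A generic sketch with a hypothetical stand-in for the distance calculation:

```python
import queue
import threading

def get_distance(a, b):
    # Hypothetical stand-in for get_relationship_distance_new().
    return abs(a - b)

results = queue.Queue()

def worker(a, b):
    # The expensive call happens once, here, and the result is handed back.
    results.put(get_distance(a, b))

t = threading.Thread(target=worker, args=(10, 3))
t.start()
t.join()
dist = results.get()  # reuse the worker's result instead of recomputing it
print(dist)  # 7
```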

BTW, there is a "git blame" button and CLI:

But in general it is impossible to associate a copyright with particular lines (because they may be revised or removed).

Yes, that was the plan! I just call it twice to keep a trace of the incompatibilities you warned me about, which are also listed on:

Something like testing without a crash, because the thread section does not really generate useful data; the second call does. Sure, a final version (polished, outside the draft PR) should remove the thread lines.

I remember that one script on the experimental Gram.py could generate something very close (at least the primary idea).

Looking at some comments in the code, history and documentation, there were some performance issues in the past (e.g., the SQLite backend was 30% slower than the bsddb3 backend). As that was some years ago, I am just trying to improve, a little bit, any possible way to limit slowdowns or extra processing time with a large database.

It seems that I just added a copyright when I looked at another module, or section (piece or part) of code in Gramps, then re-used it in the addon.

Maybe it makes sense for the core set of plugins. Does it still make sense when we create an addon? It seems that I re-used the GUI logic, the Plugins classes for Tools, and part of the OpenDocument logic, and just hid my name when it should have been added. For me, it is a plugin for Gramps. So, the licence, and maybe the copyright, are related to development around Gramps. As you pointed out, git blame (or svn blame!) can find the history of commits and authors.

My problem is now with coding policy, AI, copyright & co. I should be able to quickly add advanced calculations (via new columns in the table) related to Relationships, DNA, Families, Surnames, Statistics, etc. Most of them will be basic calculations (a few lines), but what about additions of copyrighted code provided by an AI? Do not worry, I checked before adding any lines from the AI suggestions. :wink: The problem might be including a "real" algorithm. I remember the lunisolar calendar issue (e.g., for Chinese dates, see the bug tracker). It was not included, despite some patches (even from myself!). Today, any AI will include it (at a glance).

Is it dangerous to keep it? Can this crash more than Gramps? Can this corrupt the database (in read-only state/mode)? Python uses its pseudo-sandbox (the GIL environment), doesn't it?

I saw some projects like this one:

but I thought that warnings were only for printing information, and that the threads were closed before.

Update: I suppose I now understand what you wrote about a proper implementation… For testing threading, I should often skip or limit the use of return inside the related function, and prefer yield for performance reasons.
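On the yield point: a generator hands results back one at a time, so the caller can interleave progress updates instead of waiting for a full list to be built and returned. A small sketch (data layout assumed, not addon code):

```python
def rank_iter(distances, max_level):
    """Yield ranks one by one, skipping unrelated/ignored entries."""
    for dist in distances:
        rank = dist[0][0]
        if rank == -1 or rank > max_level:
            continue
        yield rank  # the caller regains control after every item

# The caller drives the iteration and can update a progress meter per item:
ranks = []
for rank in rank_iter([[(1,)], [(-1,)], [(4,)]], max_level=3):
    ranks.append(rank)
print(ranks)  # [1]
```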

An ambitious feature could be: “Person Comparison: Add a feature to compare relationships and metrics between two specific individuals.”

So, it does not assign a UUID to records in a Big Tree, like FamilySearch or Ancestry. It only checks our relationship maps. I am not sure that we could fully hide surnames; maybe Soundex could be an alternative. Anonymized matrices mean decentralized storage and fewer resources too. In the 1990s, the 'Tafel Matching System' was very popular, at least in France, and 'Geneanet' has more or less improved on it.

Doug, I suppose I have it clearer and cleaner now (yes, it is possible!), and I could provide a simple test or proof-of-concept of these experiments.

How can such code be made to fit a family tree with more than 200,000 individuals?

from gramps.gen.filters import GenericFilterFactory, rules

FilterClass = GenericFilterFactory('Person')
self.filter = FilterClass()
default_person = self.dbstate.db.get_default_person()
plist = self.dbstate.db.iter_person_handles()
if default_person:  # rather designed for a run via GUI...
    root_id = default_person.get_gramps_id()
    ancestors = rules.person.IsAncestorOf([str(root_id), True])
    descendants = rules.person.IsDescendantOf([str(root_id), True])
    related = rules.person.IsRelatedWith([str(root_id)])
    self.filter.add_rule(related)
    _LOG.info("Filtering people related to the root person...")
    self.progress.set_pass(_('Please wait, filtering...'))
    self.filtered_list = self.filter.apply(self.dbstate.db, plist)
    for handle in self.filtered_list:
       ...

Someone reported that this (or another piece of code in the addon) could take more than 2 hours.

Sure, calling Relationships will use some resources, but the poor timing issue seems (to me) to lie rather in the filter rule.

Checking and retrieving information for all individuals via person_handle iteration (without filtering) seems to take around 15 minutes at most. Has the 'related' filter rule been improved/optimized in Gramps 6.0.5?
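One way to settle where the two hours go is to time the filter pass and the iteration pass separately before optimizing either. A small helper sketch (names and commented call sites hypothetical):

```python
import logging
import time

_LOG = logging.getLogger("relation_tab")

def timed_pass(label, func, *args):
    """Run one pass and log its wall-clock duration."""
    start = time.perf_counter()
    result = func(*args)
    _LOG.info("%s took %.1f s", label, time.perf_counter() - start)
    return result

# Hypothetical usage mirroring the snippet above:
# filtered_list = timed_pass("filter.apply", self.filter.apply,
#                            self.dbstate.db, plist)
# rows = timed_pass("iteration", self.build_rows, filtered_list)
```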


Right, it seems that I did some strange experiments!

I just see now that the IsRelatedWith filter rule already uses pseudo-parallel processing!

I was wondering why I get some blocks of records after the filtering pass. In my mind, the filter rule was only for limiting the dataset. Currently, the progress meter seems (to me) to list part of the filtered people while iterating over the first matching handles! Looking at the recursive code in the filter rule, I cannot really monitor this from the addon, which is only an interface. This probably cannot explain the performance issue with a large database, but maybe it explains why I wanted to explore threading during the iteration pass.


I have more difficulty properly implementing custom filter rule support in tools than using yield and generators in code!

Finally, the workflow, for a human with or without "machine" support, did not really change! And I still find possible issues never reported before, outside AI monitoring or check passes…

Sure, the AI's answer sounds good and the solution seems logical. I have now skipped the threading experiments, even those indirectly tested with yield, stacks, lists and gtk events. So, the code should be more pythonic and modern, and there is no real feature addition, except maybe more columns.

One issue is still pending. It is specific to the GUI (Gtk TreeView model); no crash or error via the CLI. It is in the gtk window used for displaying the list of results. As a pseudo-Sosa/Kekulé numbering was expected, an experimental numbering for descendants or cousins has been tested. The main problem is to have a real number, not too complicated to understand at a glance: maybe a positive one, not used by Sosa/Kekulé (so, not from 2 to infinity) and unique. Maybe something between 0 and 1? A float value will crash the gtk model:

self.model = ListModel(treeview, self.titles)
for entry, sort_key in batch: 
    self.model.add(entry, sort_key)
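If the crash comes from the column type, one common workaround is to keep strings in the displayed columns and carry the raw float only as the sort key. A plain-Python sketch of the idea (ListModel specifics assumed; `display_value` is a hypothetical helper):

```python
def display_value(value):
    """Render a numeric ranking as a fixed-format string for a text column."""
    return "%.4f" % value  # e.g. 0.1250; still parseable back to float

batch = [(0.125, "Alice"), (0.5, "Bob")]
# Each row is ((display columns...), sort_key): the model only ever sees
# strings, while sorting still uses the raw float.
rows = [((display_value(num), name), num) for num, name in batch]
for entry, sort_key in rows:
    print(entry, sort_key)
```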

It is a design issue. So it is no longer related to refactoring, polishing or cleanup, but still around performance and maybe threading with the gtk window.

PS: there is also a cache issue with re-using a list. This might display an incomplete list (signal, active person and database, etc.)
