Datamining standard libraries for Gramps?

While looking at a new feature for Gramps, it struck me that the add-on was storing CSV data in a Gramps Note and that some of the bugs were standard parsing problems. This new Note type probably also needs to embed headers for future-proofing, particularly since the DNA segment map data could come from different sources… where the same (untransformed) data can be expected, but in a differing order and with differing header labels. That is the simplest crosswalk situation.

Then I thought about the CSV Import module for Gramps, which already does a lot of flexibility work with external CSVs: handling blank lines, multi-section files with differing content validation, header recognition, and so forth.

With a little bit of surfing I discovered Python has CSV libraries that build header dictionaries. Should we be using those?
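To make the question concrete, here is a minimal sketch of the crosswalk idea using the standard library's `csv.DictReader`, which builds a dictionary per row from the header line. The header labels, the canonical field names, and the `read_segments` helper are all assumptions made up for illustration; they are not taken from any Gramps add-on.

```python
import csv
import io

# Hypothetical crosswalk: lower-cased source header label -> canonical name.
HEADER_ALIASES = {
    "chromosome": "chromosome",
    "chr": "chromosome",
    "start location": "start",
    "start": "start",
    "end location": "end",
    "end": "end",
    "centimorgans": "cm",
    "cm": "cm",
}


def read_segments(text):
    """Parse CSV text with a header row into dicts keyed by canonical names."""
    reader = csv.DictReader(io.StringIO(text, newline=""))
    rows = []
    for raw in reader:
        row = {}
        for label, value in raw.items():
            if label is None:
                continue  # surplus cells beyond the header row
            key = HEADER_ALIASES.get(label.strip().lower())
            if key is not None:
                # Short rows yield None for missing cells, hence "or".
                row[key] = (value or "").strip()
        rows.append(row)
    return rows
```

With this, two sources that order and label their columns differently end up as the same list of dictionaries, and unrecognised columns are simply dropped.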

Are there standard libraries Gramps already uses that developers should be leveraging rather than writing their own parsers?

Data Mining Libraries

Data Processing & Modeling Libraries

Data Visualization Libraries

Genealogy or Genetics libraries

1 Like

I have asked something similar before…
There are also python modules for working with DNA…
like lineage

And there are libraries that can be used to import data into, e.g., Cytoscape, which is a full-blown bio-research tool… just saying…

Interoperability is dangerous!

1 Like

In some ways, this is an interoperability question (standards conformance and such). But it is actually more basic than that, since it doesn’t (yet!) stray into piping data with flow control between APIs.

This is mostly about simplifying development by re-use of existing code. Becoming proficient with passing data INSIDE Gramps is fundamental to fleshing out the API for interoperability.

I know some people have been doing impressive things in Gramps with the visualizations. And, judging from the postings about prerequisites & the gramplet docs, that includes leveraging GraphViz.

Well, by using many of those existing libraries, interoperability comes as a positive addition.

And I asked about it only for use in Gramps, LONG before I started to understand all that about the lack of interoperability in the export formats and the benefit of using other tools in addition to Gramps…

2 Likes

FYI Gramps has been around a lot longer than some of those libraries; just so you understand why some things seem to be done less than optimally.

Since Gramps for Windows and OSX needs to include in the bundle any library used (at least ones that use compiled code), we want to use care in adding new dependencies. For instance, NumPy is a huge dependency, probably near the same size as the rest of Gramps, so we have resisted adding it even though at least one addon could use it.

We have recently started using a couple of pure Python dependencies in addons; the addon itself loads the dependency during its first run. Even there, the Python module cannot depend on having dependencies of its own available; this can get out of hand quickly.

But this would be quite difficult to do in the more general case.

If you have a good use case for a library that is not already present, discuss it with the developers on the developers email list; if it is small, or might be more generally useful, we may want to make it part of the standard set.

2 Likes

You don’t need new libraries for that.
An exception handler in the DNA gramplet code would be enough to handle badly formatted data in the note and stop Gramps or the addon from crashing. Ignore/skip the data and inform the user with a message saying which note causes problems.
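Roughly what I mean, as a sketch only: the four-field line layout and the names `parse_segment_line` and `parse_note` are assumptions for illustration, not the actual DNA gramplet code. Bad lines are skipped and collected as messages that name the note, so the caller can show them to the user.

```python
def parse_segment_line(line):
    """Parse one 'chromosome,start,end,cM' line; raise ValueError if malformed."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) < 4:
        raise ValueError("expected 4 fields, got {}".format(len(fields)))
    chromosome, start, end, cm = fields[:4]
    return chromosome, int(start), int(end), float(cm)


def parse_note(note_id, text):
    """Return (segments, problems); malformed lines are skipped, not fatal."""
    segments, problems = [], []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            segments.append(parse_segment_line(line))
        except ValueError as err:
            problems.append(
                "Note {}, line {}: {}".format(note_id, number, err)
            )
    return segments, problems
```

The gramplet would then draw whatever ended up in `segments` and report `problems` in a dialog or log instead of crashing.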

1 Like

Seriously, I have got the really good answer “fork it and do it yourself” enough times now when asking for updated libraries or inclusions in the Windows AIO… so I do not see the point anymore!

That Gramps predates libraries isn’t much of a surprise.

Data Mining and ‘Big Data’ have become major buzz words in the last decade. I expect that a lot of the functions that used to be built from scratch are now simply standardized library calls.

And those routines have probably been through quite a few optimization cycles for advanced interactions like multi-threaded CPUs. Although the opposite can happen where a custom coded routine is more agile. It may not be buried under all the bells & whistles. Error handling for edge cases can add a lot of overhead.

Well, there’s the other approach.

If you have a local college doing Python programming or data analysis courses, the professors are often looking for good, limited-scope projects to assign to advanced students. (Projects that are not “yet another” game!) They cannot use last semester’s assignments because students have always been very efficient cheaters.

Besides a grade, a sole coder credit for a project add-on (where there are 10s of thousands of users) is a good resume item for a student. With Open Source, a prospective employer can look at their code and documentation. Plus they can review their interactions with other developers & clients.

So, if you have people skills (something I lack), try promoting Gramps as a project platform. Ask the professor to provide an example of what they hand out as a Design Spec. Then write up a design spec for what you want.

1 Like

Sorry, I don’t know anyone who wants to spend time on a lineage-linked system…
I have already tried, but when they also needed to create imports/exports for interchangeable formats just to be able to use Gramps with other tools (they are like me, they seriously think that GEDCOM should be deprecated from all systems), they just weren’t interested as long as their Excel sheets and networkx and plotly scripts did the job for them.

And I have already started converting the projects I was going to share to other formats, so I stopped the job of getting them into Gramps XML; it was too much hassle for me since I’m neither an XML developer nor a Python programmer.
So my hacks will stay private.

I can only contribute ideas, but when there is no interest in implementing those ideas, there is not much more to do.

You are making an incorrect assumption about the approach.

The teachers & students don’t need to have an interest in the final productivity in Genealogy. It is simply a gradeable exercise in a real-world task. A class of 60 could each have different small exercises or larger 3-5 person exercises.

The modularity of Gramps makes it an ideal free platform for a wide variety of exercises.

1 Like

I don’t think that’s how it works in Norway… and anyway, as I say, I don’t know any “professors”

1 Like

Matplotlib overhead

For us Windows users, 60-100MB software installations are not much… even my Notebook software is bigger than that…

But I do agree that it’s good to try to limit the size… But 50MB extra for great functionality is not a problem for most researchers, when other software packages take 400-500MB and some of them over 1.5GB.
I have photography and video software on my computer that each take 1.5-3GB when installed…

I don’t see why people here are arguing over whether a dependency is too big of an overhead or not. Let the user decide!

It’s totally obvious to me that all the packages mentioned by @emyoulation in the original post would not be dependencies of the core Gramps, but of addons. So where is the problem? Why can’t we simply allow arbitrary Python dependencies in addons? Just declare the dependencies in the register method and let the user take care of the install. Something like:

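import pkg_resources
from importlib.metadata import PackageNotFoundError, version

# addon.requirements: requirement strings declared by the addon (hypothetical)
for requirement in addon.requirements:
    # version() expects a distribution name, not a Requirement object
    name = pkg_resources.Requirement.parse(requirement).project_name
    try:
        version(name)  # only checks that the package is installed
    except PackageNotFoundError:
        raise SomeAppropriateGrampsError(
            "This addon requires the installation of the {} package"
            .format(requirement)
        )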

If a user wants to use scipy for something, I’m pretty sure they also know how to install it!
You could even go one step further and install the package programmatically. This is done e.g. by one of my favourite Python packages, where I also took the inspiration for the above snippet: see here.

4 Likes

@DavidMStraub Addon devs yes, but I’m afraid that many Gramps users might be overwhelmed by additional installations. It would be great if we could provide the installation of addon dependencies together with the download in the addon manager.

2 Likes

Yes, that would certainly be optimal. But that only makes sense when installing the dependencies with pip. But I don’t see a problem with that, even on Windows (even though I have never tried).

By the way, concerning

I don’t quite agree because every single package mentioned by @emyoulation can be installed with pip without any problems nowadays, thanks to wheels. Even PyTorch and Scipy! This is very different from the situation 3 or 4 years ago when one had to compile C extensions.

I’m not saying that we should start pip-installing dependencies of the core Gramps package - that would be a bad idea. But having an addon manager that automatically installs addon dependencies with pip in the background would be perfectly possible and very useful in my opinion.
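A rough sketch of what such a background install could look like, using only the standard library. The function names are hypothetical and nothing here is an existing Gramps API. It also assumes that `sys.executable` points at a Python interpreter that has pip available, which may not hold for the bundled Windows or macOS builds.

```python
import subprocess
import sys
from importlib.metadata import PackageNotFoundError, version


def missing_packages(names):
    """Return the distribution names from `names` that are not installed."""
    missing = []
    for name in names:
        try:
            version(name)
        except PackageNotFoundError:
            missing.append(name)
    return missing


def pip_command(names):
    """Build a pip command that targets the interpreter running this code."""
    return [sys.executable, "-m", "pip", "install", *names]


def install_missing(names, run=subprocess.check_call):
    """Install whatever is missing; `run` is injectable so it can be tested."""
    todo = missing_packages(names)
    if todo:
        run(pip_command(todo))
    return todo
```

Calling pip through `sys.executable -m pip` rather than a bare `pip` avoids installing into a different Python than the one Gramps is running under. A real addon manager would still need to ask the user first and handle a failed install.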

3 Likes

My initial interest was in which functions are being implemented manually by add-ons but are now available in Python. (And where the built-ins might have more extensive error handling and multi-core threading support.)

(Is Gramps 5.1.3 at Python 3.3? I see GitHub is at 3.5)

The particular functionality that inspired that question is the redundantly implemented variations of CSV import & export.

My thought was that we might at least encourage new add-on development into leveraging those built-ins.

@emyoulation regarding multi-core threading support:
I don’t think there is a single addon right now doing calculations heavy enough that multi-core threading is needed. The other addons are limited by Gramps and the database read/write, so adding multi-core threading to addons wouldn’t speed anything up.

I actually opened a pull request (still open) with a very small change that allows the CLI database handler class to open the database in read-only mode, which would allow multi-threaded read-only operations on an SQLite Gramps DB. Feel free to add your :+1: :wink:

The motivation for this is using multithreading in a web application (https://github.com/gramps-project/web-api/issues/16).
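To illustrate the general idea outside of Gramps, here is a sketch using plain `sqlite3` from the standard library: each thread opens its own read-only connection through a `mode=ro` URI. This is not the code from the pull request, and the `person` table and the file name are stand-ins made up for the demo.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def count_rows(path, table):
    """Open a private read-only connection in this thread and count rows."""
    # "mode=ro" makes SQLite refuse any write through this connection.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        query = "SELECT COUNT(*) FROM {}".format(table)
        return conn.execute(query).fetchone()[0]
    finally:
        conn.close()


def demo():
    """Create a throwaway database, then read it from several threads."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "example.db")
        conn = sqlite3.connect(path)
        conn.execute("CREATE TABLE person (handle TEXT)")
        conn.executemany(
            "INSERT INTO person VALUES (?)", [("a",), ("b",), ("c",)]
        )
        conn.commit()
        conn.close()
        with ThreadPoolExecutor(max_workers=4) as pool:
            return list(
                pool.map(lambda _: count_rows(path, "person"), range(8))
            )
```

One connection per thread sidesteps the `sqlite3` module's default check that a connection is only used in the thread that created it.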

3 Likes