Datamining standard libraries for Gramps?

emyoulation · October 19, 2020, 6:57pm

While looking at a new feature for Gramps, it struck me that the add-on was storing CSV data in a Gramps Note and that some of the bugs were standard parsing problems. This new Note type probably also needs to embed headers for future-proofing, particularly since the DNA segment map data could come from different sources… where the same (untransformed) data can be expected but in differing order and with differing header labels. That is the simplest crosswalk situation.

Then I thought about the CSV Import module for Gramps, which already does a lot of flexibility work with external CSVs: parsing, handling blank lines, multi-section files with differing content validation, header recognition, and so forth.

With a little bit of surfing I discovered Python has CSV libraries that build header dictionaries. Should we be using those?
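As a rough illustration of the header-dictionary idea, here is a minimal sketch using the standard library's `csv.DictReader`. The crosswalk table, the header labels and the canonical field names are all hypothetical; they are not taken from any Gramps add-on or DNA vendor format.

```python
import csv
import io

# Hypothetical crosswalk: lower-cased source header label -> canonical name.
# Both sides are illustrative assumptions, not a real vendor layout.
CROSSWALK = {
    "chromosome": "chromosome",
    "chr": "chromosome",
    "start location": "start",
    "start": "start",
    "end location": "end",
    "end": "end",
    "centimorgans": "cm",
    "cm": "cm",
}


def read_segments(text):
    """Parse CSV text with a header row into dicts keyed by canonical names."""
    reader = csv.DictReader(io.StringIO(text))  # first row becomes the keys
    rows = []
    for raw in reader:  # DictReader skips completely blank lines
        row = {}
        for label, value in raw.items():
            if label is None:
                continue  # surplus cells beyond the header row
            key = CROSSWALK.get(label.strip().lower())
            if key is not None:
                row[key] = (value or "").strip()
        rows.append(row)
    return rows
```

With this approach the column order and the exact header wording of the source stop mattering, as long as each label appears in the crosswalk table.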

Are there standard libraries Gramps already uses that developers should be leveraging rather than writing their own parsers?

Data Mining Libraries

BeautifulSoup
Scrapy (Structured)

Data Processing & Modeling Libraries

Pandas
NumPy
SciPy
TensorFlow
(example: Using Nucleus and TensorFlow for DNA Sequencing Error Correction)
PyTorch

Data Visualization Libraries

Genealogy or Genetics libraries

SNPs : tools for reading, writing, merging, and remapping SNPs
Lineage : tools for genetic genealogy and the analysis of consumer DNA test results. (Prerequisites: Python 3.5, numpy, pandas, matplotlib, atomicwrites, snps)

StoltHD · October 19, 2020, 7:14pm

I have asked something similar before…
There are also python modules for working with DNA…
like lineage

And there are libraries that can be used to import data into, e.g., Cytoscape, which is a full-blown bio-research tool… just saying…

Interoperability is dangerous!

emyoulation · October 19, 2020, 7:22pm

In some ways, this is an interoperability question. (Standards conformance and such) But it is actually more basic than that since it doesn’t (yet!) stray into piping data with flow control between APIs.

This is mostly about simplifying development by re-use of existing code. Becoming proficient with passing data INSIDE Gramps is fundamental to fleshing out the API for interoperability.

I know some people have been doing impressive things in Gramps with the visualizations. And from the postings about prerequisites & the gramplet docs, that includes leveraging GraphViz.

StoltHD · October 19, 2020, 7:48pm

Well, by using many of those existing libraries, interoperability comes as a positive addition.

And I asked about it only for use in Gramps, LONG before I started to understand all that about lack of interoperability in the export formats and the benefit of using other tools in addition to gramps…

prculley · October 19, 2020, 9:44pm

FYI Gramps has been around a lot longer than some of those libraries; just so you understand why some things seem to be done less than optimally.

Since Gramps for Windows and OSX needs to include in the bundle any library used (at least ones that use compiled code), we want to use care in adding new dependencies. For instance, NumPy is a huge dependency, probably near the same size as the rest of Gramps, so we have resisted adding it even though at least one addon could use it.

We have recently started using a couple of pure Python dependencies in addons; the addon itself loads the dependency during its first run. Even there the Python module cannot depend on having dependencies of its own available; this can get out of hand quickly.

But this would be quite difficult to do in the more general case.

If you have a good use case for a library that is not already present, discuss it with the developers on the developers email list; if it is small, or might be more generally useful, we may want to make it part of the standard set.

Mattkmmr · October 19, 2020, 11:51pm

You don’t need new libraries for that.
An exception handler in the DNA gramplet code would be enough to handle badly formatted data in the note and stop Gramps or the addon from crashing. Ignore/skip the data and inform the user with a message saying which note causes the problem.
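The skip-and-report approach could look roughly like the sketch below. The four-field line layout (`chromosome,start,end,cM`) and the function names are assumptions made for illustration; they are not the DNA gramplet's actual format or code.

```python
def parse_segment_line(line):
    """Parse one 'chromosome,start,end,cM' line (hypothetical layout).

    Raises ValueError when the line is malformed.
    """
    fields = [field.strip() for field in line.split(",")]
    if len(fields) < 4:
        raise ValueError("expected at least 4 fields, got {}".format(len(fields)))
    chromosome, start, end, cm = fields[:4]
    return chromosome, int(start), int(end), float(cm)


def parse_note(note_id, text):
    """Return (segments, problems); bad lines are skipped, never fatal."""
    segments, problems = [], []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            segments.append(parse_segment_line(line))
        except ValueError as err:
            # Record which note and line failed so the user can be told.
            problems.append("Note {}, line {}: {}".format(note_id, number, err))
    return segments, problems
```

The caller can then draw whatever parsed cleanly and show the `problems` list in a warning dialog instead of crashing.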

StoltHD · October 20, 2020, 12:07am

Seriously, I have got the really good answer “fork it and do it yourself” enough times now, when asking for updated libraries or inclusions in the Windows AIO… so I do not see the point anymore!

emyoulation · October 20, 2020, 12:16am

That Gramps predates libraries isn’t much of a surprise.

Data Mining and ‘Big Data’ have become major buzz words in the last decade. I expect that a lot of the functions that used to be built from scratch are now simply standardized library calls.

And those routines have probably been through quite a few optimization cycles for advanced interactions like multi-threaded CPUs. Although the opposite can happen where a custom coded routine is more agile. It may not be buried under all the bells & whistles. Error handling for edge cases can add a lot of overhead.

emyoulation · October 20, 2020, 12:30am

Well, there’s the other approach.

If you have a local college doing Python programming or data analysis courses, the professors are often looking for good, limited-scope projects to assign to advanced students. (Projects that are not “yet another” game!) They cannot use last semester’s assignments because students have always been very efficient cheaters.

Besides a grade, a sole coder credit for a project add-on (where there are 10s of thousands of users) is a good resume item for a student. With Open Source, a prospective employer can look at their code and documentation. Plus they can review their interactions with other developers & clients.

So, if you have people skills (something I lack), try promoting Gramps as a project platform. Ask the professor to provide an example of what they hand out as a Design Spec. Then write up a design spec for what you want.

StoltHD · October 20, 2020, 2:03am

Sorry, I don’t know anyone who wants to spend time on a lineage-linked system…
I have already tried, but when they also needed to create imports/exports for interchangeable formats, just to be able to use Gramps with other tools (they are like me, they seriously think that GEDCOM should be deprecated from all systems), they just weren’t interested as long as their Excel sheets and networkx and plotly scripts did the job for them.

And I have already started converting the projects I was going to share to other formats, so I stopped the work of getting them into Gramps XML; it was too much hassle for me since I’m neither an XML developer nor a Python programmer.
So my hacks will stay private.

I can only contribute ideas, but when those ideas are not of interest to implement, then there’s not much more to do.

emyoulation · October 20, 2020, 2:45am

You are making an incorrect assumption about the approach.

The teachers & students don’t need to have an interest in the final productivity in Genealogy. It is simply a gradeable exercise in a real-world task. A class of 60 could each have different small exercises or larger 3-5 person exercises.

The modularity of Gramps makes it an ideal free platform for a wide variety of exercises.

StoltHD · October 20, 2020, 8:03am

I don’t think that’s how it works in Norway… and anyway, as I say, I don’t know any “professors”.

emyoulation · October 22, 2020, 11:55am

Matplotlib overhead

StoltHD · October 22, 2020, 4:36pm

For us Windows users, 60-100 MB software installations are not much… even my Notebook software is bigger than that…

But I do agree that it’s good to try to limit the size… But 50 MB extra for great functionality is not a problem for most researchers, when other software packages take 400-500 MB and some of them over 1.5 GB.
I have photography and video software on my computer that each take 1.5-3 GB when installed…

DavidMStraub · October 24, 2020, 8:08am

I don’t see why people here are arguing over whether a dependency is too big of an overhead or not. Let the user decide!

It’s totally obvious to me that all the packages mentioned by @emyoulation in the original post would not be dependencies of the core Gramps, but of addons. So where is the problem? Why can’t we simply allow arbitrary Python dependencies in addons? Just declare the dependencies in the register method and let the user take care of the install. Something like:

import pkg_resources
from importlib.metadata import PackageNotFoundError, version

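# `addon.requirements` is assumed to be a list of requirement strings,
# e.g. ["scipy>=1.5"]; SomeAppropriateGrampsError is a placeholder name.
for requirement in addon.requirements:
    # version() expects a distribution name, not a Requirement object
    name = pkg_resources.Requirement.parse(requirement).project_name
    try:
        version(name)
    except PackageNotFoundError:
        raise SomeAppropriateGrampsError(
            "This addon requires the installation of the {} package"
            .format(requirement)
        )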

If a user wants to use scipy for something, I’m pretty sure they also know how to install it!
You could even go one step further and install the package programmatically. This is done e.g. by one of my favourite Python packages, where I also took the inspiration for the above snippet: see here.

Mattkmmr · October 24, 2020, 9:13am

@DavidMStraub Addon devs yes, but I’m afraid that many Gramps users might be overwhelmed by additional installations. It would be great if we could provide the installation of addon dependencies together with the download in the addon manager.

DavidMStraub · October 24, 2020, 11:29am

Yes, that would certainly be optimal. But that only makes sense when installing the dependencies with pip. But I don’t see a problem with that, even on Windows (even though I have never tried).

By the way, concerning

I don’t quite agree because every single package mentioned by @emyoulation can be installed with pip without any problems nowadays, thanks to wheels. Even PyTorch and Scipy! This is very different from the situation 3 or 4 years ago when one had to compile C extensions.

I’m not saying that we should start pip-installing dependencies of the core Gramps package - that would be a bad idea. But having an addon manager that automatically installs addon dependencies with pip in the background would be perfectly possible and very useful in my opinion.
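A minimal sketch of what such a background step might do is shown below. It is not the Gramps addon manager's API: every function name here is hypothetical, and it assumes `sys.executable` points at a Python interpreter that has pip available, which may not hold for bundled installs.

```python
import re
import subprocess
import sys
from importlib.metadata import PackageNotFoundError, version


def distribution_name(requirement):
    """Return the name part of a requirement string such as 'scipy>=1.5'."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    if match is None:
        raise ValueError("not a valid requirement: {!r}".format(requirement))
    return match.group(1)


def missing_requirements(requirements):
    """List the requirements whose distribution is not installed.

    Only presence is checked here, not the version constraint.
    """
    missing = []
    for requirement in requirements:
        try:
            version(distribution_name(requirement))
        except PackageNotFoundError:
            missing.append(requirement)
    return missing


def pip_install_command(requirements):
    """Build the pip command line for the running interpreter."""
    return [sys.executable, "-m", "pip", "install", *requirements]


def install_missing(requirements):
    """Install whatever is missing; return the list that was installed."""
    missing = missing_requirements(requirements)
    if missing:
        # Raises CalledProcessError if pip exits with a non-zero status.
        subprocess.check_call(pip_install_command(missing))
    return missing
```

Keeping the check (`missing_requirements`) separate from the install step would let the manager ask the user for confirmation before anything is downloaded.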

emyoulation · November 4, 2020, 2:44pm

My initial interest was in which functions Add-ons are implementing by hand that are now available in Python. (And where the built-ins might have more extensive error handling and multi-core threading support.)

(Is Gramps 5.1.3 at Python 3.3? I see GitHub is at 3.5)

The particular functionality that inspired that question is the redundantly implemented variations of CSV import & export.

My thought was that we might at least encourage new add-on development into leveraging those built-ins.

Mattkmmr · November 4, 2020, 4:00pm

@emyoulation regarding multi-core threading support:
I don’t think there is a single addon right now doing calculations heavy enough that multi-core threading is needed. The other addons are limited by Gramps and the database read/write, so multi-core threading added in addons wouldn’t speed anything up.

DavidMStraub · November 4, 2020, 4:19pm

I actually opened a pull request (still open) with a very small change that allows the CLI database handler class to open the database in read-only mode, which would allow multi-threaded read-only operations on an SQLite Gramps DB. Feel free to add your

The motivation for this is using multithreading in a web application (https://github.com/gramps-project/web-api/issues/16).
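To illustrate the general idea only, here is a sketch using plain `sqlite3` from the standard library, not the Gramps database handler from the pull request. The `person` table is a made-up stand-in and not the real Gramps schema. Each task opens its own read-only connection, so no connection is shared between threads.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def make_demo_db(path):
    """Create a tiny stand-in database (NOT the real Gramps schema)."""
    conn = sqlite3.connect(path)
    with conn:  # commits on success
        conn.execute("CREATE TABLE person (handle TEXT PRIMARY KEY, surname TEXT)")
        conn.executemany(
            "INSERT INTO person VALUES (?, ?)",
            [("h1", "Garner"), ("h2", "Smith"), ("h3", "Garner")],
        )
    conn.close()


def read_only_uri(path):
    """Build an SQLite URI that opens the file in read-only mode."""
    return Path(path).resolve().as_uri() + "?mode=ro"


def count_surname(path, surname):
    """Open a private read-only connection and run one query."""
    conn = sqlite3.connect(read_only_uri(path), uri=True)
    try:
        row = conn.execute(
            "SELECT COUNT(*) FROM person WHERE surname = ?", (surname,)
        ).fetchone()
        return row[0]
    finally:
        conn.close()


def count_many(path, surnames, workers=4):
    """Run the counts concurrently, one connection per task."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda name: count_surname(path, name), surnames))
```

Because the connections are opened with `mode=ro`, SQLite rejects any write attempted through them, so reader threads cannot modify the tree by accident.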
