Gramps and recording and comparing DNA-matches

Hi,

(The first two paragraphs are background; feel free to skip them.)

I just downloaded Gramps and am liking what I see so far. I was (and still am) a user of “The Master Genealogist” (TMG); I haven’t moved on since its development stopped, and I haven’t updated my database much for a few years.

I have however gotten into working with autosomal DNA tests, and it’s time to apply that to updating my genealogy database, so I need a new tool. I also want to make my approach much more source-centric and I was considering writing something myself, but it’s been years since I did much serious programming and I never made a large application from scratch, so I went looking for what already exists in open source and found Gramps.

I’ve searched a bit though, and I haven’t found much about how Gramps and Gramps users include DNA match results. So I’m wondering if that is because I’m bad at searching, or if it just isn’t out there. How do Gramps users include the results from DNA-testing in their database?

Ideally I’d like something like the functionality of Genome Mate Pro, for those that are familiar with that application, built into a genealogy program, and keep track of my tree, and imported DNA-matches and their ancestor trees, in one place.

Gramps doesn’t have any DNA specific features at this time. So if someone has a good understanding of what sort of features are needed, we would love to hear from you.

If I had to store something about DNA at this time, I would probably add it to a Person’s Attributes. That way it would stay attached to the person. This would be good if the data was reasonably concise, not so good if it was a full report.

The next step would be to write a tool or report that could do reasonable things with the DNA attribute, like look for matches in other people.

Both of these things can be done without changing Gramps itself, as we have an addon ability that allows easy addition of many types of features. This would be the way to go for experimental features like this, in my opinion. It can be developed and tested on your own machine, even while Gramps is being updated by the developers. And it can be shared easily and quickly, once it gets to the point where it pretty much works.

I wrote an experimental gramplet which used DNA markers stored in attributes. There is some information in feature request #8919 (Add ability to record Genetic information eg: Haplogroup).

@Nick-Hall @prculley I’m currently working on two new filter rules “persons sharing yDNA” and “persons sharing mtDNA” to filter relatives sharing the same DNA/haplogroups.

I’m also interested in helping to add new features for genetic genealogy to Gramps. Some ideas:

  1. I don’t think that Person Attributes alone are enough to store DNA data, e.g. you can add the haplogroups and a few DNA mutations, but more than that and it gets messy. A new genetic/medical tab in the person editor might be a better choice. That would also allow more than a key/value pair. The raw data files of the companies usually have 4 columns:
     rsid        chromosome   position     genotype
     rs1234      1            12345        GG
  2. The files could be added similar to other media. If a user provides the rsid of what they are interested in, Gramps could scan through the external raw file and add the rest of the data without the user having to add it. Note: The position on the chromosome depends on the reference genome and can differ between files. The genotype can also be reported on the opposite strand, e.g. CC instead of GG, or AA instead of TT.
  3. For reports, comparing two files should also be possible, as well as extracting how many cM of DNA are shared. The tool Lineage might be worth looking into.
  4. From the amount of shared cM, possible relationships between two persons can be calculated. (see here)
  5. Genogram reports have already been requested several times as a feature.
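
To make the idea of scanning a raw file for chosen rsids a bit more concrete, here is a rough sketch in plain Python (not a Gramps API). It assumes the four whitespace-separated columns shown above plus `#` comment lines; `find_rsids` and `same_genotype` are names made up for this example, and real exports differ between companies.

```python
# Hypothetical helpers for illustration only; not part of Gramps.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_rsids(lines, wanted):
    """Return {rsid: (chromosome, position, genotype)} for the wanted rsids."""
    wanted = set(wanted)
    found = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split()
        if len(fields) < 4 or fields[0] not in wanted:
            continue
        rsid, chromosome, position, genotype = fields[:4]
        if position.isdigit():  # ignores a column-title row
            found[rsid] = (chromosome, int(position), genotype)
    return found


def same_genotype(a, b):
    """True if two genotypes agree directly or when one is read on the opposite strand."""
    flipped = a.translate(_COMPLEMENT)
    return sorted(a) == sorted(b) or sorted(flipped) == sorted(b)
```

Because the position depends on the reference genome, the sketch matches on the rsid only and just carries the position along. The strand check is deliberately naive: for A/T and C/G SNPs the two strands cannot be told apart this way.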

Great answers!

I don’t think storing a whole raw DNA file is particularly useful at this point, although it might be in the future. The matching algorithms are quite large beasts, and if one has access to the raw files for multiple people, it is likely better to just upload them.

Storing Y-DNA and mtDNA haplogroups is likely useful. I don’t see it as too complicated, considering it’s genuinely a property of each individual. A challenge might be that the designations keep changing; at least that’s my impression from observing discussions about such tests. I haven’t sprung for one myself.

What I personally am looking for is the ability to store data on autosomal matches. For each match, which is a pair of individuals, match data consists of one or more segments with information about chromosome, start and end point and length in centiMorgans.
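
As a minimal sketch of that data shape (plain Python, not how Gramps stores anything), a match could be modelled as a pair of people plus a list of segments. The class and field names here are assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Segment:
    chromosome: str  # "1" to "22", or "X"
    start: int       # start position in base pairs
    end: int         # end position in base pairs
    cm: float        # length in centiMorgans


@dataclass
class Match:
    person_a: str    # hypothetical identifiers, e.g. Gramps IDs
    person_b: str
    company: str     # where the match was reported
    segments: list = field(default_factory=list)

    def total_cm(self):
        """Sum of the lengths of all matching segments."""
        return sum(segment.cm for segment in self.segments)

    def on_chromosome(self, chromosome):
        """Segments of this match on one chromosome, ordered by start position."""
        chosen = [s for s in self.segments if s.chromosome == chromosome]
        return sorted(chosen, key=lambda s: s.start)
```

Keeping the company on the match rather than on the person would allow the same pair of people to have one match record per testing company.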

As for using this data, a starting point would be being able to search and compare such matches visually per chromosome. When I have a new match with unknown shared ancestry, it’s convenient to be able to see whether I already have hypotheses about the line for the segments in question, across companies.

The next step up from that would be automating such comparison.

Now there are lots of tools offered by the testing companies themselves, but they of course do not work directly for matches from different companies. There are also several third-party tools, but they require duplicating a lot of effort. I have primarily used GenomeMatePro, for instance, to store and compare match data and not to fully document the relationships discovered, because the built-in tools for storing the ancestry of the matches are not user friendly and require a lot of duplicated effort, both within the software and with my genealogy program.

Yes, the haplogroups evolve over time, but what don’t change are an individual’s haplotypes, and perhaps those are what should be stored as personal attributes, or as attributes related to “DNA test” events. Of course it would also be useful to store the haplogroups too, realizing they could change.

I share your feelings about Genome Mate Pro and have tried to think of ways to store some information about autosomal segments. Mostly, I’m interested in replicating its Segment Match feature, to document which segments of my DNA I inherited from which ancestors, and secondarily which cousins share them.

For storing information about matches, I have thought of using a “DNA match” event to be shared by both parties, with an event attribute for each matching segment. The format of the attribute’s value would be something that could be parsed for purposes of comparison to segments from other DNA match events. For example, suppose I match a person on chromosome 2 from position 120000000 to 140000000. I could store this value as “2-120000000-140000000” or perhaps with some other delimiter. Now, if I want to know whether that overlaps with a segment from another match, such as “2-130000000-150000000”, then I somehow need to parse the values first. It would be nice to have filters that could do this.
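
The parsing step itself would be small. Here is a minimal sketch in plain Python, assuming the “chromosome-start-end” format from the example above; `parse_segment` and `overlap` are made-up names, and hooking this into an actual Gramps filter rule is not shown.

```python
def parse_segment(value, sep="-"):
    """Split a stored value like '2-120000000-140000000' into its parts."""
    chromosome, start, end = value.split(sep)
    start, end = int(start), int(end)
    if start > end:
        raise ValueError(f"start is after end in {value!r}")
    return chromosome, start, end


def overlap(value_a, value_b):
    """Return the shared (chromosome, start, end), or None if there is none."""
    chrom_a, start_a, end_a = parse_segment(value_a)
    chrom_b, start_b, end_b = parse_segment(value_b)
    if chrom_a != chrom_b:
        return None
    start, end = max(start_a, start_b), min(end_a, end_b)
    return (chrom_a, start, end) if start < end else None


# The two example segments share positions 130000000 to 140000000.
print(overlap("2-120000000-140000000", "2-130000000-150000000"))
```

Overlapping positions alone do not show that two matches share the same ancestor, since a segment can sit on either the maternal or the paternal copy of the chromosome.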

So, to generalize the problem, I would be interested to learn whether there are precedents among Gramps users for storing other multi-part values within a single attribute, and using them in comparisons.

Hmm, as short-term solutions there are several ways that would work for DNA matches:

  1. shared custom “DNA match” events with match data as attribute for each segment
  2. shared notes with custom type “DNA match” and match data
  3. Associations with match data in the value field for each segment (would need to be added to both persons)

Personally I think #2 might be the best way, but they all would require users to add all match data the same way, e.g. “Chr-Start-End-cM”. Each line of the note could also be a separate matching segment for the two people.
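
Reading such a note back would look roughly like this. It is a sketch only: it assumes one “Chr-Start-End-cM” segment per line and nothing else in the note, and `parse_match_note` is a hypothetical helper, not an existing Gramps function.

```python
def parse_match_note(text):
    """Turn note text with one 'Chr-Start-End-cM' line per segment into tuples."""
    segments = []
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between segments
        parts = line.split("-")
        if len(parts) != 4:
            raise ValueError(f"line {number}: expected Chr-Start-End-cM, got {line!r}")
        chromosome, start, end, cm = parts
        segments.append((chromosome, int(start), int(end), float(cm)))
    return segments
```

Raising an error on a malformed line, rather than skipping it, makes it obvious when a note was typed in a different layout.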

An individual’s actual DNA doesn’t change over time, but their haplotype sort of does, for three reasons:

Most people take a test covering a limited set of markers, giving them a high-level, not-very-specific haplotype. Later they may choose to take a more specific test.

The nomenclature for haplotype is not fixed and has changed in the time Y-DNA tests have been available.

The grouping of haplotypes is based on the sample of Y-DNA available today. As more people test, some groupings are found to be mistaken and renamed.

Yes, I agree.

I was thinking that for Y DNA, for example, one could have a separate attribute for each marker, storing as a value the number of short tandem repeats for that marker:

Key=“DYS393”

Value=“13”

Key=“DYS390”

Value=“23”

etc.

As more detailed testing is done, more attributes could be added. Having the attributes stored as part of a DNA test event could also be useful if somehow a person got slightly different results from different tests due to some error in the lab.

In that way, what is known about the person is stored as attributes of the person. I’m less sure about where to store what is assumed about the person, namely their assumed haplogroup.

Regardless of what is stored where, personally I need to become more familiar with how to use attributes in Gramps, especially for comparisons or calculations, and at what point some programming would be required in order to do anything useful with them.
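
For what it is worth, the comparison itself needs very little programming once the marker values are in hand. A minimal sketch in plain Python, assuming each person’s markers have already been read into a dict of key/value strings like the attributes above; `compare_markers` is a made-up name, and the simple step count is only one possible way of measuring distance.

```python
def compare_markers(person_a, person_b):
    """Compare two {marker: value} dicts on the markers present in both."""
    shared = sorted(set(person_a) & set(person_b))
    differences = {}
    for marker in shared:
        a, b = int(person_a[marker]), int(person_b[marker])
        if a != b:
            differences[marker] = (a, b)
    # Sum of repeat-count differences across the shared markers.
    distance = sum(abs(a - b) for a, b in differences.values())
    return shared, differences, distance
```

This only handles markers with a single integer value. Markers reported with several values in one field would need their own parsing, and markers that only one person has tested are left out of the comparison.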

This is more or less how Legacy Family Tree stores the DNA results…

They have a “DNA” button that opens a “DNA form”, something similar to the Forms gramplets, where you can add all your DNA results as data pairs, but you need to type them in manually (from the CSV or other tabular formats you get from the test provider).
Legacy has a predefined list of tests that you can select (something like Types in Gramps?); then you get a list with the definitions and type in the values.
If you have a Y12 and a Y65 and a Y100-something, you need to type in all the values, and you register them on the test date. In addition, it has a dedicated field for the haplogroup, but this is fixed: if you change it, you change it for all the tests. It’s not a list of triplets with a field name, value and date (as it should be).

I’m not a developer, so I don’t know how much work it would be to add a form for DNA, but I know that since Gramps uses Python, it has an advantage over much other software, because of the number of Python libraries already available for analyzing, calculating and visualizing DNA.

Nearly every DNA or humanities research lab in the world uses Python or R for analysis… and in addition you have great visualization libraries like NetworkX (but it is under BSD; I don’t know if that is compatible with the Gramps license).
I’m sure it would be easy to find multiple libraries for analyzing, calculating and visualizing that are compatible with the Gramps licensing regime and that can be used with the Gramps UI libraries (but that would be up to the developers to decide).

I think it would be best if it could be a separate module, like the name editor, where you could add multiple test results with the name, type and date of the tests taken. This module could of course store its data as personal attribute pairs/triples (if the next version of Gramps provides attributes with dates), as a test event for the person, or both, since an event would give some additional fields for information, and a DNA test is actually an event in a person’s life (something that person has done one or multiple times in her/his life)…

I would prefer that the data were added as pairs/triplets/quadruplets (four-tuples), instead of in a delimiter-separated list or a list of pairs in one attribute field… I think it would be easier for the human eye to read and understand, but of course in reports and other visualizations of the DNA data, it has to be merged into something understandable, i.e. a list with the metadata of the test as a description…
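
To illustrate what a quadruplet could look like, here is a sketch in plain Python. Gramps attributes are not shaped like this; the class, its fields and the helper names are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DatedAttribute:
    name: str        # field name, e.g. "Haplogroup" or "DYS393"
    value: str
    recorded: date   # date of the test that produced the value
    test: str = ""   # fourth element: which test the value came from


def history(attributes, name):
    """All recorded values for one field, oldest first; nothing is overwritten."""
    entries = [a for a in attributes if a.name == name]
    return sorted(entries, key=lambda a: a.recorded)


def latest(attributes, name):
    """The most recently recorded value for a field, or None."""
    entries = history(attributes, name)
    return entries[-1] if entries else None
```

With this shape, a new haplogroup assignment is added as another entry instead of replacing the old one, so earlier results stay available for comparison.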

Many of the libraries and research software packages, e.g. Cytoscape and other more specific medical research software, read CSV (node and edge lists, or network lists). The Gramps CSV export actually needs only a little reshaping to be imported into Cytoscape and be useful, and it should be possible to add fields so that it could be used for DNA purposes. A graph visualization tool like Gephi, Palladio or Cytoscape usually uses a two-list approach: one list (CSV file) for nodes and another for edges (relations, for those not in the graph world). In addition to the basics, the lists can hold additional values like dates, periods, tags, weights, directions and other attributes that can be used in calculations on different graph networks or for analyzing the data itself…
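
The two-list approach could be sketched like this with Python’s standard `csv` module. The column names (`Id`, `Label`, `Source`, `Target`, `Type`, `Weight`) are an assumption based on what graph tools commonly accept, so check what your tool expects; `write_graph_csv` and the sample IDs are made up, and using shared cM as the edge weight is just one idea.

```python
import csv
import io


def write_graph_csv(people, matches, nodes_out, edges_out):
    """Write a node list and an edge list for a graph tool.

    people:  {person_id: display_name}
    matches: iterable of (person_id_a, person_id_b, shared_cm)
    """
    nodes = csv.writer(nodes_out)
    nodes.writerow(["Id", "Label"])
    for person_id, name in sorted(people.items()):
        nodes.writerow([person_id, name])

    edges = csv.writer(edges_out)
    edges.writerow(["Source", "Target", "Type", "Weight"])
    for source, target, shared_cm in matches:
        # A DNA match has no direction, so the edge is undirected.
        edges.writerow([source, target, "Undirected", shared_cm])


# In-memory example; with real files, open them with newline="".
nodes_file, edges_file = io.StringIO(), io.StringIO()
write_graph_csv({"I0001": "Person A", "I0002": "Person B"},
                [("I0001", "I0002", 42.5)], nodes_file, edges_file)
print(edges_file.getvalue())
```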

These are some of the Python libraries I have found for DNA. I have not tested any of them, just collected them for later “use”; whether they can be used in Gramps, I do not know…

Andrews, J. (2019). A minimal desktop app for easy and convenient gene annotation.: J-andrews7/Genotify [JavaScript]. Retrieved from https://github.com/j-andrews7/Genotify (Original work published 2017)

Frampton, M. (2018). seqfam: A package primarily designed for analysing next generation sequencing DNA data from families with pedigree information in order to identify rare variants that are potentially causal of a d… [Python]. Retrieved from https://github.com/mframpton/seqfam (Original work published 2017)

Johns, L. R. (2019). Lorarjohns/DNA_pandas_selenium [Jupyter Notebook]. Retrieved from https://github.com/lorarjohns/DNA_pandas_selenium (Original work published 2019)

Riha, A. (n.d.). lineage: Tools for genetic genealogy and the analysis of consumer DNA test results (Version 3.0.1) [Python, OS Independent]. Retrieved from https://github.com/apriha/lineage

Treece, J. (2019). Jeftreece/dnamatch-tools [Python]. Retrieved from https://github.com/jeftreece/dnamatch-tools (Original work published 2019)

Regarding haplogroup/haplotype:
The haplogroup is defined by SNP mutations of the Y-DNA which occurred over time. So if a new (more recent) SNP mutation is found which creates a new, more specific subclade further down, your old haplogroup should already contain it (but was less specific). I’m unsure how often it really happens that your first haplogroup prediction was so wrong that you moved to another branch/haplogroup of the tree and your first haplogroup did not contain the new subclade. Is that still a common problem?

I think the only way to auto update/change/rename to newer haplogroup/haplotype would be having a raw DNA file to analyze in gramps, so probably your own haplogroup (where a raw data file exists) could be updated, but all other persons where the haplogroup was entered manually would have to be changed manually again. Or does anyone know another solution?

I don’t know that much about DNA; I was just thinking that different tests could give different results, also regarding haplogroups, and in historical research ALL data should be saved and stored, and it should be possible to use the old data for comparing changes over time…

I think that about all data, not only DNA. That’s why I really hope for dated attributes, or even better quadruplets, because then it will be possible to use the attributes to create custom linked data that is not normally done with roles, tags and events…

I know that a lot of what I wish for is something no one else uses or even sees any benefit from, but it’s still something I think can help those who use research tools outside Gramps and really want to use Gramps as a primary database for as much of their data as possible…

Gramps is already a really good choice for storing most humanities research data, but there are some things to be done before it’s easy to use the data outside of Gramps, in software not created for GEDCOM; most research tools still use CSV or one of the network-specific JSON standards out there…

This is of course way outside the Gramps usage area, but it would be great if it were possible to make the data exchange even more “Open Data” like… as it is now, Gramps is one of the few feature-rich genealogy programs (commercial or free) that actually work with CSV, JSON and XML…
I think the GEDCOM 5.x.x standard should be banned, because I don’t want to be limited by a lineage-linked research approach…

But that’s just me… I have understood that most users of genealogy software don’t actually understand or care about the difference between lineage-linked, event-based and document/fact/evidence-based research…

Users are already using the existing Forms addon to do this. The markers are recorded as attributes in a “DNA Test” event. My experimental gramplet can then be used to compare markers across related people in a family tree. What we didn’t discuss at the time, was how to import data from files provided by the test companies.

Yes, that seems to be a good starting point.

Do you have an example data file? What do you want the comparison chart to look like?

The previous posts have discussed storing some of the raw data, and storing information about matches. Another thing that I briefly mentioned earlier was this:

If you’re not familiar with Genome Mate Pro and its Segment Match feature, you can catch a glimpse of it at the end of this video (skip ahead to 18:24). That’s not necessarily the kind of display that I’d like to see in Gramps, but I would like to have that information displayable somehow, perhaps layered onto some kind of a pedigree or fan chart.

I wouldn’t necessarily want to use Gramps for doing all of the analysis that can be done in Genome Mate Pro, but would at least like to store the resulting data needed to create such a display, which would be details about which segments came from which ancestors.

A chart like that would be fairly easy to write. Is that what you want?

What is the format of the raw data? Do you have an example?

If I know what data we have to work with, and what we want to achieve, then I can choose the appropriate data structure.

And just to clarify – although the Segment Map is essentially a “roll up” of data about many matches, it would not be necessary to store all of the match-level data in Gramps in order to create it. Rather, one could store just the rolled-up results, that is to say, which segments came from which ancestors.

Having said that, there certainly are uses for storing the match-level data, as described in previous posts within this thread.

Hi Nick, I will put together some sample data for you. It will be “rolled up” data, not raw data. FYI, users have only their own raw DNA data, and no raw data from any of their matches (except for close relatives whose DNA testing “kits” they manage).

Hi Nick,

I have a spreadsheet I’d like to attach, but I don’t see how I can do that. Does this site not support attachments, or is it a privilege I need to earn? I see I can upload images, but it doesn’t accept a spreadsheet. Anyway, here’s a description of the spreadsheet, in the hope that I’ll be able to attach it eventually, otherwise I could email it directly to you.

Each line represents a segment of my DNA and its path of descent to me from the ancestral couple that I share with a given DNA match. The top half of the sheet is for my maternal chromatids, and the bottom half for my paternal chromatids.

To walk through the first line (row 3) as an example: the matching person is a third cousin on my mom’s side, so the ancestral couple that we share are our great-great-grandparents. So, I know that this segment came from my great-grandfather (“Jacob” in column L), and my cousin got it from her great-grandmother (who was Jacob’s sister), but we don’t know which of their parents it came from, thus the question mark in column M. Working back from right to left, Jacob passed it to his son Harvey, who passed it to his daughter (my mom).

Further down you’ll see some placeholder lines where I don’t yet have any details. I’ve set up a few specific lines for chromosome 23(X), rows 43-49, to remind me that it’s only inheritable from certain ancestors (it cannot be passed from a father to a son). And the 23(Y) line, row 92 is simply my father’s father’s father’s etc. line.
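
The X rule can be written down compactly. A minimal sketch in plain Python, using Ahnentafel numbering (1 is the test taker, 2n is the father of n, 2n + 1 the mother of n); `x_ancestors` is a made-up name, not a Gramps function.

```python
def x_ancestors(generations, sex="M"):
    """Ahnentafel numbers of ancestors who can have passed on X-chromosome DNA."""
    current = [(1, sex)]
    result = []
    for _generation in range(generations):
        parents = []
        for number, person_sex in current:
            if person_sex == "F":
                # A daughter receives an X from her father.
                parents.append((2 * number, "M"))
            # Everyone receives an X from their mother; a son gets only this one.
            parents.append((2 * number + 1, "F"))
        result.extend(number for number, _sex in parents)
        current = parents
    return sorted(result)


# For a male test taker, three generations back:
print(x_ancestors(3, "M"))  # [3, 6, 7, 13, 14, 15]
```

Any ancestor missing from this list can be ruled out as the source of an X segment, which is the point of keeping those lines separate.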

Columns D and E (cM and SNPs) are attributes of each segment that are useful to store, but are not necessary for displaying a segment map.

Also, the segments noted here are really just the portions that I have in common with each of these matching people. Returning to the example in row 3, the portion of chromosome 1 from position 91.4 to 109.1 (these numbers are actually in the millions, by the way) is just the part where my third cousin and I overlap, but each of us may have more, on one end or the other, which we inherited from the same ancestor; we can’t know until we find more matching cousins. It seems unlikely that I’ll ever come anywhere near to completing the map, though it will continue to fill as more people have their DNA analyzed.
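
To give an idea of how little data a segment map needs, here is a sketch in plain Python. It assumes each spreadsheet row has been reduced to chromosome, side, start and end (in millions), and the ancestor the segment is attributed to; the function names and the tuple layout are assumptions, not the actual spreadsheet columns.

```python
from collections import defaultdict


def build_segment_map(rows):
    """Group (chromosome, side, start, end, ancestor) rows per chromatid."""
    chart = defaultdict(list)
    for chromosome, side, start, end, ancestor in rows:
        chart[(chromosome, side)].append((start, end, ancestor))
    for segments in chart.values():
        segments.sort()  # order by start position for drawing
    return dict(chart)


def mapped_length(segments):
    """Total length covered by the segments, counting overlaps only once."""
    total = 0.0
    current_start = current_end = None
    for start, end, _ancestor in sorted(segments):
        if current_end is None or start > current_end:
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total
```

Comparing `mapped_length` for a chromatid against the full chromosome length would show how much of the map is still empty.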

I am more simple-minded in this regard, even though I am also using GenomeMatePro. So far, what I have done is to fill my tree out to the matching person, once I have access to their tree or their line to our common ancestor. Next I simply add a tag labeled “DNA” to that person, and finally I add a note. I use the relationship calculator tool and copy and paste its result into the note. I also add the chromosome info to the note. There is probably more I could do with what we have available in Gramps, but that’s what I’ve done so far. I could then create a report using the tag as a filter. A report could be designed to use a note formatted in a specific way, 1 line per segment, as already suggested. I think it would be helpful to be able to put in a header with the relationship and common ancestors (or simply bold text, so the report reads it differently than the segment lines). Just my half penny’s worth.