DNA data storage Strawman

With the advent of numerous DNA services, here is a strawman on how DNA data might be stored within Gramps. I have prototyped a few of these for viability. Hopefully this is the start of a conversation about a consistent way to enter DNA data.

Person relationship to DNA company: A person can be working with one or more DNA companies (Ancestry, FamilyTreeDNA, 23andMe, GEDmatch, MyHeritage, …). This strawman suggests storing that relationship as a Person attribute. The person will likely have at most a single value for any attribute, and no attribute if there is no data. It is not really time-dependent. For instance:

  • Attribute: Ancestry ID with Value: Ancestry UserName
  • Attribute: GEDmatch kit with Value: GEDmatch kit number
  • Attribute: FTDNA kit with Value: FTDNA kit number
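As a rough illustration of this attribute scheme, here is a minimal sketch in plain Python. It does not use the Gramps API: the `Person` class and the `company_ids` helper are hypothetical stand-ins, and the set of attribute names is simply copied from the list above.

```python
from dataclasses import dataclass, field

# Attribute names copied from the list above; treating them as a fixed set is an assumption.
DNA_ID_ATTRIBUTES = ("Ancestry ID", "GEDmatch kit", "FTDNA kit")


@dataclass
class Person:
    """Hypothetical stand-in for a person record: a name plus (type, value) attributes."""
    name: str
    attributes: list = field(default_factory=list)

    def add_attribute(self, attr_type, value):
        self.attributes.append((attr_type, value))


def company_ids(person):
    """Return {attribute type: value} for the DNA-company attributes only."""
    return {t: v for t, v in person.attributes if t in DNA_ID_ATTRIBUTES}


joe = Person("Joe")
joe.add_attribute("FTDNA kit", "123456")    # made-up kit number
joe.add_attribute("Occupation", "Farmer")   # unrelated attribute, ignored below
print(company_ids(joe))  # {'FTDNA kit': '123456'}
```

A person with no DNA data simply has none of these attributes, so the helper returns an empty dict.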

Y-DNA data: As aligned with an experimental addon from 2017, the Y-DNA data would be a Person Event. The event would support the various levels (12, 25, 37, 67, 111, BigY) of markers. The Event type would be ‘DNA Test’ and would have attributes for each marker (DYS393: value, DYS390: value, …). Below is a sample:

SampleYDNA
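One thing per-marker attributes would make possible is comparing two tests marker by marker. The following is only a sketch under assumptions: the marker values are made up, the dicts stand in for the attributes of two ‘DNA Test’ events, and `marker_differences` is a hypothetical helper, not part of any existing addon.

```python
def marker_differences(markers_a, markers_b):
    """Compare two {marker: repeat count} dicts on the markers both tests cover.

    Returns {marker: (value_a, value_b)} for the markers whose values differ.
    """
    shared = sorted(set(markers_a) & set(markers_b))
    return {m: (markers_a[m], markers_b[m]) for m in shared if markers_a[m] != markers_b[m]}


# Made-up STR values for illustration only.
joe = {"DYS393": 13, "DYS390": 24, "DYS19": 14}
brother = {"DYS393": 13, "DYS390": 25, "DYS19": 14, "DYS391": 11}

print(marker_differences(joe, brother))  # {'DYS390': (24, 25)}
```

Markers present in only one of the tests (for example when a Y37 test is compared with a Y67 test) are ignored rather than reported as differences.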

DNA Segment Map: As aligned with the DNA Segment Map gramplet, this would be an Association of type ‘DNA’ between the two people. A Note would contain the segment map information (chromosome, start, end, cM, SNP). Each of the DNA companies uses different criteria for defining a match, and some even allow the user to adjust these criteria. Does that require the company name to be tied to the Association type (DNA-FTDNA, DNA-GEDmatch, …), for instance? Or should the Note be attached to an Association Citation which identifies the company?
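To make the Note contents concrete, here is a minimal parsing sketch. The column order (chromosome, start, end, cM, SNP) comes from the description above; the comma-separated layout, the optional header row, and the `Segment` and `parse_segment_note` names are assumptions for illustration and may not match what the gramplet actually expects.

```python
import csv
from dataclasses import dataclass


@dataclass
class Segment:
    chromosome: str
    start: int
    end: int
    cm: float
    snps: int


def parse_segment_note(text):
    """Parse note text with one 'chromosome,start,end,cM,SNP' row per line."""
    segments = []
    for row in csv.reader(text.splitlines()):
        row = [cell.strip() for cell in row]
        if len(row) < 5 or not row[1].isdigit():
            continue  # skip blank lines and a possible header row
        segments.append(
            Segment(row[0], int(row[1]), int(row[2]), float(row[3]), int(row[4]))
        )
    return segments


# Made-up segment data for illustration only.
note = """chromosome,start,end,cM,SNP
1,1000000,9000000,12.5,2100
X,5000000,20000000,20.1,1800
"""
for segment in parse_segment_note(note):
    print(segment)
```

The chromosome is kept as a string so that values such as X can be stored alongside the numbered chromosomes.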

Shared cM length: This is the overall length of the shared DNA. This would be an Association of type ‘cM’ between the two people. A Note would contain the shared length. Each DNA company uses different criteria. A report (sample below) could list all of the shared lengths for a person along with the calculated relationship and common ancestor. This could also add some probabilistic relationship from the Shared cM Project.
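A small sketch of why the shared length depends on the company's criteria: the same segments give different totals under different minimum-segment thresholds. The 7.0 cM default and the segment values are made up for illustration and are not any company's actual setting.

```python
def total_shared_cm(segment_cms, min_cm=7.0):
    """Sum the segment lengths (in cM) that meet a minimum-length threshold.

    The default threshold is an arbitrary assumption for this sketch.
    """
    return sum(cm for cm in segment_cms if cm >= min_cm)


segments = [32.5, 12.0, 6.5]  # made-up segment lengths in cM

print(total_shared_cm(segments))              # 44.5
print(total_shared_cm(segments, min_cm=5.0))  # 51.0
```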

Thoughts?

Regarding Y-DNA: I have only done the Y-37 test so far, which gives the number of short tandem repeats (STRs) at various points, as shown in your example. I understand that one can also purchase tests for particular bundles of SNPs on the Y chromosome. These are used to determine which branch of the haplotree a person is on. It would be good to store that (meaning the person’s predicted haplogroup) as an attribute. I don’t know if those test results also include the details of each SNP in the bundle (some bundles have hundreds or even thousands of them). Unless there’s an easy way to use the attribute data (for example, comparing people to see which markers differ), I don’t know if it would be worth the trouble of storing the markers there (vs. just in a PDF of the test results). For the same reason, I don’t use the “forms” feature in Gramps to store census details in person event ref attributes.

I have done the full mtDNA test. It, too, provides a predicted haplogroup. The results also include the details of how my mtDNA differs from two different reference sequences (RSRS and rCRS), which could be stored in attributes (a few dozen values), but again I don’t think it would be worth the trouble.

A word of caution about the start and end locations in the notes for the segment map: the locations that one sees in their match results are based on a particular “build”. I downloaded my raw data from FamilyTreeDNA in both “build 36” and “build 37” formats, and uploaded the build 37 data to GEDmatch. I think only the build 37 download is available on FTDNA now, but I notice that my match results there still show build 36 locations, even for very recent matches. My point is, one should not compare the start/end locations without knowing whether they correspond to the same build.
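One way to enforce that caution in code is to carry the build with every segment and refuse to compare across builds. This is only a sketch: the dict layout and the `overlap` helper are hypothetical, and the positions are made up.

```python
def overlap(seg_a, seg_b):
    """Return the overlapping (start, end) of two segments, or None if they do not overlap.

    Each segment is a dict with 'build', 'chromosome', 'start' and 'end'.
    Raises ValueError when the builds differ, because the positions are then
    not directly comparable.
    """
    if seg_a["build"] != seg_b["build"]:
        raise ValueError(
            f"cannot compare build {seg_a['build']} with build {seg_b['build']}"
        )
    if seg_a["chromosome"] != seg_b["chromosome"]:
        return None
    start = max(seg_a["start"], seg_b["start"])
    end = min(seg_a["end"], seg_b["end"])
    return (start, end) if start < end else None


# Made-up positions for illustration only.
a = {"build": 37, "chromosome": "1", "start": 1_000_000, "end": 9_000_000}
b = {"build": 37, "chromosome": "1", "start": 5_000_000, "end": 12_000_000}
c = {"build": 36, "chromosome": "1", "start": 5_000_000, "end": 12_000_000}

print(overlap(a, b))  # (5000000, 9000000)
# overlap(a, c) raises ValueError because the builds differ
```

The sketch deliberately does not try to convert positions between builds; it only makes the mismatch visible.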

As you say, the number of cM can be calculated differently by different companies. One piece of the calculation is a dataset that shows the recombination rate between each SNP on each chromosome as observed in different populations. Just like life expectancy tables, these data are descriptive, not prescriptive. Another piece of the calculation is the choice of a mapping function (e.g. Haldane or Kosambi) to interpret that data. So there is no “right” or “true” cM; whatever number you have is just a clue as to how distant the relationship might be.
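To show how much the choice of mapping function alone can change the number, here is a sketch of the textbook Haldane and Kosambi functions, which convert a recombination fraction r (between 0 and 0.5) into a map distance in cM. It illustrates the formulas only and is not a description of how any testing company actually computes its cM values.

```python
import math


def haldane_cm(r):
    """Haldane map distance in cM for recombination fraction r (0 <= r < 0.5)."""
    if not 0.0 <= r < 0.5:
        raise ValueError("r must be in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * r)


def kosambi_cm(r):
    """Kosambi map distance in cM for recombination fraction r (0 <= r < 0.5)."""
    if not 0.0 <= r < 0.5:
        raise ValueError("r must be in [0, 0.5)")
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


r = 0.10
print(f"Haldane: {haldane_cm(r):.2f} cM")  # Haldane: 11.16 cM
print(f"Kosambi: {kosambi_cm(r):.2f} cM")  # Kosambi: 10.14 cM
```

For the same recombination fraction the two functions already differ by about one cM, before any differences in the underlying recombination-rate dataset are taken into account.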

I think raw data (e.g. SNPs, STRs) should be attached as media objects instead of adding all that information into Gramps.

  1. Person relationship to DNA company: I use person events for that (event type: DNA test, date: date of the test, description: kit number or username, place: name of the company)
  2. Y-DNA data: Also a DNA test event, with my YFull kit number in the event description. The 780(!) STR markers are added as a media file. My latest Y-subclade/haplogroup and my mt-haplogroup are added as person attributes.
  3. Shared cM length: I think the current system is okay, but it has some weaknesses: (1) no information about the testing company, (2) no information about the genome build version and (3) no way to indicate triangulation.

Initially I also used a Person event for the DNA company. But when I wanted to leverage the Ancestry ID (as in the cM report below) I could not easily do that in the Report (probably due to my lack of experience programming Reports). So I changed to a Person attribute. The question is: why would you use the DNA company?

  1. I think the date of the test is important information, because things change over time, e.g. other testing chips or new matching algorithms. You might want to know that when comparing DNA tests taken a long time apart.
  2. It’s a minor point, but the testing companies all use different sets of SNPs which overlap only partly, so when comparing matches from different companies it’s useful to know which company each kit came from. For example, a match on GEDmatch between an Ancestry kit and a MyHeritage kit might give different results than a match between two Ancestry kits (for the same two people in both cases).

I was thinking they should be part of a Citation. The Source would identify the testing company/test methodology. The Citation date, or the date of the Note or Gallery object, would say how fresh the match data is. And the Media object or body of the Note would have the segment data.

And sharing the Citation to another Association pulls all the data along in one fell swoop.

Yes, that would also work well, and source/citation attributes could be used for the data.

Isn’t DNA a property or attribute of a human?

The test is an Event, but the result (or, more accurately, the digitized DNA data) would be an attribute of the person that the DNA sample originates from.

So shouldn’t the data itself be stored as attributes for a person, just like the color of her/his hair, the height or weight, the shoe size or how many teeth they have at a given time?

Some people have multiple kits on GEDmatch, if they uploaded data from multiple testing companies. So they could have multiple GEDmatch kit number attributes. That in itself is not a problem, but the association or note would need to clarify which one of the person’s GEDmatch kits was used in those particular matches.

Also, GEDmatch Tier 1 allows users to combine multiple kits into a “super kit”, but that is just one more kit number a person might have.
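A sketch of what "clarifying which kit was used" could look like: each match record names the specific kit on both sides, so a person with several kits stays unambiguous. The `KitMatch` record, the helper, and all kit numbers below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KitMatch:
    """One match between two specific kits (not just between two people)."""
    person_a: str
    kit_a: str
    person_b: str
    kit_b: str
    shared_cm: float


def matches_for_kit(matches, kit):
    """Return the matches in which the given kit took part on either side."""
    return [m for m in matches if kit in (m.kit_a, m.kit_b)]


# Joe has two kits in this made-up example; kit numbers and cM values are invented.
matches = [
    KitMatch("Joe", "KIT-A1", "Ann", "KIT-C9", 212.0),
    KitMatch("Joe", "KIT-B2", "Ann", "KIT-C9", 205.5),
]

for match in matches_for_kit(matches, "KIT-B2"):
    print(match.person_a, match.kit_a, match.person_b, match.kit_b, match.shared_cm)
```

A combined “super kit” would simply be one more kit number in the same scheme.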

Preliminary documentation for the DNA Segment Map gramplet
It has been submitted for peer evaluation before being made available through the Updated Add-ons interface.

Look in the See Also section for the link to the GitHub pull request. You can manually install the preliminary version with the understanding that the Peer Review might require changes to the format.

Yes, the DNA test is an Event. And there is at least one experimental DNA gramplet that stores the Y-DNA STR results as attributes of the DNA Test Event. A Y37 test would have 37 attributes, a Y67 test would have 67 attributes. It is a reasonable question to ask if these should be attributes of the person or attributes of the test. Since a person can start with a Y37 test and then get the Y67 test, I lean towards them as attributes of an Event rather than attributes of the Person. But that is just my opinion.

But there is also the shared match info, which is not an event or even a person attribute, but is tied to an Association of the 2 people who have the shared segments. As mentioned, different testing companies would generate different shared match lists (based on a minimum threshold, for instance), and the company/threshold could be stored in a Citation. The shared segment length reported by Ancestry is different from that reported by FTDNA, for instance. And some of these companies allow the user to select the threshold criteria.

The shared match is a research document…

And I meant that the DNA result should be stored as person attributes, not event attributes… even though the test is an event, the DNA and the digitized result are a property of the person who took the test… not a property of the test event, just a result of the test event…

I stored my Y-37 results (csv file) as a single media object, rather than taking the trouble to create 37 attributes, and then attached the media object to my person object and an event object. I probably don’t really need it in both places, just trying alternatives.

My mtDNA results are not in csv format, but I could make them so, and I could also attach the FASTA file.

If I look at the experimental DNA gramplet, its purpose is to determine whether someone who would share a DNA test result is present, and then to show those Y-DNA results for the others, implying that all those people have identical Y-DNA results (I assume the same for mtDNA for maternal relations).

So if Joe has a Y37 result, then Joe’s brothers (as well as Joe’s paternal 1st cousin, …) would see the same Y37 results in the gramplet view. These other people have not had their own Y37 test, so I don’t think they should have Person attributes. The Event is NOT shared in Gramps. The view is calculated on-the-fly for the gramplet.

Maybe another way to look at this is: if Joe has a Y37 test and Joe’s brother had a Y67 test, then they would each see both test results in this experimental gramplet view (assuming they used the DNA Test event with attributes). But each would only have their own DNA Test event.
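The on-the-fly idea can be sketched as a walk along the direct paternal line. This is an assumption about how such a view could be calculated, not a description of the experimental gramplet's code; the `father_of` mapping, the names, and both helpers are hypothetical, and the sketch ignores mutations and assumes the father links contain no cycles.

```python
def patrilineal_root(person, father_of):
    """Follow father links upward as far as the tree goes."""
    while person in father_of:
        person = father_of[person]
    return person


def expected_y_sharers(tester, father_of, males):
    """Males on the same direct paternal line as the tester, excluding the tester."""
    root = patrilineal_root(tester, father_of)
    return sorted(
        p for p in males if p != tester and patrilineal_root(p, father_of) == root
    )


# Made-up family: Joe and Bob are brothers, Cousin is their paternal first cousin.
father_of = {
    "Joe": "Dad", "Bob": "Dad", "Sue": "Dad",
    "Dad": "Grandpa", "Uncle": "Grandpa", "Cousin": "Uncle",
}
males = {"Joe", "Bob", "Dad", "Grandpa", "Uncle", "Cousin"}

print(expected_y_sharers("Joe", father_of, males))
# ['Bob', 'Cousin', 'Dad', 'Grandpa', 'Uncle']
```

Nothing is stored on the other people; the list is derived from the tree each time, which matches the point that only the tester has a DNA Test event.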

Maybe this experimental gramplet is not relevant and I shouldn’t be deriving a workflow based on it.

Question: what would you want to see in this case for Y-related people (if anything)?

@GaryGriffin - I think you misunderstand me…

I’m not talking about any match analysis results, only the data from each single kit.

But to answer your other question as I understand it:

So in the example you give, I would think this:

  1. Joe took the Y37 test on 20 June 2007; his kit number at FTDNA is “Kit-123456”.

  2. You register this as a personal Event in a “Register DNA” editor window in the Gramplet.

  3. The Gramplet creates one Personal Event with the kit number, the company, the type of test, and any other relevant metadata of the test itself. The Personal Event gets the name “Y37 - FTDNA - Kit-123456”. The Gramplet adds the raw data (a CSV file or other file type with the data) to the Event, but the sequence data (the data pairs) that you type in (or that gets imported as tabular data) is registered as attribute pairs on Joe, because the data itself is a “property” of Joe. (It should have been triplets, but Gramps Attributes don’t support that at the moment.)

  4. You do the same for Joe’s brother’s Y67, “kit-234567”.

  5. Then you run an analysis of the kits. The Gramplet creates a shared Event dated the day you run the analysis and names it something like “Y-DNA Match Event Y67”; the selected kits (kit number or ID) are added as Shared Event Attributes.
    You add the two kits to the Analyzing Match job, and you will of course get direct matches on those two kits (persons), so they get added to the Shared Event and given a Role Type of “Direct Kit Match Y37”, since the match can only be on the “lowest” kit.
    Every person in the database who “should” also have this Y-DNA, but does not have a test (kit), gets added to the Shared Event with a Role Type of “Calculated Match”, and of course the date of the run and all the “metadata” needed.

  6. Then you get a kit from Peder Johnsen. It’s a Y-111 (or something). You don’t have any connections between him and Joe, but you start your Gramplet and add his data, which gets saved as explained in steps 1-3. Then you want to run a new match analysis, so you select the other kits you have in your database for this new analysis event.

  7. This new Event gets the name “Y-DNA Match Event Y111” and the date of the run, and the selected kits are added as Event Attributes to this new Analyzing Event.
    You find that there are actually some connections between the three people, so the three will be added to the Shared Event: Joe with a Role Type of “Direct Match Y37”, his brother with a Role Type of “Direct Match Y67”, and Peder will also get a Role Type of “Direct Match Y67”.
    All the people with calculated matches will be added with the Role Type “Calculated Match”.
    All roles are added with the date of the Match Analyze “run day”.

The Gramplet itself can store all its data in the different Shared Events as Attributes, or in its own JSON-serialized object in the database, whichever is the best way of doing it.
If it is stored in the Event as Attributes, it can just store all metadata as data pairs with a given key name.
The key name can be based on a given ID for each of the “jobs”, so that they are unique.
But since each “run” gets its own Event, and the data is stored in the Event, each “metadata/gramplet app data” field key name can be the same in any of the Events.
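A minimal sketch of the "same key name in every Event" idea, using JSON strings in plain dicts. The dicts stand in for the attributes of two analysis Events; the `match_run` key, the helper names, the dates, and the third kit number are all made up for illustration.

```python
import json

RUN_KEY = "match_run"  # hypothetical attribute key, identical in every run Event


def store_run_metadata(event_attributes, run_date, kits, level):
    """Store one run's metadata as a JSON string under the fixed key."""
    event_attributes[RUN_KEY] = json.dumps(
        {"run_date": run_date, "kits": list(kits), "level": level}
    )


def load_run_metadata(event_attributes):
    """Read the run metadata back from an Event's attributes."""
    return json.loads(event_attributes[RUN_KEY])


# Each run is its own Event, so both can use the same key without clashing.
event_y67 = {}
event_y111 = {}
store_run_metadata(event_y67, "2021-01-10", ["Kit-123456", "Kit-234567"], "Y37")
store_run_metadata(
    event_y111, "2021-02-20", ["Kit-123456", "Kit-234567", "Kit-345678"], "Y67"
)

print(load_run_metadata(event_y67)["kits"])
print(load_run_metadata(event_y111)["level"])
```

If everything were kept in a single object instead, the key would need a per-job ID to stay unique, which is the other option described above.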

The Gramplet then shows a list of all the Kits registered (the Test Events) and all the Analyzing Match Events, with the most significant metadata displayed, i.e. name, date, type.
Two or more separate tabs, maybe?
The user can of course click on any entry in any list, and view the event and its data.
The list can be on the left, and if you click an entry you get the metadata information to the right of the list, or something…

If a user opens the Event in the Event Viewer, all the data is manually accessible.

Any visual or textual result should also be added to the Events as media files, or as Notes if the Note Gramplet can hold the type of data generated.
Say that the Gramplet can generate a visual SVG image and/or a tabular report in CSV, PDF, LaTeX, Markdown (or HTML) of the result: all the reports generated should be added to the event as media, but only once for each event. The reports stay the same until any of the Events is altered, say because a correction to the Kit data is made.
If Kit data is altered, this should be marked with a warning somehow.

If the user wants to delete some of the “runs” or some of the reports, they can just delete the Analyzing Event, the media files, or whatever else they want to delete manually.

PS. As I said at the start, I was not talking about how you run the analysis, how you store the data from the analysis or how the gramplet works, only about where the digitized DNA test data (the data pairs) is stored.
Each data pair key name (Attribute name) could be prefixed with the kit number (used as a unique Gramps ID, not the handle).
But what the easiest way to name the attributes would actually be, I can’t say…
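For what it is worth, the kit-number prefix could look something like the sketch below. The `kit:marker` naming with a colon separator is just one arbitrary choice, and both helpers and the marker values are hypothetical.

```python
def prefixed_attributes(kit_number, markers):
    """Build person attribute names of the form '<kit number>:<marker>'."""
    return {f"{kit_number}:{marker}": value for marker, value in markers.items()}


def markers_for_kit(attributes, kit_number):
    """Recover the {marker: value} pairs that belong to one kit."""
    prefix = f"{kit_number}:"
    return {
        name[len(prefix):]: value
        for name, value in attributes.items()
        if name.startswith(prefix)
    }


# Made-up marker values; the kit number is taken from the example above.
person_attributes = {"Hair color": "brown"}
person_attributes.update(
    prefixed_attributes("Kit-123456", {"DYS393": 13, "DYS390": 24})
)

print(markers_for_kit(person_attributes, "Kit-123456"))
# {'DYS393': 13, 'DYS390': 24}
```

The prefix keeps results from several kits apart on the same person while leaving ordinary attributes untouched.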

I know null and nothing about DNA, so all my names for events and so on are just taken out of thin air.


All I was saying is that the digitized data from the DNA test is a set of personal attributes, just like hair color, height, weight, shoe size, number of toes or any other unique attribute of a person you add…

I’m not a developer, so I don’t know what’s possible or not within the framework for gramplets. I really have no opinion on how you solve the calculations, the GUI, the look and feel, or how you store the metadata and results from the match analysis… The only thing I know is that it’s a lot of data…


Have you looked at the Python library lineage? It does a lot of different analysis, matching and merging of datasets… maybe you could use that to run the different processes, and create a stunning gramplet to display the result? I don’t know what’s possible or not, I only know that there is already at least one Python library that deals with genetic DNA data, so maybe you don’t need to implement all the algorithms yourself?


PS 2nd: This is no criticism of your or anybody else’s gramplets. These are just my logical thoughts about how I think things should be done and my wishes for how something should work, not a statement about what’s wrong or what’s right.


EDIT:
This is also one of those events that would benefit from Main - Sub Events. You register any DNA test kit as an Event, which will be the main event; then all the different types of tests you can do on the same kit number can be sub-events. So if you first started with a Y37 test, it will be a sub-event of the “Kit number 123456” Event, and if you later order additional tests on the same kit that do not generate a new kit number, you add those tests as sub-events as well, with all the metadata that you need for each test type…

And the same can apply to all Shared Analyzing & Calculation Events: any alteration can generate a new Sub-Event, e.g. if you add a new kit to the job, or you run a new job because you have added a lot of new people with DNA data registered to their profile, or similar scenarios…


I need to stress that these are just thoughts and ideas, and in no way a criticism of any Gramplet or of other ways of doing it.