Updated DNA gramplet ready for review

The DNA gramplet has been renamed to the DNA Segment Map gramplet. It addresses all user feedback on this thread: it adds a Legend with hotlinks, error checking, and GEDmatch format input, and it supports international character sets.

Be sure to uninstall DNA gramplet and install DNASegmentMap gramplet.


Nice work! Also the loading of the segments seems to work faster now.


Cross Posting to keep this thread alive.

Just wondering about a couple of things, since I am arriving late to this conversation:

  1. Do we know what the status of the gramplet is right now?
  2. Is/was there any discussion about including Y-STR extensions in this gramplet, or thoughts about a separate companion Y-STR gramplet?
  3. Extension of the data model to include the terminal SNP, where known.
  4. Export/import format considerations for backup and community processing.
  5. Imputed overlaps (i.e., back-calculated matches along the connection pathway between two individuals).

Apologies if this was already discussed.

#1) Gary submitted it for peer review. No word on when it will be available via the add-on management tools. For now, you still have to download it from the GitHub pull request (linked from the See Also section of the wiki page).

#4) This Gramplet leverages match data stored in Note objects, not externally as Media objects. So backup is transparent. However, export is problematic as Gramps export is focused on People objects (and those objects directly attached to people). Getting other data out in anything other than CSVs from Views (which tend to truncate data) is convoluted.

#5) This is strictly a visualization tool. It does not do validation nor analysis.

The Segment Map is a view of pairs of relationships (the Active person and as many associates as are defined). I am not sure how adding the Y-STR (especially if Y111) for the active person and/or each associate would be readable. I don't imagine it as a viz tool - maybe a report.

In a different thread, the case was made that the Y-STR could be described in a Media file or as a set of attributes for a person. There was an experimental gramplet that viewed the Y-STR if they were attributes on a DNA Test Event. That experimental gramplet could be modified if there was interest. Or if stored in a Media (csv) file, it could be previewed.

I think the real question is: what are the requirements of such a report, and what data would be stored and how? Should it include Y-DNA haplogroup, mt-DNA haplogroup, terminal SNP, …? I would guess these would be stored as Person attributes. Is there any extrapolation for relatives (assume brothers have identical Y-STR, for instance)?

During a support discussion, Michal Maňas provided a list of centromeres and a list of segments that are untested. (The latter makes gaps in compared matches less disconcerting.)

He wonders if a switch to overlay these on a graph is viable?

See the Discussion webpage for the Addon’s wiki page.

I added to the Discussion page the following:

I am not sure I follow. The map is all of the DNA Associated people. I don’t understand what is the undocumented feature. I will certainly add it once I understand what it is. Can you provide a sample with expected vs actual results?

I don’t understand the untested regions comment. I don’t think any of the sources of the DNA data (GEDmatch, FTdna, …) have regions which are untested. So where exactly did these untested regions come from? And do they apply to all of the associated DNA files (if you have 5 associated people, did all 5 have these same untested regions)? If so, then the simplest way would be to create a dummy person (name of UNTESTED, for instance) with the untested DNA info. That will paint that data. Or am I misunderstanding?

The Maternal/Paternal side is determined by the tree itself. I realize this is slightly different from what DNApainter does, because that tool does not also have an associated tree. Initially I coded this with a maternal/paternal flag required for each line (like DNApainter). The downside was that each line of each file had to be edited to add this flag, assuming you cut/paste from GEDmatch or similar. But the maternal/paternal relationship is per-person, not per-segment, so that meant lots of unnecessary editing. And consider that Gramps already knows the relationship, so why add it again?

In the case where you do not know which side, I can see why you might want to add a M/P flag to the person (not each line). What would be expected if the tree says the connection was on the paternal side and the DNA data had the flag for maternal? I removed the flag for these reasons.

My awareness of DNA testing practices is happily limited. (If I knew more, I’d be spending too much time collecting & correlating that kind of data.) But he seemed to be saying that some chromosome pairs were ignored universally. Dunno about that.

And the Maternal/Paternal factor comes from the way most matching guesstimates are presented.

Family Tree DNA’s matching breaks down into categories of Maternal or Paternal lineage & several distance categories (e.g., Maternal 3rd-5th cousin). Naturally, those estimates can be skewed by factors of consanguinity.

So I think Michal wants to use that kind of categorization data as a filter overlay for the traditional Tree data.

Of the data he provided, what is really intriguing is the list of centromere loci for each chromosome pair. It would be helpful to see a filled circle at each locus… or maybe a break in the bars being graphed.

I think we are mixing up Chromosome Painting with Chromosome Browsing.

TLDR: Chromosome Painting uses the active person and a vendor-specific reference ethnicity map, whereas Chromosome Browsing uses shared segments between the active person and associated person(s). This gramplet implements the latter.

Chromosome Painting
The new AncestryDNA Chromosome Painter has ‘untested’ segments, as does the FTDNA Chromosome Painter. These tools take the member’s DNA, compare it against the Ethnicity mappings that the vendor has collected, and paint the segment map accordingly. The ‘secret sauce’ is this ethnicity reference table, which each vendor builds based on the samples they have processed. There are ‘blank’ or untested areas in it. All you get is the ethnic region on the segment map (not a person’s name). This is useful for understanding where a particular segment came from geographically, in hopes of indirectly guessing which line it comes from. The maternal and paternal sides are each painted based on their own lookup in the Ethnicity map, so there is no ambiguity between the maternal and paternal sides.

Let me exaggerate a point to simplify. If John Doe has a mother who is 100% German and a father who is 100% Irish, the map would have all of the maternal segments one color for Germany and the paternal segments all a different color for Ireland. And there would be grey segments where untested.

The ‘secret sauce’ is proprietary to the vendor, so this really cannot be implemented in Gramps.

Chromosome Browsing
Chromosome Painting is not the same as Chromosome Browsing, which is what DNApainter and this gramplet do. The idea for this tool is to compare the active person with the shared segments (created by GEDmatch or FTdna or …) of associated people and paint the segments that match, identifying which person maps with the active person. There are no ‘untested’ regions in this analysis. If the associated person is maternally related (based on the tree), then the maternal chromosome is painted. Conversely, if they are paternally related, the paternal chromosome is painted. If there is no known genetic connection to the person, then both are painted (but with transparency).

Example: If John Doe is related to Mary Smith through his mother, then the segments they share would be painted on the maternal side of his segment map. If they are related through Mary’s father, Mary’s map would have the paternal segment painted. The more people associated, the more different colors on the map.
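The side-selection rule described above could be sketched as follows. This is only an illustration, not the gramplet's actual code: the function name, the string labels, and the 0.5 opacity used for "transparency" are all assumptions.

```python
def sides_to_paint(relationship_side):
    """Return (side, opacity) pairs for one associated person.

    relationship_side is 'maternal', 'paternal', or None when the tree
    shows no known genetic path (hypothetical labels for this sketch).
    """
    if relationship_side == "maternal":
        return [("maternal", 1.0)]
    if relationship_side == "paternal":
        return [("paternal", 1.0)]
    # No known connection: paint both sides, but semi-transparent.
    # The 0.5 opacity is an assumed value, not taken from the gramplet.
    return [("maternal", 0.5), ("paternal", 0.5)]


# John Doe is related to Mary Smith through his mother.
print(sides_to_paint("maternal"))
# An associate with no known path in the tree.
print(sides_to_paint(None))
```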

If you hover over the shared segment on the map, it will tell you the shared length (in cM). This provides a hint on the genetic distance.

The Shared cM Project has a lookup from relationship to cM range. The ranges overlap significantly for lower cM values, so a given cM value is consistent with a large range of possible relationships. For instance, for 75 cM, the table from DNApainter (by Jonny Perl) is below. This is more info than is practical to display, even if it were public domain.


Technically, only the location of each targeted SNP is tested, so the vast majority of base pairs are not tested. The question is whether there are any larger-than-average gaps between the SNPs tested by a particular company, or more specifically in this case, whether any of the testing companies look at SNPs within the centromere, and if not, whether they consider the two segments adjacent to each end of the centromere to be a single segment (including the length of the centromere).

This ISOGG wiki page contains some information about the locations of the centromeres and links to some Family Tree DNA documentation.

The Family Tree DNA chromosome browser also highlights “SNP poor regions not tested for Family Finder”, most notably the short ends of chromosomes 13, 14, 15, 21, and 22 but also a few other spots.

You can also look at your own raw data file(s) to see the locations of all of the SNPs that were tested.

Hmm, I was not aware of this. I do see these grey areas in the FTDNA chromosome browser. Is there a list of the ‘untested’ ranges for each of the vendors that provide DNA segment match info? I know I used both FTdna and GEDmatch match info. I am not sure if there are other providers for the shared segment details (Ancestry does not provide this info). If I assume that they are similar to the list provided initially (see below), then my personal DNA segment map is

Untested DNA ranges provided with the original question:

13,1,19020094,0,0
14,1,19067948,0,0
15,1,20004965,0,0
21,1,9922017,0,0
22,1,16055121,0,0

But I have overlaps with these ‘untested’ regions. So I am suspicious of these ranges.
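A check like the one described here, testing whether a shared segment overlaps one of the listed ranges, might look roughly like this. The ranges are the five lines quoted above, read as chromosome, start, stop; the function name and the sample segment are made up for illustration.

```python
# 'Untested' ranges from the list above: chromosome -> (start, stop).
UNTESTED = {
    "13": (1, 19020094),
    "14": (1, 19067948),
    "15": (1, 20004965),
    "21": (1, 9922017),
    "22": (1, 16055121),
}


def overlaps_untested(chromosome, start, stop):
    """True if the segment [start, stop] touches a listed untested range."""
    region = UNTESTED.get(str(chromosome))
    if region is None:
        return False  # no range listed for this chromosome
    region_start, region_stop = region
    # Two closed intervals overlap when each starts before the other ends.
    return start <= region_stop and stop >= region_start


# Hypothetical shared segment on chromosome 13.
print(overlaps_untested(13, 18000000, 25000000))
```

Note that such a comparison is only meaningful if the segment positions and the listed ranges use the same genome build.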

This is what the Family Tree DNA help page says:

“On the Chromosome View, the striped, gray segments signify that the region is “SNP Poor.” That is, the microarray chip did not include enough data points in this area to make a scientifically sound judgment about whether or not the match’s segment is Identical By Descent (IBD). When segments that cross these areas are evaluated for IBD status, they are treated as two separate segments, one on each side of the SNP Poor area.”

It doesn’t give the specific locations, but again, you could infer that by looking at the locations in your raw data (for example, find the lowest reported positions in each of those chromosomes).
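The splitting behavior described in the Family Tree DNA quote (a segment crossing a SNP Poor area is treated as two segments, one on each side) can be illustrated with a small sketch. This is my own reading of that sentence, not vendor code; the function name and the sample positions are hypothetical.

```python
def split_around_region(start, stop, region_start, region_stop):
    """Split segment [start, stop] around a SNP-poor region.

    Returns a list of (start, stop) pieces lying outside the region.
    A segment that does not touch the region is returned unchanged.
    """
    if stop < region_start or start > region_stop:
        return [(start, stop)]
    pieces = []
    if start < region_start:
        pieces.append((start, region_start - 1))  # part before the region
    if stop > region_stop:
        pieces.append((region_stop + 1, stop))    # part after the region
    return pieces


# Hypothetical segment crossing a hypothetical SNP-poor region.
print(split_around_region(100, 900, 400, 500))
```

A segment lying entirely inside the region yields an empty list, which matches the idea that nothing in that area can be judged.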

I don’t know specifics about other vendors, but generally each one has its own set of SNPs that it tests, and there is different overlap among them. (GEDmatch, by the way, just does matching, not testing.)

One reason you might see overlap is if some of your start and end position values are “Build 36” and others are “Build 37”. The ISOGG wiki says:

“23andMe and Ancestry DNA use Build 37. Family Tree DNA use Build 37 for matching but Build 36 for segment boundaries in the Chromosome Browser. Raw data files are provided in both formats. Build 37 filled in quite a few gaps, and the number of base pairs in each of the chromosomes was longer in Build 37 as compared to Build 36. Consequently the cM totals per chromosome are lower for Family Finder than they are for 23andMe. GedMatch Classic used Build 36, and converted AncestryDNA and 23andMe data from Build 37 to Build 36 for backward compatibility. The new GEDmatch (formerly known as GedMatch Genesis) uses Build 37 but the one-to-one tool offers an option to display segment boundaries in Build 36, Build 37, or Build 38.”


A few suggestions:

  • In the tooltip, include the start and end positions of the segment. Yes, it would make the tooltip larger, but that would be better than having to go all the way back to the note to find the information.

  • In the note attached to the association, allow an additional optional column where I might put, for example, the names of the suspected common ancestral couple. Show this in the tooltip as well.

  • For the X chromosome, if the selected person is male, don’t show the paternal half, or else grey it out somehow.

Alternatively, make the paternal half the correct length (i.e., the length of the Y chromosome) and label it as Y, while labeling the bottom half as X.

I created a bug report for these suggestions: 0012712: DNA Segment Map suggestions - Gramps - Bugtracker – Free Genealogy Software

Given someone wanted to add a M/P flag for each line and someone else wanted a comment:

Proposal: the input format keeps its five required fields (Chromosome Number, Start, Stop, cM, SNPs) and gains an optional 6th field. If the new field is “M” or “P”, it will override the tree-determined side of the chromosome. Otherwise the field is treated as a comment and shown in the tooltip.

Should the override only occur if there is no genetic path between the people (and the tool would otherwise paint both M and P sides)? Or should it override the calculated relationship?
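To make the proposal concrete, here is a minimal parsing sketch. It assumes comma-separated lines like the untested-range list earlier in this thread and is not the gramplet's actual parser; the function name and the dictionary keys are invented for illustration.

```python
def parse_segment_line(line):
    """Parse 'chromosome,start,stop,cM,SNPs[,sixth]' into a dict.

    The optional sixth field is read as a side override when it is
    exactly 'M' or 'P'; anything else is kept as a tooltip comment.
    """
    fields = [field.strip() for field in line.split(",")]
    if len(fields) < 5:
        raise ValueError("expected at least 5 fields: " + line)
    segment = {
        "chromosome": fields[0],  # kept as text so 'X' also works
        "start": int(fields[1]),
        "stop": int(fields[2]),
        "cm": float(fields[3]),
        "snps": int(fields[4]),
        "side_override": None,
        "comment": "",
    }
    if len(fields) > 5 and fields[5]:
        if fields[5] in ("M", "P"):
            segment["side_override"] = fields[5]
        else:
            segment["comment"] = fields[5]
    return segment


# Hypothetical input lines.
print(parse_segment_line("7,1000000,2500000,12.5,1800,P"))
print(parse_segment_line("7,1000000,2500000,12.5,1800,Smith couple"))
```

A comment containing a comma would be cut off at the comma in this sketch, and any fields after the sixth are ignored.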

I can think of a case where it might be good to override the calculated relationship. Although my parents were unrelated (in genealogical time, at least), one of my dad’s distant cousins married one of my mom’s distant cousins. If any of the children of that marriage happened to show up among my DNA matches, I would need to triangulate with other matching cousins to figure out whether they share my paternal or maternal DNA (or both, if we have multiple matching segments), and possibly override whichever relationship path the gramplet chose.

The GEDmatch tool now adds 3 additional fields for the chromosome match. If we add a M/P/U (Maternal/Paternal/Unknown) flag or a comment, then anyone using cut/paste from GEDmatch will now have to delete the trailing 3 fields in each line before adding the new field. Or the first of these fields will be interpreted as a comment.

Adding U as an option for the case where there is a genetic path on both sides. The tool currently picks the closest relationship, which may not be correct.

To be complete, there should also be an option B for both. GEDmatch distinguishes full matches from half matches. Full matches are both maternal and paternal and commonly occur between siblings.

But siblings have half-matches as well; for example, while they may share the same maternal grandparent DNA in one segment, they may have different paternal grandparent data in that same segment. Half matches between siblings may therefore be Unknown, since it may or may not be yet known whether it is a maternal or paternal match.

When I compare with my brother, for example, I have been able to determine whether some of the half matches are maternal or paternal (based on triangulation with cousins), but some are still unknown at this point.

Sorry, I was not clear. For U, both the maternal and paternal portions are painted. I think this is what you mean by B. I used U to specify it was not determined which side.

I am updating the algo for detecting which side. If you compare your sibling, it will be painted on both sides now (currently paints maternal side).