DNA Segment Map Gramplet & thousands separator

DavidMStraub · September 4, 2023, 4:16pm

Hi,

I recently got my first DNA result and just started using the DNA Segment Map Gramplet - thanks @GaryGriffin for that!

One thing I noticed is that copying & pasting from Gedmatch requires me to remove the thousands separator dots, otherwise the overlaps are not displayed.

This is what it looks like on Gedmatch:

emyoulation · September 4, 2023, 5:11pm

Is there a conversion tool in Gramps to handle the difference in thousands and decimal separators in different languages?

| decimal radix | thousands separator | language |
|---|---|---|
| `.` period | `,` comma | English |
| `,` comma | `.` period | German |

GaryGriffin · September 4, 2023, 5:23pm

Ahh - using non-English separators. The code doesn’t work with this. Let me work on fixing it.

Currently the code is:

        if re.search('\t',line) != None:
            line2 = re.sub(',','',line)
            line = re.sub('\t',',',line2)

If you need a quick fix until I get a proper one in, you can replace the middle line so that periods are removed and commas become periods:

        if re.search('\t',line) != None:
            # the period must be escaped: a bare '.' matches any character
            line2 = re.sub(r'\.','',line)
            line3 = re.sub(',','.',line2)
            line = re.sub('\t',',',line3)

I will need to check locale to make this work properly.

GaryGriffin · September 4, 2023, 7:36pm

Created Mantis issue 0013012: DNA Segment Map fails if locale uses non comma as thousands separator

GaryGriffin · September 4, 2023, 8:07pm

I have not found a method to do this yet.

locale.localeconv()['decimal_point']
locale.localeconv()['thousands_sep']
-or-
locale.RADIXCHAR
locale.THOUSEP

does it, but Gramps uses a Gramps-ified locale version, glocale, that does not seem to include this function.
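For reference, a minimal sketch of what the standard library `locale` module (not Gramps’ glocale) exposes. It pins the portable "C" locale only so the values are predictable; `RADIXCHAR` and `THOUSEP` are keys for `locale.nl_langinfo()`, which is not available on every platform, hence the `hasattr` guard.

```python
import locale

# Pin the portable "C" locale so the result is predictable; a real
# application would typically call locale.setlocale(locale.LC_ALL, "").
locale.setlocale(locale.LC_NUMERIC, "C")

conv = locale.localeconv()
decimal_point = conv["decimal_point"]   # "." in the C locale
thousands_sep = conv["thousands_sep"]   # "" in the C locale (no grouping)

# RADIXCHAR and THOUSEP are lookup keys for nl_langinfo(), which some
# platforms (for example Windows) do not provide.
if hasattr(locale, "nl_langinfo"):
    radix = locale.nl_langinfo(locale.RADIXCHAR)
else:
    radix = decimal_point
```

Note that this only describes the locale of the machine running the code, which may differ from the locale the pasted table was rendered in.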

emyoulation · September 4, 2023, 8:20pm

@Nick-Hall where should we file an enhancement feature request to support number format conversion in the Gramps variant of glocale? Or maybe I should ask how to write it to point out where the modification needs to occur?

locale.localeconv
locale.RADIXCHAR
locale.THOUSEP

DavidMStraub · September 4, 2023, 8:36pm

Hi,

I suggest not bothering with Gramps’ locale management at all. The reason Gedmatch displays the table like this is likely my browser’s locale, but we don’t know whether the browser uses the same locale as Gramps; in fact we don’t even know whether the data was copied from a browser on the same system that Gramps is running on.

I suggest taking a more pragmatic approach. We know which columns are floats and which columns are integers, and we also know that the floats are always less than 1000. So we can just drop all non-[0-9] characters in the integer columns and convert every `,` (there will be at most one) to `.` in the float columns.
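A minimal sketch of that per-column idea. The function name `parse_segment_row` and the column order (chromosome, start, end, cM, SNPs) are assumptions for illustration, not the gramplet’s actual code.

```python
import re

# Assumed column order: chromosome, start, end, cM, SNPs.
INT_COLUMNS = (1, 2, 4)   # start position, end position, SNP count
FLOAT_COLUMNS = (3,)      # shared cM, assumed to be below 1000


def parse_segment_row(line):
    """Parse one tab-separated row without consulting any locale."""
    fields = line.strip().split("\t")
    for i in INT_COLUMNS:
        # Integers: drop everything that is not a digit.
        fields[i] = re.sub(r"[^0-9]", "", fields[i])
    for i in FLOAT_COLUMNS:
        # Floats below 1000 carry no thousands separator, so a comma
        # can only be the decimal mark.
        fields[i] = fields[i].replace(",", ".")
    return (fields[0], int(fields[1]), int(fields[2]),
            float(fields[3]), int(fields[4]))


german = parse_segment_row("1\t54.751.900\t83.468.985\t31,8\t1.451")
english = parse_segment_row("1\t54,751,900\t83,468,985\t31.8\t1,451")
```

Because each column is cleaned according to its type, the function never has to guess which character is the thousands separator.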

GaryGriffin · September 4, 2023, 11:58pm

Agreed - I need to determine thousand separator and radix independent of locale.

So try the following code:

        if re.search('\t',line) != None:
            field = line.split('\t')
            line_wo_thousand = line
            if re.search(',',field[1]) != None:
                line_wo_thousand = re.sub(',','',line)
            elif re.search('.',field[1]) != None:
                line_wo_thousand = re.sub('\.','',line)
            if line_wo_thousand != line:
                line = line_wo_thousand
            if re.search(',',field[3]) != None:
                line2 = re.sub(',','.',line)
            else:
                line2 = line
            line = re.sub('\t',',',line2)
        field = line.split(',')

The top and bottom lines exist in the original code. The logic in between is completely replaced. It checks the Start Pos (field[1]) for the thousands separator and the cM value (field[3]) for the radix. Once the line is normalized to no thousands separator and a period for the radix, the tab character is replaced with a comma and the line is processed as before.

StoltHD · September 5, 2023, 1:32am

The decimal delimiter should follow the language the user’s system is configured with, as long as no genealogy standard defines that it should be a “.”.

DavidMStraub · September 5, 2023, 8:13am

No, the point is that the Gramplet needs to interpret copy/pasted content from a web site, and we don’t know which locale the web site displays the numbers in. For instance, if I open the English web site from my browser with a non-English locale, it depends on the web site how it chooses to show me the numbers. So it’s better to parse it in a locale-independent way.

@GaryGriffin: I think your code looks good! In re.search, doesn’t the dot also need to be escaped?

Here is a slightly more concise version which I think has the same effect:

if "\t" in line:
    field = line.split("\t")
    if "," in field[1]:
        line = line.replace(",", "")
    elif "." in field[1]:
        line = line.replace(".", "")
    line = line.replace(",", ".")
    line = line.replace("\t", ",")
field = line.split(",")
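On the escaping question: in a regular expression an unescaped `.` matches any character (except a newline), so it does need escaping when a literal period is meant. A small sketch of the difference, using a sample value from this thread:

```python
import re

value = "54.751.900"

# Unescaped: "." matches any character, so the substitution deletes the
# whole string and the search succeeds on any non-empty input.
unescaped_sub = re.sub(".", "", value)
unescaped_hit = re.search(".", "54751900") is not None

# Escaped: only literal periods are matched.
escaped_sub = re.sub(r"\.", "", value)
escaped_hit = re.search(r"\.", "54751900") is not None

# For fixed strings, plain string methods avoid the issue entirely.
replaced = value.replace(".", "")
contains_dot = "." in "54751900"
```

This is why the string-method version sidesteps the problem: `str.replace` and `in` treat the period literally.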

StoltHD · September 5, 2023, 10:32am

So, because of your opinion I must see and store my results in a language I don’t use?

Isn’t that a very ignorant and arrogant approach?

Or are you going to make a converter that “re-convert” the delimiter to the correct type after the copy/paste/download has been done?

DavidMStraub · September 5, 2023, 10:46am

No, you just didn’t understand my point. What is arrogant is getting impolite, as in your last post.

StoltHD · September 5, 2023, 10:53am

Then explain to me how you plan to show me the decimal delimiter as commas after you have “imported” them to Gramps, because I need all my results usable in MY language, not in English, and I bet most Europeans need that too when they use them in books and articles (or maybe export them again for use in other software).

DavidMStraub · September 5, 2023, 10:56am

Sure, I’m happy to explain.

As described in the wiki:

Create a Note in the Association or attached to a Citation in the Association with the shared DNA segment data.

So you copy and paste the data from the external website (e.g. Gedmatch), in whatever format that website provides to you, into your note. The gramplet is not involved in this step at all.

Then, the Gramplet needs to process that note, and thus needs to handle different number formats.

That has nothing to do whatsoever with my opinion, ignorance, arrogance, or native language.

StoltHD · September 5, 2023, 11:18am

So why not use a separator that is commonly used? Most systems use CSV if they provide export of DNA results… by using commas, you will also drop one step for anything else that the gramplet might be doing in the future… e.g., reading a CSV file from any of the test providers directly.

The only two formats I have ever seen used are CSV and TSV in any software that imports/exports DNA results. I don’t think I have ever used software that actually uses DSV as an export format… but maybe it should have been used, since it supports any delimiter you like, including pipe, middle point etc.

Why using delimiters not commonly used in other software or “standards”…
(I am not talking about locales, but digitally used “standards”).

DavidMStraub · September 5, 2023, 11:32am

Sure, but that’s feedback to Gedmatch, not to Gramps.

StoltHD · September 5, 2023, 12:22pm

No, it’s feedback to Gramps or, more accurately, for this function/gramplet specifically. I don’t give feedback on tools I am not using or don’t want to use…

I did want to use this gramplet, but…

I really do not understand why it is so difficult to use “standards” or logic, interoperability and interchangeability standards that are already well established… or actually use tools and libraries that already do a great job for a given task. It just amazes me how reluctant, in general, some developers are to utilize the work of other open source and open data libraries and standards… it’s like it is very important to try to reinvent the wheel all over again, and again, and again…

But maybe that’s why I am moving more and more of my research over to software solutions that actually use Open Data and Open Standards and as much as possible store data in plain text files, so that the data can be reused in different software without the need of storing it multiple times in multiple “formats”, and that utilize commonly used non-lossy interchangeable or interoperability formats…
Most likely it is just because I am lazy, or because my head no longer manages to process programming languages and I just want to do research rather than start learning Python, C#, C++, Perl, R, Julia etc.

I will not interfere with my opinions anymore, sorry that I spoke up.

emyoulation · September 5, 2023, 12:41pm

Hmmm. Making the input parsing ‘locale’ agnostic (instead of adaptive to the OS locale) raises a curious issue. If you’re predicting which parsing rule to use by column order, what happens with an RTL (right-to-left language) source for a DNA segment data table?

Maybe @avma or @yaron could find a Hebrew example?

yaron · September 5, 2023, 1:10pm

Sure

How about adding an “Auto Detection” combo box to the paste field with all the different locale options in case the user is not happy with the results?

When it comes to number formatting, Hebrew is not very different from British English. Our differences are mostly in the longer-form date/time display, but I assume that’s not relevant here.

GaryGriffin · September 5, 2023, 11:54pm

Let me take a step back and explain the history of this gramplet. Initially it was written to accept CSV input. After realizing that most of the sites that provide the DNA shared segment info were generating TSV (some of which used a thousands separator), I added support for TSV, assuming an optional thousands separator of ‘,’ and a radix (decimal) character of ‘.’. This was to reduce data entry to a cut-and-paste operation.

Sites like GEDmatch provided data in TSV with a thousands separator. When it was pointed out that the German version of GEDmatch used a different thousands separator (and radix), I updated the gramplet to work with either a ‘,’ or a ‘.’ as thousands separator.

Note that the user cuts from a program like GEDmatch and pastes to an Association Note. Where I made the change was how I processed the Note to extract the data. The Note is in whatever language the user wishes.

So with this change, the following lines in the Association Note are interpreted exactly the same. Any of the 4 formats can be used interchangeably.

1,54751900,83468985,31.8,1451 (CSV)
1	54751900	83468985	31.8	1451 (TSV with no thousands )
1	54,751,900	83,468,985	31.8	1,451 (TSV with thousands ',')
1	54.751.900	83.468.985	31,8	1.451 (TSV with thousands '.')

Hopefully that explains the recent change.
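To illustrate the equivalence, here is a sketch that wraps the concise normalization posted earlier in this thread in a hypothetical function `normalize_segment_line` and runs it over the four sample lines (without their trailing labels). It is not necessarily the code that ships in the gramplet.

```python
def normalize_segment_line(line):
    """Return the line as comma-separated values with '.' as the radix."""
    if "\t" in line:
        field = line.split("\t")
        if "," in field[1]:
            line = line.replace(",", "")   # ',' is the thousands separator
        elif "." in field[1]:
            line = line.replace(".", "")   # '.' is the thousands separator
        line = line.replace(",", ".")      # any remaining ',' is the radix
        line = line.replace("\t", ",")
    return line


samples = [
    "1,54751900,83468985,31.8,1451",            # CSV
    "1\t54751900\t83468985\t31.8\t1451",        # TSV, no thousands separator
    "1\t54,751,900\t83,468,985\t31.8\t1,451",   # TSV, thousands ','
    "1\t54.751.900\t83.468.985\t31,8\t1.451",   # TSV, thousands '.'
]
normalized = {normalize_segment_line(s) for s in samples}
```

All four inputs collapse to the single string `1,54751900,83468985,31.8,1451`, which the existing CSV handling can then split on commas.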
