@Nick-Hall where should we file an enhancement feature request to support number format conversion in the Gramps variant of glocale? Or maybe I should ask how to write it to point out where the modification needs to occur?
I suggest not bothering with Gramps’ locale management at all. The reason GEDmatch displays the table like this is likely my browser’s locale, but we don’t know whether the browser uses the same locale as Gramps; in fact, we don’t even know that the data was copied from a browser on the same system that Gramps is running on.
I suggest taking a more pragmatic approach. We know which columns are floats and which are integers, and we also know that the floats are always less than 1000. So we can just drop every non-[0-9] character in the integer columns and convert any , (there will be at most one) to . in the float columns.
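A minimal sketch of that column-based idea, assuming the five-column layout of the sample rows in this thread (chromosome, start, end, cM, SNPs). The function name `parse_segment_row` and the column indices are illustrative assumptions, not the Gramplet's actual code:

```python
import re

# Assumed layout: chromosome, start, end, cM, SNPs (tab-separated).
INT_COLUMNS = (1, 2, 4)   # start, end, SNPs
FLOAT_COLUMNS = (3,)      # cM; below 1000, so it never carries a thousands separator


def parse_segment_row(line):
    """Parse one pasted row without consulting any locale."""
    fields = line.rstrip("\n").split("\t")
    parsed = list(fields)  # the chromosome stays a string (it could be "X")
    for i in INT_COLUMNS:
        # Drop everything that is not a digit (thousands separators, spaces).
        parsed[i] = int(re.sub(r"[^0-9]", "", fields[i]))
    for i in FLOAT_COLUMNS:
        # At most one comma can occur here, and it must be the radix.
        parsed[i] = float(fields[i].replace(",", "."))
    return parsed
```

Because each column is handled by its type, no guess about which character is the thousands separator is needed.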
Agreed - I need to determine thousand separator and radix independent of locale.
So try the following code:
    if re.search('\t',line) != None:
        field = line.split('\t')
        line_wo_thousand = line
        # Thousands separator: look at Start Pos (taken here to be field[1]).
        if re.search(',',field[1]) != None:
            line_wo_thousand = re.sub(',','',line)
        elif re.search('.',field[1]) != None:
            line_wo_thousand = re.sub('\.','',line)
        if line_wo_thousand != line:
            line = line_wo_thousand
        # Radix: look at the column that carries it (taken here to be field[3]).
        if re.search(',',field[3]) != None:
            line2 = re.sub(',','.',line)
        else:
            line2 = line
        line = re.sub('\t',',',line2)
    field = line.split(',')
The top and bottom lines exist in the original code; the logic in between is completely replaced. Check the Start Pos for the thousands separator and the SNPs for the radix. Once the line is normalized to no thousands separator and a period for the radix, replace the tab character with a comma and then process the line.
No, the point is that the Gramplet needs to interpret copy/pasted content from a web site, and we don’t know which locale the web site displays the numbers in. For instance, if I’m opening the English web site from my browser with a non-English locale, it depends on the web site how it chooses to show me the numbers. So it’s better to parse it in a locale independent way.
@GaryGriffin: I think your code looks good! In re.search, doesn’t the dot also need to be escaped?
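To illustrate why the escaping matters, here is a small sketch (the sample value is made up): an unescaped `.` in a regular expression matches any character, so the test succeeds even when the string contains no period at all.

```python
import re

value = "54751900"  # hypothetical field with no thousands separator

# Unescaped: "." is a wildcard, so this matches the first character.
print(re.search(".", value) is not None)    # True
# Escaped: only a literal period matches, and there is none here.
print(re.search(r"\.", value) is not None)  # False
# A plain substring test avoids the question entirely.
print("." in value)                         # False
```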
Here is a slightly more concise version which I think has the same effect:
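    if "\t" in line:
        field = line.split("\t")
        # field[1] is taken here to be Start Pos, the column checked for a thousands separator.
        if "," in field[1]:
            line = line.replace(",", "")
        elif "." in field[1]:
            line = line.replace(".", "")
        # Any comma still left in the line must be the radix.
        line = line.replace(",", ".")
        line = line.replace("\t", ",")
    field = line.split(",")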
Then explain to me how you plan to show me the decimal delimiter as a comma after you have “imported” the values into Gramps, because I need all my results usable in MY language, not in English, and I bet most Europeans need that too when they use them in books and articles (or maybe export them again for use in other software).
So why not use a separator that is commonly used? Most systems use CSV if they provide export of DNA results… By using commas, you will also drop one step for anything else that the gramplet might be doing in the future… e.g., reading a CSV file from any of the test providers directly.
The only two formats I have ever seen used in any software that imports/exports DNA results are CSV and TSV. I don’t think I have ever used software that actually uses DSV as an export format… but maybe it should have been used, since it supports any delimiter you like, including pipe, middle dot, etc.
Why use delimiters that are not commonly used in other software or “standards”…
(I am not talking about locales, but digitally used “standards”).
No, it’s feedback to Gramps or, more accurately, for this function/gramplet specifically. I don’t give feedback on tools I am not using or don’t want to use…
I did want to use this gramplet, but…
I really do not understand why it is so difficult to use “standards” or logic, interoperability and interchangeability standards that are already well established… or to actually use tools and libraries that already do a great job for a given task. It just amazes me how reluctant, in general, some developers are to utilize the work of other open source and open data libraries and standards… it’s like it is very important to try to reinvent the wheel all over again, and again, and again…
But maybe that’s why I am moving more and more of my research over to software solutions that actually use Open Data and Open Standards and as much as possible store data in plain text files, so that the data can be reused in different software without the need of storing it multiple times in multiple “formats”, and that utilize commonly used non-lossy interchangeable or interoperability formats…
Most likely it is just because I am lazy or because my head no longer manages to process programming languages and I just want to do research rather than start learning Python, C#, C++, Perl, R, Julia etc. etc.
I will not interfere with my opinions anymore, sorry that I spoke up.
Hmmm. Making the input parsing ‘locale’ agnostic (instead of adaptive to the OS locale) raises a curious issue. If you’re predicting which parsing rule to use by column order, what happens with an RTL (right-to-left language) source for a DNA segment data table?
Let me take a step back and explain the history of this gramplet. Initially it was written to accept CSV input. After realizing that most of the sites that provide the DNA shared segment info were generating TSV (some of which used a thousands separator), I added support for TSV, assuming an optional thousands separator of ‘,’ and a radix (decimal) character of ‘.’. This was to reduce the data entry to a cut-paste operation.
Sites like GEDmatch provided data in TSV with a thousands separator. When it was pointed out that the German version of GEDmatch used a different thousands separator (and radix), I updated the gramplet to work with either a ‘,’ or a ‘.’ as thousands separator.
Note that the user cuts from a program like GEDmatch and pastes to an Association Note. Where I made the change was how I processed the Note to extract the data. The Note is in whatever language the user wishes.
So with this change, the following lines in the Association Note are interpreted exactly the same. Any of the 4 formats can be used interchangeably.
1 54751900 83468985 31.8 1451 (TSV with no thousands )
1 54,751,900 83,468,985 31.8 1,451 (TSV with thousands ',')
1 54.751.900 83.468.985 31,8 1.451 (TSV with thousands '.')
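The equivalence of these rows can be checked with a short sketch of the separator-detection heuristic discussed in this thread. The function name `normalize_row` and the `probe_column` parameter are hypothetical, and the rows are written with explicit tab characters on the assumption that the pasted data is tab-separated:

```python
def normalize_row(line, probe_column=1):
    """Return the row as comma-separated text with '.' as the radix.

    probe_column is the (assumed) index of an integer column large enough
    to show a thousands separator, such as the start position.
    """
    if "\t" not in line:
        return line  # already comma-separated; nothing to normalize
    probe = line.split("\t")[probe_column]
    if "," in probe:
        line = line.replace(",", "")   # ',' was the thousands separator
    elif "." in probe:
        line = line.replace(".", "")   # '.' was the thousands separator
    line = line.replace(",", ".")      # any comma left is the radix
    return line.replace("\t", ",")


rows = [
    "1\t54751900\t83468985\t31.8\t1451",
    "1\t54,751,900\t83,468,985\t31.8\t1,451",
    "1\t54.751.900\t83.468.985\t31,8\t1.451",
]
normalized = {normalize_row(r) for r in rows}
print(normalized)  # {'1,54751900,83468985,31.8,1451'}
```

One limitation of this sketch: if the probed value is below 1000 it shows no separator, so a row that uses ‘.’ as thousands separator elsewhere (for example in the SNP count) would be misread.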