German place names with special characters are displaying like the example in the title. Another example is WÃ¼rttemberg. This problem comes and goes. A while back I fixed them (I used copy and paste) and they seemed ok for a while, but when I reopened Gramps yesterday they all showed up this way. I wondered if I should just enter them with Latin characters so this won’t happen again. Are others having this problem?
I have not had problems with copy and paste. Usually this is from Wikipedia. I attach the place’s wiki page to the record.
I also have a list of common symbols/characters that I have put into a note record with the type “To Do”. Basically I created my own Character Map within Gramps. The characters are readily available to copy from the To Do gramplet on my dashboard.
I have never seen characters changing. 5.1.2 on Win10.
The desired character “ä” is Unicode U+00E4 / UTF-8 C3 A4.
The substituted characters “Ã” and “¤” are, respectively, Unicode U+00C3 / UTF-8 C3 83
and Unicode U+00A4 / UTF-8 C2 A4.
It seems that it took the desired character’s UTF-8 value of C3 A4 and created the two characters 00 C3 and 00 A4.
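The byte-level explanation above can be reproduced in a few lines of Python. This is only an illustrative sketch: the helper name `misread_utf8` is made up for this example, and Windows-1252 is assumed as the wrong encoding (Latin-1 gives the same result for these particular bytes).

```python
def misread_utf8(text, wrong_encoding="cp1252"):
    """Encode text as UTF-8, then decode those bytes with a single-byte encoding.

    This mimics a program that reads a UTF-8 file as if it were a code page.
    """
    return text.encode("utf-8").decode(wrong_encoding)


for word in ("ä", "Württemberg"):
    print(word, "->", misread_utf8(word))
# ä -> Ã¤
# Württemberg -> WÃ¼rttemberg
```

Each byte of the two-byte UTF-8 sequence is turned into its own character, which is the pattern seen in the place names.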
I just looked for one of my German ancestor names and see Jürg is fine, as well as Künzelsau found in his event description. Strange. I’ve been working for a while to create hierarchy and clean up my places. That’s when I found this the first time, and at first I wondered if it had something to do with exporting as csv and then re-importing a while back. I had everything fixed and was just going through to see if I had any duplicates and saw this again. I’ll fix it again and keep an eye on it.
It is entirely possible that the CSV export/import caused an issue. Gramps (on Windows) exports CSV encoded in the utf-8 character set with a BOM (Byte Order Mark) to provide more positive identification of the encoding, and it will import this again correctly as long as it remains encoded the same way.
But if the CSV file was edited with some other application that exported with a different encoding or left out the BOM, Gramps would try to import with the Windows system default encoding, which is usually one of the code pages, since that is the closest thing Windows apps have to a standard way of encoding text. As long as the exporting app also used the Windows system default encoding and all of the characters used could be encoded in that code page, it would still work. But a single-byte code page can only hold at most 256 different characters, as opposed to utf-8, which can encode every Unicode character.
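The import behaviour described above can be sketched roughly as follows. This is not Gramps’s actual importer code; `decode_csv_bytes` is a hypothetical helper, and Windows-1252 is assumed as the system default code page.

```python
import codecs


def decode_csv_bytes(raw, fallback_encoding="cp1252"):
    """Decode CSV bytes: trust a UTF-8 BOM if present, else use the fallback.

    fallback_encoding stands in for the Windows system default code page
    (an assumption for this sketch).
    """
    if raw.startswith(codecs.BOM_UTF8):
        # BOM found: strip it and decode the rest as UTF-8.
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    # No BOM: the bytes are interpreted in the code page instead.
    return raw.decode(fallback_encoding, errors="replace")


utf8_data = "Bächlingen".encode("utf-8")
print(decode_csv_bytes(codecs.BOM_UTF8 + utf8_data))  # Bächlingen
print(decode_csv_bytes(utf8_data))                    # BÃ¤chlingen
```

The second call shows what happens when another application saves the file as utf-8 but drops the BOM.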
I realize this is all a bit technical, but it is intended to show that us developers (on Windows) cannot always guarantee getting this stuff right, as there is no completely ‘right’ way to do it.
I’ve occasionally had to use Notepad++ to edit a text file and change its encoding to deal with this issue; it has the ability to interpret the text in a variety of encodings and to save in other ones. But you have to inspect the file yourself and make sure text with non-ASCII characters is showing up correctly.
Thanks Paul. I think you’re right. I just opened the csv in Notepad++ to look at the encoding and it shows UTF-8, not with BOM. Another strange thing… I opened the file with OpenLibre to see if there was someplace to specify the encoding. I did a search for Bächlingen and it didn’t find it with the version pasted from a Google search. It is there though, with the correct umlaut mark. It just didn’t recognize the Google version. Strange. So, the solution to my problem is to make sure to use Notepad++ and set the encoding to UTF-8-BOM, save, then import.
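For anyone who would rather script the Notepad++ step, here is a minimal sketch that re-saves a utf-8 file with a BOM. The function names and the file name `places.csv` are placeholders, and the sketch assumes the file really is valid utf-8 (it raises an error otherwise instead of guessing).

```python
import codecs
from pathlib import Path


def with_utf8_bom(raw):
    """Return the same UTF-8 data with exactly one BOM at the start."""
    # "utf-8-sig" drops a leading BOM if there is one and fails on invalid UTF-8.
    text = raw.decode("utf-8-sig")
    return codecs.BOM_UTF8 + text.encode("utf-8")


def add_bom_to_file(path):
    """Rewrite the file at path in place as UTF-8 with BOM."""
    file_path = Path(path)
    file_path.write_bytes(with_utf8_bom(file_path.read_bytes()))


# Example (placeholder file name):
# add_bom_to_file("places.csv")
```

Working on bytes rather than text mode leaves the line endings in the file untouched.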
This ties into GEDCOM export tests I’m running with @GeorgeWilmes suggestion of visualizing with yEd.
yEd reported an invalid line during the import of a 26MB GEDCOM, which caused the import to abort.
When I checked that file with the Notepad++ text editor, I found a NEL (Next Line control, 0x85) in 5 different spots in the CONTinuation lines of 2 Notes. It looks like they were originally some punctuation mark.
I went back to those Notes in my original Gramps tree and found non-printable characters in the same spots. (Fixed the GEDCOM & the original Notes.)
Both notes had been pasted into the Note Editor years ago. They’ve been through loads of cycles of repairs & export to XML into fresh trees. So I have no idea where the Next Line symbols were swapped in.
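A quick way to hunt for stray control characters like these in exported text is sketched below. The helper `find_control_chars` is hypothetical and not part of Gramps. As one possible origin (unconfirmed for these notes): in Windows-1252 the byte 0x85 is the horizontal ellipsis “…”, while the same value read as Latin-1 or as a Unicode code point is the NEL control character.

```python
import unicodedata


def find_control_chars(text, allowed="\n\r\t"):
    """Return (index, 'U+XXXX') pairs for control characters outside `allowed`."""
    hits = []
    for index, ch in enumerate(text):
        # Category "Cc" covers the C0 and C1 control ranges, including U+0085.
        if unicodedata.category(ch) == "Cc" and ch not in allowed:
            hits.append((index, f"U+{ord(ch):04X}"))
    return hits


note = "He emigrated in 1848\u0085 and settled in Ohio."
print(find_control_chars(note))  # [(20, 'U+0085')]

# The same byte, interpreted two ways:
print(repr(b"\x85".decode("cp1252")))   # '…'
print(repr(b"\x85".decode("latin-1")))  # '\x85'
```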
[Bug report 11850 filed on NextLine. The splitting of a 2-byte UTF-8 character into two separate characters has not been isolated enough to write up… yet.]
Connie, some background questions?
Is it correct that you are running Gramps 5.1.2 on Windows?
Are you running Gramps in German or English? (some other language?)
Did your original issue have anything to do with importing or exporting?
Since this is ONLY occurring in Placenames, are you using any tools to harmonize your Place data?
I’ve had Württemberg in my place Tree for years without a decomposing umlaut issue in the Windoze version. It’s been through several XML import rebuild cycles.
Lordemannd and emyoulation, yes. 5.1.2 on Windows 10, using English. I mentioned above that I had exported as csv a while back and suspect that is the problem. It doesn’t import the same as it was exported. The fix is opening the csv file in Notepad++, setting the format to UTF-8 with BOM, saving and then importing the csv back into Gramps.
That’s a good workaround. However, it might not be a good solution for the average user.
If the program cannot have confidence in the UTF byte order mark because the CSV file might have been edited, perhaps some sort of preflight verification feature could be added?
Perhaps the importer could find a UTF example in the CSV and ask the user which BOM interpretation is correct?
Excel does something similar for its text import where the user can correct for CSVs that are actually tab or positionally separated values. Or where the existence of a header (field name) label row is indeterminate.
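A preflight check along those lines might look something like this sketch. It is only an illustration of the idea, not an existing Gramps feature; the function names and the list of candidate encodings are assumptions.

```python
def first_non_ascii_line(raw):
    """Return the first line of the raw bytes containing a byte above 0x7F."""
    for line in raw.splitlines():
        if any(byte > 0x7F for byte in line):
            return line
    return None


def preview_decodings(raw, candidates=("utf-8", "cp1252")):
    """Decode one sample line under each candidate encoding.

    A value of None means the sample is not valid in that encoding.
    The user would then pick the preview that looks right.
    """
    sample = first_non_ascii_line(raw)
    previews = {}
    if sample is None:
        return previews  # pure ASCII: every candidate gives the same text
    for encoding in candidates:
        try:
            previews[encoding] = sample.decode(encoding)
        except UnicodeDecodeError:
            previews[encoding] = None
    return previews


raw = "Place,Name\nP0001,Württemberg\n".encode("utf-8")
for encoding, preview in preview_decodings(raw).items():
    print(encoding, "->", preview)
# utf-8 -> P0001,Württemberg
# cp1252 -> P0001,WÃ¼rttemberg
```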
I’ve used Open Refine and Open Office or Libre Office to work with the data for places. I just exported a new csv file to see what options are given for opening it with Libre and there is UTF-8 but no BOM option.
The utf-8 BOM was developed to provide a solid way to identify utf-8 encoding in files. As far as I know, it is not mandatory, so files exist encoded in utf-8 both with and without the BOM. In my experience, programs that can use a utf-8 encoded file will accept files with or without the BOM. It is more likely that the program will automatically recognize the proper utf-8 encoding if it has the BOM.
I looked through the csv file and see that it isn’t only the placenames that are affected. One person’s name that had a special character and some, but not all, of the coordinates are also gibberish. Examples: 51Â°20’42"N or even worse, 42Â°32â€²08â€³N and 41Ã‚Â°22Ã¢â‚¬Â²51Ã¢â‚¬Â³N. At first I thought it was only the entries where I used °, ’ and " rather than decimal form, but that isn’t the case, and some with degrees are ok. So now I’m really confused, because the ones that are fine aren’t only ones I entered after working with the data in OpenRefine. Perhaps it made a difference if the coordinates were entered separately or in the box below there. I’ve done it both ways depending on where I got the data.
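The second and third examples look like the same coordinate style damaged once and twice respectively, i.e. utf-8 bytes read as Windows-1252 on one or two import cycles. Under that assumption, the damage can often be reversed by running the mistake backwards. This is a rough sketch with a hypothetical helper name, not a Gramps tool, and it should be tried on a copy of the data first.

```python
def undo_mojibake(text, max_rounds=3):
    """Reverse 'UTF-8 read as cp1252' damage, repeating for multi-pass damage.

    Text that is already clean, or that cannot be reversed cleanly,
    is returned unchanged from the last good round.
    """
    for _ in range(max_rounds):
        try:
            repaired = text.encode("cp1252").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # not (or no longer) this kind of damage
        if repaired == text:
            break  # plain ASCII: nothing to do
        text = repaired
    return text


once = "42°32′08″N".encode("utf-8").decode("cp1252")
twice = once.encode("utf-8").decode("cp1252")
print(undo_mojibake(once))           # 42°32′08″N
print(undo_mojibake(twice))          # 42°32′08″N
print(undo_mojibake("Württemberg"))  # Württemberg
```

A value that mixes damaged and undamaged non-ASCII characters in one string will not round-trip and is left as it is, so those still need fixing by hand.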
Could you check a few of those problematic GPS coordinates a bit more deeply in Gramps?
It’s odd that you have a mix of straight & curly quotes. (Although the wiki uses the same mix.)
I filed a bug report a while back about how mixed formats for coordinates were causing some problems in Gramps reports.
I wonder if we need a mass validator that harmonizes all the coordinates to a preferred supported coordinate format?
It seems like we would want to select the highest precision format for internal storage but allow display/reports/export to convert on-the-fly to any of the supported formats. That would eliminate double-conversion rounding errors and require less validation for every use.
Personally, I find a decimal format easiest for data entry (provided that coordinate pairs can be 10-keyed… the comma slows me down since it requires a right-hand repositioning on the keyboard), but displaying in deg (degrees, minutes, seconds; e.g., N50°52’21.92" , E124°52’21.92" ) notation is faster for comprehension & manual map lookup. But the keyboard code for the degree symbol is a pain to remember.
(I belatedly discovered that 50°52’21.92"N , 124°52’21.92"E isn’t nearly as efficient when using a globe… you have 2 equal hemisphere possibilities until the last glyph of each coordinate.)
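To illustrate what harmonizing to one format could involve, here is a small sketch that converts a degrees/minutes/seconds string to decimal degrees while tolerating straight and curly quote marks and a hemisphere letter at either end. It is not Gramps’s own coordinate conversion code; the pattern and the function name `dms_to_decimal` are assumptions made for this example.

```python
import re

# Accepts e.g.  N50°52'21.92"  or  50°52’21.92"N  (straight or curly marks).
_DMS = re.compile(
    r"""^\s*(?P<pre>[NSEW])?\s*
        (?P<deg>\d+(?:\.\d+)?)\s*[°º]\s*
        (?:(?P<min>\d+(?:\.\d+)?)\s*['’′]\s*)?
        (?:(?P<sec>\d+(?:\.\d+)?)\s*["”″]\s*)?
        (?P<post>[NSEW])?\s*$""",
    re.VERBOSE | re.IGNORECASE,
)


def dms_to_decimal(text):
    """Convert one DMS coordinate to signed decimal degrees (S and W negative)."""
    match = _DMS.match(text)
    if match is None:
        raise ValueError(f"unrecognized coordinate: {text!r}")
    value = (
        float(match["deg"])
        + float(match["min"] or 0) / 60
        + float(match["sec"] or 0) / 3600
    )
    hemisphere = (match["pre"] or match["post"] or "").upper()
    return -value if hemisphere in ("S", "W") else value


print(round(dms_to_decimal('N50°52’21.92"'), 6))
print(round(dms_to_decimal("124°52'21.92\"W"), 6))
```

Storing the decimal value and formatting it only for display would avoid converting back and forth between notations.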
What do you mean by checking a bit more deeply in Gramps? I might know how the mix of formats happened. When I was using OpenRefine to create hierarchy I worked on a bunch, opened the csv file in OO Calc, then tried to remove all that I hadn’t worked on, importing that smaller file. That would explain some coordinates being correct. It seems that all special characters and letters import back in wrong. Now, thanks to Jaran, I am using the Place Update tool.
ETA: I may have misunderstood you. If you are only referring to the mix of straight and curly quotes that could also come from me hand typing some, copy and paste others. I use the degree symbol a lot so know the code. alt 0176 creates °.
I’m suggesting that you could note a number of Places (names or IDs) with oddities in the CSV. Then, you could inspect the coordinates in the Gramps Place Editor.
See if the internal validator shades any in red… indicating unrecognizable data.
I like the PlaceCleanup gramplet too. But I wish it remembered what I want to keep, and had an option to bring in JUST the Lat/Long. (Harmonizing IDs, enclosing MCDs, Postal codes, etc. with GeoNames is working against my objectives right now.)
The csv I now have won’t do any good for showing what was there before, because it is exporting the oddities that were imported from a previous csv file. Sorry. But yes, those bad coordinates are shaded in red and show an error in format in the place editor.
BOM is not a UTF-8 thing, it is more of a UTF-16 thing. Windows saves text files in 16-bit form probably because at one time Unicode was going to be a 16 bit encoding. it turned out to be a 31 bit encoding placed in a 32 bit unit. the BOM was part of the 16 bit way because 8 bit computers might store the bytes on 8 bit media in the opposite order than another computer. for example the PC (Intel x86) is opposite Sun Sparc and IBM mainframe (i did assembly level programming on all 3 so i had to know). to deal with the “byte flip” problem, Unicode established the BOM to be sure the exchange of Unicode text could be done without garbling the whole thing. if the BOM were actually needed, all the bytes would be flipped in pairs without it. it is not needed if the text stays on the same system. some other kind of processing had to do that, but what BOM would normally do is not what was seen. misapplied BOM would give far worse results. this sounds like some program reading 16 bit text as if it is ASCII only and mishandling a character code above what it expected (either above 127 or above 256). but it definitely seems like something else processed it and handled 16 bit character incorrectly (possibly an old DOS program).
i don’t use Windows so i don’t run into this.
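The byte-order point can be checked directly in Python. The sketch below uses only standard-library constants and codecs, with “ä” as the sample character; the helper name `bom_report` is made up for this example. It shows the three BOM byte sequences and what a genuinely misapplied UTF-16 byte order does to “ä”, which is quite different from the Ã¤ pattern reported earlier in the thread.

```python
import codecs


def bom_report(ch="ä"):
    """Collect BOM bytes and the UTF-16 encodings of one character as hex strings."""
    return {
        "utf8_bom": codecs.BOM_UTF8.hex(" "),
        "utf16_le_bom": codecs.BOM_UTF16_LE.hex(" "),
        "utf16_be_bom": codecs.BOM_UTF16_BE.hex(" "),
        "char_utf16_le": ch.encode("utf-16-le").hex(" "),
        "char_utf16_be": ch.encode("utf-16-be").hex(" "),
    }


for name, value in bom_report().items():
    print(f"{name}: {value}")
# utf8_bom: ef bb bf
# utf16_le_bom: ff fe
# utf16_be_bom: fe ff
# char_utf16_le: e4 00
# char_utf16_be: 00 e4

# Little-endian bytes read as big-endian give an unrelated code point:
swapped = "ä".encode("utf-16-le").decode("utf-16-be")
print(f"U+{ord(swapped):04X}")  # U+E400
```

In utf-8 the three BOM bytes carry no byte-order information; they act only as a signature that marks the file as utf-8.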
Phil, unicode.org says,
“The first version of Unicode was a 16-bit encoding, from 1991 to 1995, but starting with Unicode 2.0 (July, 1996), it has not been a 16-bit encoding. The Unicode Standard encodes characters in the range U+0000…U+10FFFF, which amounts to a 21-bit code space. Depending on the encoding form you choose (UTF-8, UTF-16, or UTF-32), each character will then be represented either as a sequence of one to four 8-bit bytes, one or two 16-bit code units, or a single 32-bit code unit.”
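The “one to four 8-bit bytes, one or two 16-bit code units, or a single 32-bit code unit” part of that quote can be seen with a short sketch. The helper name `code_unit_counts` and the three sample characters are just choices made for this illustration.

```python
def code_unit_counts(ch):
    """Count the code units one character needs in each Unicode encoding form."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "utf8_bytes": len(ch.encode("utf-8")),
        "utf16_units": len(ch.encode("utf-16-le")) // 2,  # 2 bytes per unit
        "utf32_units": len(ch.encode("utf-32-le")) // 4,  # 4 bytes per unit
    }


# a-umlaut, double prime (seconds mark), and an emoji outside the 16-bit range
for ch in ("\u00e4", "\u2033", "\U0001F600"):
    print(code_unit_counts(ch))
# {'code_point': 'U+00E4', 'utf8_bytes': 2, 'utf16_units': 1, 'utf32_units': 1}
# {'code_point': 'U+2033', 'utf8_bytes': 3, 'utf16_units': 1, 'utf32_units': 1}
# {'code_point': 'U+1F600', 'utf8_bytes': 4, 'utf16_units': 2, 'utf32_units': 1}
```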
That said, I want to mention that nothing I’ve done would involve an old DOS program. I had copied and pasted names with foreign characters from Wikipedia, FamilySearch, maybe a Google search result. If I enter coordinates myself I know the Alt code for degree, so I use Alt+0176. I just use the apostrophe and quote mark on my keyboard for minutes and seconds. If I copy and paste coordinates, most of the time I’ve used Wikipedia, geohack.toolforge.org (which is where the Wikipedia coordinate link leads to), or werelate.org. They all use UTF-8 characters. Finally, I exported a place csv file, as I said earlier, and manipulated it in OpenRefine and OpenOffice, but now I have LibreOffice.