German place names with special characters are showing up the way the title shows. Another example is WÃ¼rttemberg. This problem comes and goes. A while back I fixed them (I used copy and paste) and they seemed ok for a while, but when I reopened Gramps yesterday they all showed up this way again. I wondered if I should just enter them with plain Latin characters so this won’t happen again. Are others having this problem?
I have not had problems with copy and paste. Usually this is from Wikipedia. I attach the place’s wiki page to the record.
I also have a list of common symbols/characters that I have put into a note record with the type “To Do”. Basically I created my own Character Map within Gramps. The characters are readily available to copy from the To Do gramplet on my dashboard.
I have never seen characters changing. 5.1.2 on Win10.
The desired character “ä” is Unicode U+00E4 / UTF-8 C3 A4.
The substituted characters “Ã” and “¤” are, respectively, Unicode U+00C3 / UTF-8 C3 83 and Unicode U+00A4 / UTF-8 C2 A4.
It seems that the desired character’s UTF-8 byte sequence C3 A4 was read one byte at a time, producing the two separate characters U+00C3 and U+00A4.
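That byte-level mix-up can be reproduced in a couple of lines of Python (a sketch of the mechanism, not how Gramps itself processes text):

```python
# "ä" is U+00E4, stored as the two bytes C3 A4 in UTF-8.
raw = "ä".encode("utf-8")
assert raw == b"\xc3\xa4"

# Reading those bytes with a one-byte-per-character encoding
# (Latin-1 here) produces the two substituted characters.
mojibake = raw.decode("latin-1")
print(mojibake)  # Ã¤  (U+00C3 followed by U+00A4)

# No bytes were lost, so the damage is reversible:
assert mojibake.encode("latin-1").decode("utf-8") == "ä"
```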
I just looked for one of my German ancestor names and see Jürg is fine, as well as Künzelsau found in his event description. Strange. I’ve been working for a while to create hierarchy and clean up my places. That’s when I found this the first time, and at first I wondered if it had something to do with exporting as csv and then re-importing a while back. I had everything fixed and was just going through to see if I had any duplicates and saw this again. I’ll fix it again and keep an eye on it.
It is entirely possible that the CSV export/import caused an issue. Gramps (on Windows) exports CSV encoded in the utf-8 character set with a BOM (Byte Order Marker) to provide more positive identification of the encoding. And will import this again correctly as long as it remains encoded the same way.
But if the CSV file was edited with some other application that saved with a different encoding, or left out the BOM, Gramps would try to import it using the Windows system default encoding, which is usually one of the code pages, since that is the closest thing Windows apps have to a standard way of encoding text. As long as the editing app also used the Windows system default encoding, and every character used could be encoded in that code page, it would still work. But a single-byte code page can only hold a total of 256 different characters, as opposed to utf-8, which can encode virtually all characters.
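That failure mode is easy to demonstrate in Python (a sketch; the file name is just an example):

```python
# One application saves plain UTF-8 with no BOM...
with open("no_bom.csv", "w", encoding="utf-8") as f:
    f.write("Württemberg")

# ...and an importer that falls back to the Windows default
# code page (cp1252 on Western-European systems) sees mojibake:
with open("no_bom.csv", encoding="cp1252") as f:
    print(f.read())  # WÃ¼rttemberg
```

Which is exactly the WÃ¼rttemberg from the opening post.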
I realize this is all a bit technical, but it is intended to show that us developers (on Windows) cannot always guarantee getting this stuff right, as there is no completely ‘right’ way to do it.
I’ve occasionally had to use Notepad++ to edit a text file and change its encoding to deal with this issue; it can interpret the text in a variety of encodings and save in others. But you have to inspect the file yourself and make sure that text with non-ASCII characters shows up correctly.
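For a scriptable version of that Notepad++ re-save, something like this works (a sketch; file names are examples):

```python
def add_bom(src: str, dst: str) -> None:
    """Re-save a UTF-8 text file as UTF-8 with BOM, the form
    that is identified most reliably on Windows."""
    with open(src, encoding="utf-8") as f:
        text = f.read()
    # The "utf-8-sig" codec writes the BOM (EF BB BF) up front.
    with open(dst, "w", encoding="utf-8-sig") as f:
        f.write(text)

# demo with a throwaway file
with open("demo.csv", "w", encoding="utf-8") as f:
    f.write("Bächlingen,49.1,9.9\n")
add_bom("demo.csv", "demo_bom.csv")
```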
Thanks Paul. I think you’re right. I just opened the csv in Notepad++ to look at the encoding and it shows UTF-8, not with BOM. Another strange thing… I opened the file with LibreOffice to see if there was someplace to specify the encoding. I did a search for Bächlingen and it didn’t find it with the version pasted from a Google search. It is there though, with the correct umlaut mark. It just didn’t recognize the Google version. Strange. So, the solution to my problem is to make sure to use Notepad++, set the encoding to UTF-8-BOM, save, then re-import.
This ties into GEDCOM export tests I’m running with @GeorgeWilmes suggestion of visualizing with yEd.
yEd reported an invalid line during the import of a 26MB GEDCOM. Caused the import to abort.
When I checked that file with the Notepad++ text editor, I discovered it had substituted a NEL (next line control, 0x85) in 5 different spots in the CONTinuation lines of 2 Notes. It looks like they were originally some punctuation mark.
Went back to those Notes in my original Gramps tree and found non-printable characters in the same spots. (Fixed the GEDCOM & the original Notes.)
Both notes had been pasted into the Note Editor years ago. They’ve been through loads of cycles of repairs & export to XML into fresh trees. So I have no idea where the Next Line symbols were swapped in.
[Bug report 11850 filed on NextLine. The splitting of a 4 byte Unicode character into two 2 byte characters has not been isolated enough to write up… yet.]
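A small scan like the following (a sketch, not the tool used above) will surface NEL and similar invisible control characters before an export goes out:

```python
import unicodedata

def find_control_chars(path):
    """List (line, column, codepoint) for control/format characters
    other than ordinary tabs, carriage returns and newlines."""
    hits = []
    with open(path, encoding="utf-8", newline="") as f:
        for lineno, line in enumerate(f, 1):
            for col, ch in enumerate(line, 1):
                if ch in "\t\r\n":
                    continue
                if unicodedata.category(ch) in ("Cc", "Cf"):
                    hits.append((lineno, col, f"U+{ord(ch):04X}"))
    return hits

# demo: a NEL (U+0085) hiding after a full stop
with open("demo.ged", "w", encoding="utf-8", newline="") as f:
    f.write("1 NOTE End of sentence.\u0085Next part\n")
print(find_control_chars("demo.ged"))
```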
I have the same problem with newlines when copying bibliography from Zotero into text fields in Aeon Timeline… when I export the timeline to CSV, the lines in the CSV file that contain the bibliography string are broken because of the hidden next-line/newline character…
The same hidden character also occurs in the text fields I copy it into in Gramps, so in a text file like csv or gedcom, the “next line” will corrupt the file “standard”…
I have also found this in some text copied from webpage documents viewed in both Chrome and Firefox.
I also see it when I copy text from the internet into Notes in Zotero, and that stray character does not get removed by the “remove code” tool in Zotero either…
Nearly always this next-line/newline character comes after a “.” at the end of a sentence…
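One way to scrub pasted text before it lands in a note field (a sketch; the character list covers the usual suspects, not every possibility):

```python
# Unicode line breaks that sneak in via copy/paste: NEL (U+0085),
# line separator (U+2028) and paragraph separator (U+2029).
SUSPECTS = {"\u0085": " ", "\u2028": " ", "\u2029": " "}

def clean_paste(text: str) -> str:
    """Replace hidden line-break characters with plain spaces."""
    return text.translate(str.maketrans(SUSPECTS))

print(clean_paste("End of sentence.\u0085Next sentence."))
# End of sentence. Next sentence.
```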
Connie, some background questions:
Is it correct that you are running Gramps 5.1.2 on Windows?
Are you running Gramps in German or English? (some other language?)
Did your original issue have anything to do with importing or exporting?
Since this is ONLY occurring in Placenames, are you using any tools to harmonize your Place data?
I’ve had Württemberg in my place Tree for years without a decomposing umlaut issue in the Windoze version. It’s been through several XML import rebuild cycles.
Lordemannd and emyoulation, yes. 5.1.2 on Windows 10, using English. I mentioned above that I had exported as csv a while back and suspect that is the problem. It doesn’t import the same as it was exported. The fix is opening the csv file in Notepad++, setting the format to UTF-8 with BOM, saving, and then importing the csv back into Gramps.
That’s a good workaround. However, it might not be a good solution for the average user.
If the program cannot have confidence in the UTF byte order mark because the CSV file might have been edited, perhaps some sort of preflight verification feature could be added?
Perhaps the importer could find a UTF example in the CSV and ask the user which BOM interpretation is correct?
Excel does something similar for its text import where the user can correct for CSVs that are actually tab or positionally separated values. Or where the existence of a header (field name) label row is indeterminate.
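That preflight could be as simple as decoding the same non-ASCII line under several candidate encodings and letting the user pick the one that looks right (a rough sketch; the candidate list is an assumption, not anything Gramps does today):

```python
CANDIDATES = ["utf-8-sig", "utf-8", "cp1252", "latin-1"]

def preview_encodings(path: str) -> dict:
    """Decode the file under each candidate encoding and return the
    first line containing non-ASCII text as a sample for the user."""
    raw = open(path, "rb").read()
    samples = {}
    for enc in CANDIDATES:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            samples[enc] = "<cannot decode>"
            continue
        samples[enc] = next(
            (ln for ln in text.splitlines() if any(ord(c) > 127 for c in ln)),
            "<all ASCII>",
        )
    return samples

# demo: a UTF-8 file saved without a BOM
with open("demo.csv", "wb") as f:
    f.write("Place,Lat\nWürttemberg,48.8\n".encode("utf-8"))
for enc, sample in preview_encodings("demo.csv").items():
    print(f"{enc:10} {sample}")
```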
If you used Excel for saving the CSV, Excel uses the Windows-1252 code page as the default on English Windows…
So to save as UTF-8 using “Save as…”, you may need to change the encoding in the file save options by choosing the right type in the list…
If you don’t choose this “type” (see the screenshot), Excel will use your system’s default code page.
And the same can occur when importing to Excel: if the CSV’s code page is not specified, it uses the default, but that can be changed when importing files:
These screenshots were taken using a Norwegian Excel, and the file is already defined as UTF-8, but if you get “strange” characters in an import, try different code pages (Filopprinnelse, i.e. “File origin”)…
I’ve used OpenRefine and OpenOffice or LibreOffice to work with the data for places. I just exported a new csv file to see what options are given for opening it with LibreOffice, and there is UTF-8 but no BOM option.
The utf-8 BOM was developed to provide a solid way to identify utf-8 encoding in files. As far as I know, it is not mandatory, so files exist encoded in utf-8 both with and without the BOM. In my experience, programs that can use a utf-8 encoded file will accept files with or without the BOM. It is more likely that the program will automatically recognize the proper utf-8 encoding if it has the BOM.
I looked through the csv file and see that it isn’t only the placenames affected. One person’s name that had a special character, and some, but not all, of the coordinates are also gibberish. Examples: 51Â°20’42"N or even worse, 42Â°32â€²08â€³N and 41Ã‚Â°22Ã¢â‚¬Â²51Ã¢â‚¬Â³N. At first I thought it was only the entries where I used °, ’ and " rather than decimal form, but that isn’t the case, and some with degrees are ok. So now I’m really confused, because the ones that are fine aren’t only the ones I entered after working with the data in OpenRefine. Perhaps it made a difference whether the coordinates were entered separately or in the box below. I’ve done it both ways depending on where I got the data.
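Those layered examples look like text that went through the UTF-8-read-as-cp1252 mangle more than once. When no bytes were lost, each layer can be peeled off by reversing the round trip (a sketch; the third-party ftfy library automates this kind of repair far more robustly):

```python
def fix_mojibake(text: str, max_passes: int = 3) -> str:
    """Undo repeated 'UTF-8 bytes read as cp1252' damage,
    one layer per pass."""
    for _ in range(max_passes):
        try:
            repaired = text.encode("cp1252").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # no further layer to peel off
        if repaired == text:
            break
        text = repaired
    return text

print(fix_mojibake("42Ã‚Â°32N"))  # 42°32N  (two layers removed)
```

Strings where only some characters were mangled (like the first example, which mixes intact quotes with mojibake) won’t round-trip as a whole; this sketch just leaves those unchanged, which is where ftfy’s smarter per-segment repair earns its keep.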
Could you check a few of those problematic GPS coordinates a bit more deeply in Gramps?
It’s odd that you have a mix of straight & curly quotes. (Although the wiki uses the same mix.)
I filed a bug report a while back about how mixed formats for coordinates was causing some problems in Gramps reports.
I wonder if we don’t need a mass validator that harmonizes all the coordinates to a preferred supported coordinate format ?
It seems like we would want to select the highest precision format for internal storage but allow display/reports/export to convert on-the-fly to any of the supported formats. That would eliminate double-conversion rounding errors and require less validation for every use.
Personally, I find a decimal format easiest for data entry. (Provided that coordinate pairs can be 10-keyed… the comma slows me down since it requires a right-hand repositioning on the keyboard.) But displaying in DMS (degrees, minutes, seconds; e.g., N50°52’21.92" , E124°52’21.92") notation is faster for comprehension & manual map lookup. But the keyboard code for the degree symbol is a pain to remember.
(I belatedly discovered that 52’21.92"N , 124°52’21.92"E isn’t nearly as efficient when using a globe… you have 2 equal hemisphere possibilities until the last glyph of each coordinate.)
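Storing decimal internally and converting on the fly for display is straightforward; here is a sketch of the display-side conversion (the formatting details are my assumptions, not what Gramps actually does, and rounding at the 60-second boundary is ignored):

```python
def decimal_to_dms(value: float, is_lat: bool) -> str:
    """Render a decimal-degrees coordinate as degrees, minutes,
    seconds with a trailing hemisphere letter."""
    hemi = ("N" if value >= 0 else "S") if is_lat else ("E" if value >= 0 else "W")
    v = abs(value)
    degrees = int(v)
    minutes = int((v - degrees) * 60)
    seconds = (v - degrees) * 3600 - minutes * 60
    return f'{degrees}°{minutes:02d}\'{seconds:05.2f}"{hemi}'

print(decimal_to_dms(48.7945, True))   # 48°47'40.20"N
print(decimal_to_dms(-9.3456, False))  # 9°20'44.16"W
```

Keeping one high-precision decimal value in storage and formatting only at display time avoids the double-conversion rounding errors mentioned above.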
What do you mean by checking a bit more deeply in Gramps? I might know how the mix of formats happened. When I was using OpenRefine to create the hierarchy, I worked on a bunch, opened the csv file in OO Calc, then tried to remove all that I hadn’t worked on, importing that smaller file. That would explain some coordinates being correct. It seems that all special characters and letters import back in wrong. Now, thanks to Jaran, I am using the Place Update tool.
ETA: I may have misunderstood you. If you are only referring to the mix of straight and curly quotes that could also come from me hand typing some, copy and paste others. I use the degree symbol a lot so know the code. alt 0176 creates °.
I’m suggesting that you could note a number of Place (name or ID) oddities in the CSV. Then, you could inspect the coordinates in the Gramps Place Editor.
See if the internal validator shades any in red… indicating unrecognizable data.
I like the PlaceCleanup gramplet too. But I wish it remembered what I want to keep. And had an option bring in JUST the Lat/Long. (Harmonizing IDs, enclosing MCDs, Postal codes, etc. with GeoNames is working against my objectives right now.)
The csv I now have won’t do any good for showing what was there before, because it is exporting the oddities that were imported from a previous csv file. Sorry. But yes, those bad coordinates are shaded in red and read “error in format” in the place editor.