GEDCOM Character encoding mismatch

Is there a way to force the GEDCOM import to override the encoding that is incorrectly stored in the file?

I received a copy of a 17 MB, 2010-vintage GEDCOM of one of my family lines (Ferree) that was generated with Legacy Family Tree 7. The file is mangled in too many ways to count.

But the first problem is that the “CHAR” GEDCOM line has an invalid “ANSI” value. Worse, the content is full of backslash-delimited two-character hexadecimal values for special characters. So, beyond the 57,614 lines where the Gramps GEDCOM import has VERY valid complaints about Legacy’s custom tags, there are an uncounted number of lines where Gramps (and some text-editing tools) reports unreadable characters.

Does anyone know of an encoding format that will force the Gramps GEDCOM import to recognize something like \F8 as an ø character? Or is there a way to hack the module to do so?

(My exemplar is tree #2823 from Vol. 3 of the 1997 Brøderbund WFT genealogical CD-ROMs being cited as a Source. Yep… the data is THAT crappy.)

After I get a Windows box up and running again, I do intend to download their version 9 software and see if Legacy Family Tree 9 writes a better GEDCOM than version 7 did.

(In a similar vein, I’ve found that the text import addon demands a vCard version 3.0 file. But every vCard I’ve exported from a smartphone seems to be version 2.1. The VCFs are pretty basic and mostly compatible, so I am able to just tweak the version in the VCF to 3.0 with a text editor and import anyway. But it would be nice to force the VCF importer to try even though the version declaration is a mismatch.)
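The tweak itself is trivial. Here is a sketch of it in Python, with a placeholder filename, assuming the VCF is plain UTF-8 text:

```python
# Sketch of the one-line tweak: bump the vCard version so the importer's
# check passes. "contacts.vcf" is a placeholder name; assumes the file
# is plain UTF-8 text.
text = open("contacts.vcf", encoding="utf-8").read()
open("contacts-v3.vcf", "w", encoding="utf-8").write(
    text.replace("VERSION:2.1", "VERSION:3.0"))
```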

If you can work out what character encoding the GEDCOM uses, the iconv command-line tool can probably change it to UTF-8. Assuming it really is Windows “ANSI” (cp1252), and with placeholder filenames, something like:
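```sh
# Assumes the source really is cp1252; filenames are placeholders.
iconv -f WINDOWS-1252 -t UTF-8 ferree.ged > ferree-utf8.ged
```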

Take normal precautions: only work on a copy, take backups, floss before bed, etc.

Craig

My big concern with fixing this delimiter via find & replace is limiting the scope. I was thinking this would be easier within the natural constraints of the GEDCOM parsing.

I’m worried that the backslash will be used elsewhere and coincidentally look like a hexadecimal escape, such as in URIs that might be in a Note or a Description. (The data in this particular GEDCOM is so very inconsistent that ANYTHING is possible.)
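If the \XX sequences turn out to be literal text in the file (rather than just how an editor displays raw bytes), the kind of scoped pass I have in mind looks something like this sketch, with placeholder filenames, skipping any line that looks like it contains a URI or a Windows path:

```python
import re

# Sketch only: decode \F8-style escapes as cp1252 characters, but leave
# backslashes alone on lines that look like they hold a URI or a Windows
# path. Assumes the \XX pairs are literal text; filenames are placeholders.
HEX_ESCAPE = re.compile(r"\\([0-9A-Fa-f]{2})")
LOOKS_RISKY = re.compile(r"://|[A-Za-z]:\\")

def decode_escape(match):
    return bytes([int(match.group(1), 16)]).decode("cp1252")  # \F8 -> ø

with open("ferree.ged", encoding="cp1252") as src, \
     open("ferree-fixed.ged", "w", encoding="utf-8") as dst:
    for line in src:
        if LOOKS_RISKY.search(line):
            dst.write(line)  # leave URIs and Windows paths untouched
        else:
            dst.write(HEX_ESCAPE.sub(decode_escape, line))
```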

Three things:

  1. If you want, you can change the CHAR value, and see what it does,
  2. Legacy works quite well under Wine, so you can run it without a Windows machine,
  3. You can let me have a look, if you trust the file to be safe in my hands.

Changing the CHAR tag value from the invalid ANSI to UTF-8, ANSEL, UNICODE, or ASCII wasn’t successful in fixing the “invalid character” error during import. However, I’m suspicious of the Linux text editors being used for that change; they might be tweaking more than the CHAR line. (It took decades to populate my toolbox of dependable utilities. I’ve only had Fedora for about a quarter of a year.)

This ‘Ferree’ GEDCOM is not confidential. Glad to forward it by email. (Again, someone REALLY didn’t have a good understanding of data entry for a database.) After resolving the invalid characters, there are 57,000 error lines to fix during import, and after THAT, there will have to be a massive realignment of data. (Thousands of “Places” are actually other types of information.)

OK, I understand. I hope that a zipped GEDCOM file is small enough to be passed through Yahoo and Gmail, and if not, there’s also Dropbox, or OneDrive, whatever you like.

I have a few GEDCOM filters, in C, C#, and Python, so if I can find something that looks like a pattern, I can probably write some additional software for it. And I can also try it on Legacy, if that helps. I’ve never really liked it, even though it’s one of the few American programs with a Dutch translation, and even the only one among those that can speak FamilySearch.

iconv is not a blind search-and-replace tool; it will report unrecognized byte sequences as errors. Via the command-line options, you can tell iconv how to handle specific exceptions, if necessary. Try it and then open the new file with a basic text editor to see if the replacements look sensible. You can use ‘diff old new | sort | uniq’ to see a list of substitutions made.
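For example (the encoding name and filenames here are my guesses, not tested against your file):

```sh
# -c drops bytes that can't be converted instead of stopping with an error;
# //TRANSLIT on the target asks for close equivalents where possible.
iconv -c -f WINDOWS-1252 -t UTF-8//TRANSLIT ferree.ged > ferree-utf8.ged
diff ferree.ged ferree-utf8.ged | sort | uniq
```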

Craig


OK, here’s a quick answer about the character set, as seen with Notepad on Windows 11.

First of all, the character encoding is right: it is ANSI. I can see that because I see text like Brøderbund, where the ø is indeed encoded as a single byte, which appears as that \F8 in an editor that doesn’t understand ANSI. Notepad does.

Second, when I search for ‘rbund’, I also see BrdÌœerbund, and that’s UTF-8 with a typo. The ø should appear before the d, but I found it behind it, encoded as Ìœ, which are two valid ANSI characters, so Notepad thinks that everything is OK, but it is a mess for a human like me. And this one is quite easy to repair with a search and replace on the whole name. No coding required.

There may be other names that were messed up in this way, but this is the only one I spotted tonight. If you know of others, please mention them in your mail.

Since many two-byte UTF-8 sequences are legal in ANSI too, they will not be reported, but the mess is still there, and when needed, I may try to write a program that finds them by looking for sequences of two non-ASCII characters, like that Ì followed by œ.
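A minimal sketch of that scan in Python, assuming the file is cp1252 and using a placeholder filename, could be as simple as flagging a lead byte in the 0xC2-0xDF range followed by a continuation byte in 0x80-0xBF:

```python
import re

# Flag byte pairs that look like two-byte UTF-8 sequences inside an
# ANSI (cp1252) file: a lead byte 0xC2-0xDF followed by a continuation
# byte 0x80-0xBF. "ferree.ged" is a placeholder name.
data = open("ferree.ged", "rb").read()
for m in re.finditer(rb"[\xC2-\xDF][\x80-\xBF]", data):
    pair = m.group()
    line_no = data.count(b"\n", 0, m.start()) + 1
    print(f"line {line_no}: {pair.hex()} looks like UTF-8 "
          f"{pair.decode('utf-8')!r}, shown in cp1252 as "
          f"{pair.decode('cp1252', 'replace')!r}")
```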

Things like these happen a lot when people paste text from web pages, which is most often encoded as UTF-8 (this one too), into a program that’s working in ANSI mode. The program will happily store the two bytes that make up the UTF-8 sequence for ø, and many times only a human will see that it’s wrong.

And because of this, it’s no use trying UTF-8 and other character sets, for the simple reason that in this file, the character sets were mixed up by these pastes from the web. It is very, very common.

Yes, that is exactly what I found, and “Brøderbund” was the reference record I used too. (Although the Notes have encoded characters too.)

But GEDCOM specifically excludes ANSI as an encoding, since there are multiple variants and no clear standard.

I figured the cleanup was going to have to wait until I can get my belongings out of storage and work on recovering the fried Windoze desktop. The file has been this way for 13 years; there’s no rush. Once that’s running, I can search for backslashes in a hex editor and clean up one type of encoded character at a time, then convert the encoding to UTF-8 and work on the next problem.

What nags at me is that both text editors claim the file includes unreadable characters. (One actually abandons the file.) That should not be the case if characters encoded with a backslash and hex are the problem. So there is something else wrong with the file.

My standard Linux text editor shows an error here too, but you can also install Kate, designed for KDE, but compatible with other desktops. I use Kate for large texts like this one.

Windows notepad doesn’t show errors either, which is quite logical, because it runs on Windows, and this GEDCOM comes from a Windows app too.

You may also notice that when you try to search for \F8, you won’t find it, because that’s not the actual text. It’s a display trick.

I do know that Gramps most probably detects all of these anyway, because it warns about control characters in the import log, which I saved as a text file to be sure. And that was a good idea, because these warnings are not included in the GEDCOM import notes.

I hope to find time to fix most of these things by tomorrow or Tuesday, with Notepad, on Windows.

And as a closing remark, here’s what Tamura Jones wrote about this:


I did some further tests, and when I open the GEDCOM with Kate, I see the detected character set mentioned in the lower right corner, but it’s wrong: it detects ISO-8859-15 and displays white rectangles for some characters. I can only see what I pasted here earlier when I select cp1252, which it displays as windows-1252. That’s on Linux Mint.

When I open the file with Notepad on Windows, this is the default, so it displays all characters as I pasted them yesterday. Neither Kate (on Linux) nor Notepad (on Windows) complains about unreadable characters. Xed does complain, however.

What are the names of your editors?

I faced such a situation when importing very old Mac OS Classic files into my Fedora system. I wrote a macro generator to handle such cases. It was validated against a conversion from 19th-century Norwegian to Bokmål. It is based on a two-tier strategy and can cope with context, so as not to mess with Windows file paths if some are present in the file. What is needed is a precise specification of how the reverse solidus is used. Then I can write macros to transcode to Unicode.

As an example, a 700k-character novel in Norwegian was converted in 4 seconds.

I did some further tests today and found that the misspelled Brøderbund name is the only text that has UTF-8 sequences in it. I found that by asking ChatGPT for a way to detect those in an ANSI file. It tried to evade the question by saying that there is no sure way to do that and that I should use the tools mentioned earlier in this thread, but I persisted and wrote that I wanted to find possible UTF-8 sequences. It then gave me some Python code, and later rewrote the code in C#, so that I could run it with Visual Studio on Windows 11. That confirmed that the misspelled Brøderbund name was indeed the only one with two-byte UTF-8 sequences in it, so I did a global search and replace to repair it. There were a few dozen occurrences, because the misspelled name had been pasted lots of times.

When I imported the corrected file, Gramps still complained about control characters and advised changing the encoding to CHAR cp1252, and that helped. The codes shown in the logs were all single-byte codes that appear to be control codes in ISO 8859-1, a.k.a. ISO Latin 1, but are real characters in cp1252. And Gramps is perfectly able to handle those, when told.
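The difference is easy to demonstrate, for example from Python:

```python
# Bytes 0x80-0x9F are control codes in ISO 8859-1 but printable
# characters in cp1252 (curly quotes, the ellipsis, and so on).
for b in (0x85, 0x91, 0x92, 0x93, 0x94):
    raw = bytes([b])
    print(f"0x{b:02X}: latin-1 {raw.decode('latin-1')!r}, "
          f"cp1252 {raw.decode('cp1252')!r}")
```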

In theory, this means that you should be able to use Kate, or another editor that speaks cp1252, to replace the misspelled Brøderbund with the right one and replace CHAR ANSI with CHAR cp1252, but I will mail a working GEDCOM too.
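If you would rather keep editors out of it entirely, both repairs can also be made at the byte level. A sketch, assuming the header line really reads 1 CHAR ANSI and the bad name contains exactly the CC 9C byte pair quoted earlier, with placeholder filenames:

```python
# Two byte-level edits, avoiding any editor re-encoding the rest of
# the file. Assumes the header reads "1 CHAR ANSI" and the bad name is
# exactly "Brd" + 0xCC 0x9C + "erbund"; filenames are placeholders.
data = open("ferree.ged", "rb").read()
data = data.replace(b"Brd\xcc\x9cerbund", b"Br\xf8derbund")  # restore cp1252 ø
data = data.replace(b"1 CHAR ANSI", b"1 CHAR cp1252")        # what Gramps suggested
open("ferree-fixed.ged", "wb").write(data)
```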

You will still see thousands and thousands of unknown GEDCOM tags, lots of places used for other types of information, and dates with WFT in them, and maybe some other soul can help you write an Isotammi script for those.

I did some tests with Legacy 9, but I don’t expect that it can cure much. It can export GEDCOM files in PAF 5 and GEDCOM 5.5.1 format, saving you from a few unwanted tags, but there are other ways to get rid of those too, using existing Linux tools, which may also give you a chance to try more useful tags.

