Choosing a Character Encoding for exporting through GEDCOM (wiki)

Choosing an appropriate character encoding is crucial when exporting genealogical data to GEDCOM format. The right encoding ensures that all characters (especially those from non-English languages or special symbols) are preserved and interpreted correctly across different systems.

ANSEL was the recommended default in GEDCOM 5.5, and genealogical applications frequently misuse the ANSEL label when the data actually uses a different encoding. The Gramps GEDCOM importer includes an Encoding options dialog for cases where the data encoding does not match the label.
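
A mislabeled file can often be spotted before import. The following is a minimal Python sketch, not how Gramps does it: the function names are hypothetical, it assumes a single-byte or UTF-8 file (not UTF-16), and the check is only a heuristic. It relies on the fact that ANSEL and CP1252 text with accented letters is almost never valid UTF-8.

```python
import re
from typing import Optional

def declared_charset(raw: bytes) -> Optional[str]:
    """Return the value of the HEAD.CHAR line, or None if it is absent."""
    # The header itself is normally plain ASCII, so latin-1 is a safe way
    # to scan the first few kilobytes without raising decode errors.
    head = raw[:4096].decode("latin-1")
    match = re.search(r"^\s*1\s+CHAR\s+(\S+)", head, flags=re.MULTILINE)
    return match.group(1).upper() if match else None

def looks_like_utf8(raw: bytes) -> bool:
    """True if the bytes are valid UTF-8 and contain non-ASCII bytes."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return any(b >= 0x80 for b in raw)

def label_mismatch(raw: bytes) -> bool:
    """Heuristic: the file is labelled ANSEL but its bytes are really UTF-8."""
    return declared_charset(raw) == "ANSEL" and looks_like_utf8(raw)
```

For example, a file whose header says `1 CHAR ANSEL` but whose names were written as UTF-8 would be flagged, while the same text saved as CP1252 would not.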

The choice of character encoding for GEDCOM export is often driven by the specific characters present in your genealogical data. Here are some examples:

  1. ANSEL (American National Standard for Extended Latin Alphabet Coded Character Set):

    • Appropriate for data with European language characters, including those with diacritical marks.
      Example: Names with diacritical marks, like “François” or “Müller”, would require ANSEL or a more comprehensive encoding.
    • Provides a table of coded values for representing characters of the extended Latin alphabet in machine-readable form for 35 languages written in the Latin alphabet and 51 romanized languages.
    • 8-bit ANSEL (also known as ANSI/NISO Z39.47-1985, a copyrighted standard superseded in 2013) is the preferred character set in the GEDCOM 5.5.1 standard.
  2. ASCII (American Standard Code for Information Interchange):

    • Good for basic English letters, numbers, and common punctuation.
    • Works well if you only have simple English text.
    • Widely compatible.
  3. UTF-8 (Unicode Transformation Format - 8-bit):

    • The GEDCOM 7 specification allows only UTF-8 as the character encoding.
    • Ideal for data with characters from multiple languages or scripts.
    • Example: If your database contains names in both Latin and Cyrillic scripts, UTF-8 would be the best choice.
  4. UTF-16 (Unicode Transformation Format - 16-bit):

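    • Similar to UTF-8 in capability but less common in GEDCOM files.
    • Can be more compact than UTF-8 for text dominated by non-Latin characters inside the Basic Multilingual Plane (such as Chinese, Japanese or Korean); characters outside that plane take four bytes in both encodings.
    • Supports compound Character Encoding Schemes (CES) such as escape sequences.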
  5. CP1252 (Windows-1252):

    • While not officially part of the GEDCOM standard, CP1252 is still commonly used, especially in Windows-based systems.
    • Supports most Western European languages and some additional symbols.
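
To get a feel for the trade-offs in the list above, here is a small Python sketch that reports how many bytes a string takes in several of these encodings, or `None` when the encoding cannot represent it. The helper name `encoded_sizes` is hypothetical. ANSEL is left out because, as far as I know, Python's standard library does not ship an ANSEL codec. `utf-16-le` is used so that the two-byte byte order mark is not counted.

```python
def encoded_sizes(text, encodings=("ascii", "cp1252", "utf-8", "utf-16-le")):
    """Map each encoding name to the encoded byte length, or None if unsupported."""
    sizes = {}
    for enc in encodings:
        try:
            sizes[enc] = len(text.encode(enc))
        except UnicodeEncodeError:
            # The encoding has no code for at least one character in the text.
            sizes[enc] = None
    return sizes

if __name__ == "__main__":
    for name in ("Miller", "M\u00fcller", "\u0418\u0432\u0430\u043d\u043e\u0432"):
        print(name, encoded_sizes(name))
```

A plain ASCII name costs the same in ASCII, CP1252 and UTF-8, while UTF-16 doubles it. A Cyrillic name cannot be written in ASCII or CP1252 at all.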

The sample.ged file from the Wikipedia GEDCOM article declares “ANSEL” as its character encoding.

The presence of specific characters can necessitate a particular encoding. For instance:

  • Euro symbol (€): This character is present in CP1252 and UTF-8/UTF-16, but absent in ISO-8859-1 [4]. If your data includes the Euro symbol, you would need to choose an encoding that supports it, such as CP1252 or UTF-8.

  • Cyrillic characters: If your genealogical data includes Russian or other Slavic names, you would need to use UTF-8 or a Cyrillic-specific encoding like Windows-1251 [5].

  • Diacritical marks: Names with characters like “ñ”, “ö”, or “é” would require ANSEL, UTF-8, or CP1252, as ASCII doesn’t support these.
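
The cases above can be checked mechanically. This is a rough Python sketch that tries each candidate encoding and keeps the ones that can represent a given string. The function name and the candidate list are my own choices for illustration, and ANSEL is again missing because Python has no built-in codec for it.

```python
# Candidate encodings, using the names Python's codecs module recognises.
CANDIDATES = ("ascii", "latin-1", "iso-8859-15", "cp1252", "cp1251", "utf-8")

def supporting_encodings(text, candidates=CANDIDATES):
    """Return the candidate encodings that can represent every character in text."""
    supported = []
    for enc in candidates:
        try:
            text.encode(enc)
        except UnicodeEncodeError:
            continue  # at least one character has no code in this encoding
        supported.append(enc)
    return supported

if __name__ == "__main__":
    for sample in ("\u20ac", "\u0418\u0432\u0430\u043d\u043e\u0432", "\u00f1"):
        print(repr(sample), supporting_encodings(sample))
```

Running it on the Euro symbol, a Cyrillic surname and “ñ” shows at a glance which encodings are ruled out for each.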

When exporting to GEDCOM, it is important to:

  1. Check what kinds of letters and symbols you have in your data. (Names, Event descriptions and Notes are most likely to have special symbols.)
  2. Choose the most appropriate encoding that supports all your characters.
  3. Ensure your genealogy software correctly specifies the encoding in the GEDCOM header (HEAD.CHAR line) [3].
  4. Test the exported file by importing it into different genealogy programs to verify character integrity.
  5. Avoid special characters in filenames and file paths. Even if the target genealogy software supports them, the operating system can layer on extra transformations that make file handling fragile.
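
Steps 1 and 4 of the checklist can be partly automated. The sketch below is only an illustration with hypothetical helper names: it assumes the record text has already been read into Python strings. It tallies every non-ASCII character so you can see what your data actually contains, and checks whether a string survives an encode/decode cycle in a chosen encoding.

```python
from collections import Counter
import unicodedata

def non_ascii_inventory(lines):
    """Count every non-ASCII character found in an iterable of strings."""
    counts = Counter()
    for line in lines:
        counts.update(ch for ch in line if ord(ch) > 127)
    return counts

def describe(counts):
    """Return (character, Unicode name, count) rows, most frequent first."""
    return [(ch, unicodedata.name(ch, "UNKNOWN"), n)
            for ch, n in counts.most_common()]

def round_trips(text, encoding):
    """True if text survives an encode/decode cycle in the given encoding."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeEncodeError:
        return False
```

This does not replace importing the exported file into other programs. It only tells you early whether the encoding you picked can hold your data at all.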

By selecting the right character encoding for your specific data, you preserve names, places, and other details of your family history and ensure that they can be read by other people and programs.

Citations:
[1] Gramps 5.2 wiki: Exporting Data - GEDCOM
[2] UTF-8 to ISO-8859-1 conversion of euro symbol
[3] GEDCOM Character Encodings by Tamura Jones
[4] Table Comparing Characters in Windows-1252, ISO-8859-1, ISO-8859-15
[5] GEDCOM encoding - Ahnenblattportal
[6] Differences between ANSI, ISO-8859-1 and MacRoman character sets
[7] StamboomNederland GEDCOM Quality
[8] FH6 Gedcom export to TNG - Family Historian User Group
