A local AI for Gramps?

I’m more or less doing the same kind of tests as

But surname and place name matching remains an issue for these AI systems.

Recently, I learned that one of my branches originated from Bavaria (Bavarians might complain, since the territorial boundaries aren’t very clear…). The surname in French (well, Frenchified) is Haentzler, sometimes Häntzler to reflect the local pronunciation! But in Bavaria, it was more likely Hänßeler, or even Hensler. Apart from the diacritics on certain letters and local dialect variations, they all sound very similar. Explaining that to a machine? No thanks!

It’s quite simple, but for example:

It’s clearly a lineage that passed through France (specifically the East) before migrating to the United States, rather than originating from Germany (well, maybe from a German port, but coming from France!). In the end, a surname with “French-sounding” qualities (like Henseler) isn’t emphasized more; descendants living in Germany might think they are originally from France (or rather, Switzerland), and name changes become frequent after migration.

There are many such examples…

Indeed, with this example of a surname in the USA (rarely represented but present), the probability of migration from Alsace was high. However, the branch originating from Ulm in Germany clearly shows that there are no rules.

https://sortedbyname.com/letter_h/haentzler.html

Nevertheless, with first names such as Jean[1], Jeanette, and Joseph, one still tends to think of a Francophile family.

AI, even when run locally, will likely work in English internally and interact with “the client” through a translated interface. We are dealing with distortions reminiscent of school exercises, akin to the “telephone game” (good luck to the automatic translator…). Probably, the AI will initially be biased and default to the Ulm, Germany branch, generalizing from it.

What a strange idea to impose borders on physical and administrative territories. And why not assign a fixed name to a lineage? More seriously, it is within these nebulous clusters of names, sources, and territories that AI would benefit from greater flexibility.

A Soundex/Phonex system could assist in this. In such a case, we would have an overwhelming volume of data without efficient filtering. For example:

https://sortedbyname.com/letter_h/hensler.html
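
To make the Soundex idea concrete, here is a minimal sketch of classic American Soundex applied to the spellings mentioned in this thread. It is my own illustration, not the Gramps `soundex` module; the helper name `ascii_upper` and the choice to expand ß to ss before stripping diacritics are assumptions.

```python
import unicodedata

# Classic Soundex consonant groups; vowels and H, W, Y carry no code.
_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        _CODES[ch] = digit


def ascii_upper(name):
    """Expand ß, strip diacritics, keep only the letters A-Z (upper case)."""
    name = name.replace("ß", "ss")
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed.upper() if "A" <= c <= "Z")


def soundex(name):
    """Return a 4-character Soundex key, or "Z000" for an empty name."""
    letters = ascii_upper(name)
    if not letters:
        return "Z000"
    key = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate two identical codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code
    return (key + "000")[:4]


for variant in ("Haentzler", "Häntzler", "Heaentzler",
                "Hensler", "Hänßeler", "Henseler"):
    print(variant, soundex(variant))
```

With this sketch the spellings containing a "t" fall under one key and the spellings without it under another, so Soundex narrows the search but does not merge every variant on its own.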

To index Paris civil records, perhaps AI will handle the task—but not beyond 20 km from the capital, lest it accumulate errors.

[1] In Alsace, one will find Jean = Johann = Hans…

HAENTZLER, MARCEL was born 4 November 1903 in MULHOUSE, France, son of ALFONSO HEAENTZLER (father)

Alfonso! I didn’t see that coming… Clearly, the American civil registrar was of Hispanic background. The father of this “distant American cousin,” Alphonse Haentzler, becomes Alfonso Heaentzler. I’ll add von Mulhausen (or Dumoulin) to the surname, as a particle associated with a classic error in Alsace. In the local dialect, this means “from Mulhouse,” but to an outsider, it might suggest a noble particle… One quickly stumbles upon an imaginary trail of high lineage. Surely a “Habsburg”! :blush:

P.S.: When writing in English, Alsatians sometimes first think in German! English (the language, not the nationality) is, in fact, more “related” to “Germanic” than to “Latin.” I imagine it’s somewhat similar for Canadians (here, referring to inhabitants of Canada), Belgians, Swiss, etc., who regularly switch from one language to another.

Funny, I had made a relatively clean patch back then!

--- finddupes.glade	2010-10-24 17:05:33.000000000 +0200
+++ finddupes.glade	2010-12-17 18:43:53.000000000 +0100
@@ -264,6 +264,25 @@
                   </packing>
                 </child>
                 <child>
+                  <object class="GtkCheckButton" id="phonex">
+                    <property name="label" translatable="yes">Use phonex codes</property>
+                    <property name="visible">True</property>
+                    <property name="can_focus">True</property>
+                    <property name="receives_default">False</property>
+                    <property name="use_underline">True</property>
+                    <property name="active">True</property>
+                    <property name="draw_indicator">True</property>
+                  </object>
+                  <packing>
+                    <property name="left_attach">1</property>
+                    <property name="right_attach">2</property>
+                    <property name="top_attach">5</property>
+                    <property name="bottom_attach">6</property>
+                    <property name="x_options">GTK_FILL</property>
+                    <property name="y_options"></property>
+                  </packing>
+                </child>
+                <child>
                   <object class="GtkComboBox" id="menu">
                     <property name="visible">True</property>
                     <property name="model">liststore1</property>
--- FindDupes.py	2010-12-06 10:02:02.000000000 +0100
+++ FindDupes.py	2010-12-17 18:51:58.000000000 +0100
@@ -41,6 +41,7 @@
 from gui.utils import ProgressMeter
 from gui.plug import tool
 import soundex
+import phonex
 from gen.display.name import displayer as name_displayer
 from QuestionDialog import OkDialog
 import ListModel
@@ -87,6 +88,7 @@
 #-------------------------------------------------------------------------
 class Merge(tool.Tool,ManagedWindow.ManagedWindow):
      
     def __init__(self, dbstate, uistate, options_class, name, callback=None):
         
         tool.Tool.__init__(self, dbstate, options_class, name)
@@ -102,12 +104,14 @@
         self.removed = {}
         self.update = callback
         self.use_soundex = 1
+        self.use_phonex = 0
 
         top = Glade()
 
         # retrieve options
         threshold = self.options.handler.options_dict['threshold']
         use_soundex = self.options.handler.options_dict['soundex']
+        use_phonex = self.options.handler.options_dict['phonex']
 
         my_menu = gtk.ListStore(str, object)
         for val in sorted(_val2label):
@@ -117,6 +121,10 @@
         self.soundex_obj.set_active(use_soundex)
         self.soundex_obj.show()
         
+        self.phonex_obj = top.get_object("phonex")
+        self.phonex_obj.set_active(use_phonex)
+        self.phonex_obj.show()
+        
         self.menu = top.get_object("menu")
         self.menu.set_model(my_menu)
         self.menu.set_active(0)
@@ -158,6 +166,7 @@
     def on_merge_ok_clicked(self, obj):
         threshold = self.menu.get_model()[self.menu.get_active()][1]
         self.use_soundex = int(self.soundex_obj.get_active())
+        self.use_phonex = int(self.phonex_obj.get_active())
         try:
             self.find_potentials(threshold)
         except AttributeError, msg:
@@ -166,6 +175,7 @@
 
         self.options.handler.options_dict['threshold'] = threshold
         self.options.handler.options_dict['soundex'] = self.use_soundex
+        self.options.handler.options_dict['phonex'] = self.use_phonex
         # Save options
         self.options.handler.save_options()
 
@@ -252,6 +262,11 @@
                 return soundex.soundex(val)
             except UnicodeEncodeError:
                 return val
+        elif self.use_phonex:
+            try:
+                return phonex.phonex_fr(val)
+            except UnicodeEncodeError:
+                return val
         else:
             return val
 
@@ -667,12 +682,16 @@
         # Options specific for this report
         self.options_dict = {
             'soundex'   : 1,
+            'phonex'    : 0,
             'threshold' : 0.25,
         }
         self.options_help = {
             'soundex'   : ("=0/1","Whether to use SoundEx codes",
                            ["Do not use SoundEx","Use SoundEx"],
                            True),
+            'phonex'   : ("=0/1","Whether to use PhonEx codes",
+                           ["Do not use PhonEx","Use PhonEx"],
+                           True),
             'threshold' : ("=num","Threshold for tolerance",
                            "Floating point number")
             }
#
# -*- coding: UTF-8 -*-
#
# Gramps - a GTK+/GNOME based genealogy program
#
# Copyright (C) 1999  Frédéric Brouard
# Copyright (C) 1999  Florence Marquis
# Copyright (C) 2005  Christian Pennaforte
# Copyright (C) 2005  Florent Carlier
# Copyright (C) 2010  FR #4468
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
#

"""
Provide phonex calculation
"""

#-------------------------------------------------------------------------
#
# Standard python modules
#
#-------------------------------------------------------------------------
import string
import unicodedata
import re

#-------------------------------------------------------------------------
#
# constants 
#
#-------------------------------------------------------------------------
IGNORE = "HW~!@#$%^&*()_+=-`[]\|;:'/?.,<>\" \t\f\v"
TABLE  = string.maketrans('ABCDEFGIJKLMNOPQRSTUVXYZ', 
                          '012301202245501262301202')

#-------------------------------------------------------------------------
#
# phonex - returns the phonex value for the specified string
#
#-------------------------------------------------------------------------
   
def phonex_fr(strval):
    "Return the phonex value to a string argument for french language."
    
    if strval is None:
        return "Z000"
        
    r = strval.encode('UTF-8')
        
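    # 1 replace y with i
    r = r.replace('Y','I')
    
    # see step 7
    r = r.replace(u'É','Y')
    r = r.replace(u'È','Y')
    r = r.replace(u'Ê','Y')
    # NOTE: r is rebuilt from strval just below, so the replacements above
    # do not reach the final key as written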
                
    r = unicodedata.normalize('NFKD', 
                    unicode(strval.upper().strip())).encode('ASCII', 'ignore')
                    
    if not r:
        return "Z000"
  
    # 2 remove every h that is not preceded by c, s or p
    r = re.sub(r'([^P|C|S])H', r'\1', r)

    # 3 replace ph with f
    r = r.replace(r'PH', r'F')
  
    # 4 replace the following letter groups:
    r = re.sub(r'G(AI?[N|M])',r'K\1', r)
  
    # 5 replace the following occurrences when they are followed by 
    # a letter a, e, i, o or u:
    r = re.sub(r'[A|E]I[N|M]([A|E|I|O|U])',r'YN\1', r)
    
    # 6 replace 3-letter groups ('o', 'oua', 'ein' sounds):
    r = r.replace('EAU','O')
    r = r.replace('OUA','2')
    r = r.replace('EIN','4')
    r = r.replace('AIN','4')
    r = r.replace('EIM','4')
    r = r.replace('AIM','4')
  
    # 7 replace the É sound:
    # see above
    r = r.replace('AI','Y')
    r = r.replace('EI','Y')
    r = r.replace('ER','YR')
    r = r.replace('ESS','YS')
    r = r.replace('ET','YT')
    r = r.replace('EZ','YZ')

    # 8 replace the following 2-letter groups ('an' and 'in' sounds),
    # unless they are followed by a letter a, e, i, o, u
    # or by a sound 1 to 4:
    r = re.sub(r'AN([^A|E|I|O|U|1|2|3|4])',r'1\1', r)
    r = re.sub(r'ON([^A|E|I|O|U|1|2|3|4])',r'1\1', r)
    r = re.sub(r'AM([^A|E|I|O|U|1|2|3|4])',r'1\1', r)
    r = re.sub(r'EN([^A|E|I|O|U|1|2|3|4])',r'1\1', r)
    r = re.sub(r'EM([^A|E|I|O|U|1|2|3|4])',r'1\1', r)
    r = re.sub(r'IN([^A|E|I|O|U|1|2|3|4])',r'4\1', r)

    # 9 replace s with z when it is both preceded and followed by one of
    # the letters a, e, i, o, u or by a sound 1 to 4
    r = re.sub(r'([A|E|I|O|U|Y|1|2|3|4])S([A|E|I|O|U|Y|1|2|3|4])',r'\1Z\2',r)

    # 10 replace the following 2-letter groups:
    r = r.replace('OE','E')
    r = r.replace('EU','E')
    r = r.replace('AU','O')
    r = r.replace('OI','2')
    r = r.replace('OY','2')
    r = r.replace('OU','3')  

    # 11 replace the following letter groups
    # (SCH goes before CH; otherwise CH is consumed first and SCH never matches)
    r = r.replace('SCH','5')
    r = r.replace('CH','5')
    r = r.replace('SH','5')
    r = r.replace('SS','S')
    r = r.replace('SC','S')

    # 12 replace c with s when it is followed by e or i
    r = re.sub(r'C([E|I])',r'S\1',r)
  
    # 13 replace the following letters or letter groups
    # (QU goes before Q; otherwise Q is consumed first and QU never matches)
    r = r.replace('C','K')
    r = r.replace('QU','K')
    r = r.replace('Q','K')
    r = r.replace('GU','K')
    r = r.replace('GA','KA')
    r = r.replace('GO','KO')
    r = r.replace('GY','KY')

    # 14 replace the following letters:
    r = r.replace('A','O')
    r = r.replace('D','T')
    r = r.replace('P','T')
    r = r.replace('J','G')
    r = r.replace('B','F')
    r = r.replace('V','F')
    r = r.replace('M','N')
 
    # 15 remove duplicated letters
    oldc='#'
    newr=''
    for c in r:
        if oldc != c:
            newr=newr+c
        oldc=c
    r = newr

    # 16 remove the following endings: t, x
    r = re.sub(r'(.*)[T|X]$',r'\1', r)
   
    str2 = r[0]
    r = r.translate(TABLE, IGNORE)
    
    if not r:
        return "Z000"
        
    prev = r[0]
    for character in r[1:]:
        if character != prev and character != "0":
            str2 = str2 + character
        prev = character
        
    # pad with zeros
    str2 = str2+"0000"
    return str2[:4]

#-------------------------------------------------------------------------
#
# compare - compares the phonex values of two strings
#
#-------------------------------------------------------------------------

def compare(str1, str2):
    "1 if strings are close. 0 otherwise."
    return phonex_fr(str1) == phonex_fr(str2)
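
The module above is Python 2 code (`string.maketrans`, `unicode`, byte strings). As a hedged sketch only, here is how its normalisation step and its final Soundex-like steps might look in Python 3. The helper names `to_ascii_upper` and `encode_tail` are hypothetical and not part of Gramps, and the numbered French phonetic rules in between are deliberately left out.

```python
import unicodedata

# Same constants as the 2010 module; the backslash is escaped for Python 3.
IGNORE = "HW~!@#$%^&*()_+=-`[]\\|;:'/?.,<>\" \t\f\v"
# str.maketrans takes the characters to delete as its third argument.
TABLE = str.maketrans("ABCDEFGIJKLMNOPQRSTUVXYZ",
                      "012301202245501262301202",
                      IGNORE)


def to_ascii_upper(strval):
    """Upper-case, trim and strip diacritics, returning a plain str."""
    decomposed = unicodedata.normalize("NFKD", strval.upper().strip())
    return decomposed.encode("ascii", "ignore").decode("ascii")


def encode_tail(r):
    """Final steps of the module: keep the first character, then append codes."""
    if not r:
        return "Z000"
    str2 = r[0]
    r = r.translate(TABLE)
    if not r:
        return "Z000"
    prev = r[0]
    for character in r[1:]:
        if character != prev and character != "0":
            str2 += character
        prev = character
    # pad with zeros and keep four characters
    return (str2 + "0000")[:4]


print(encode_tail(to_ascii_upper(" Häntzler ")))
```

The main differences are that `str.maketrans` replaces `string.maketrans` and that the result of `encode(...)` has to be decoded back to `str` before any string replacement is applied.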

Not tested, but other contributions (or versions) exist, for example these complementary additions (since 2010):

# 17 Assign each letter its corresponding numeric code, starting from the last letter
    num = ['1', '2', '3', '4', '5', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'N', 'O', 'R', 'S', 'T', 'U', 'W', 'X', 'Y', 'Z']
    l = []
    for c in r:
        l.append(num.index(c))

# 18 Convert the numeric codes obtained into a base-22 number expressed in floating point.
    res = 0.
    i = 1
    for n in l:
        res = n * 22 ** -i + res
        i = i + 1
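
To see what steps 17 and 18 compute, here is a self-contained sketch that follows the loop as written (it walks the code from the first symbol, even though the comment mentions the last letter). The function name `base22_fraction` and the sample codes are made up for illustration.

```python
ALPHABET = "12345EFGHIKLNORSTUWXYZ"  # the 22 symbols of the snippet, same order


def base22_fraction(code):
    """Read `code` as the digits of a base-22 fraction 0.d1 d2 d3 ..."""
    res = 0.0
    for i, ch in enumerate(code, start=1):
        res += ALPHABET.index(ch) * 22.0 ** -i
    return res


a = base22_fraction("T1RS")
b = base22_fraction("T1RT")  # differs from a in the last symbol
c = base22_fraction("T2RS")  # differs from a in the second symbol

print(a < b < c)                 # ordering follows the symbol order
print(abs(a - b) < abs(a - c))   # a later difference moves the value less

# Two long codes that differ only in their 20th symbol
long_1 = base22_fraction("Z" * 19 + "1")
long_2 = base22_fraction("Z" * 19 + "2")
print(long_1 == long_2)          # double precision cannot separate them
```

So the conversion yields a sortable number in which codes sharing a prefix stay close together, but a double-precision float only carries about a dozen base-22 symbols before distinct codes collapse into the same value.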

Is this base-22, floating-point thing just for show, or does it really add something?

The instructions for the AI (online AI tools or a local tool?) are likely to get a bit long…

According to the AI:
1. What is this transformation for?

Objective: Convert a Phonex key (e.g., “D153”) into a unique number, which could enable:

  • Numerical comparison between keys (e.g., for sorting or grouping names).
  • A compact representation of keys as numbers, useful for certain algorithms or databases.

2. Positive points

  • Originality: This is a creative approach to exploit Phonex keys numerically.
  • Potential for sorting: If you want to sort names by phonetic similarity, this method could help order the keys in a continuous manner.

3. Points to consider

a. Relevance to the Phonex algorithm

Phonex is designed to be a discrete key (4 characters), not a number. The strength of Phonex lies in its simplicity and its ability to group similar names under the same key.
Converting to base-22 adds complexity without necessarily improving phonetic comparison. For example, two phonetically close keys (e.g., “D153” and “D154”) could have very distant numerical values, which does not help identify their similarity.

b. Practical issues

  • Loss of readability: A key like “D153” is easy to interpret. A base-22 number (e.g., 0.123456) is less intuitive.
  • Limited precision: Floating-point numbers have limited precision. For long keys, this could cause collisions or rounding errors.
  • Uselessness for comparison: Phonex is already designed to compare character strings. Adding a numerical step does not simplify comparison, since you can already compare the strings directly (e.g., if phonex_fr(name1) == phonex_fr(name2)).

c. Concrete example
Let’s take two Phonex keys:

  • “D153” → Converted to base-22, this gives a number X.
  • “D154” → Converted to base-22, this gives a number Y.
    Even if “D153” and “D154” are phonetically close, X and Y could be very different numerically, which does not reflect their similarity.

4. Cases where this could be useful

  • If you want to store keys compactly in a database (although 4 characters are already very compact).
  • If you want to experiment with numerical representations for a specific project (e.g., a custom clustering algorithm).

5. Simpler alternative
If your goal is to compare or sort Phonex keys, you can:

  • Compare them directly as character strings (this is already optimized for that).
  • Sort them alphabetically to group similar names.

Conclusion
This addition is not useful for the classic use of Phonex (phonetic comparison of names). It adds complexity without providing a clear benefit for comparison or search. However, if you have a very specific need (e.g., numerical representation for a custom algorithm), it could make sense — but then it would need to be well-documented and tested for its impact.

For the “old” code (2010) tested in Gramps and oriented toward ‘French’, it works; the AI validates it (phew, I’m lucky!).

Just:
Points to improve or verify

  1. Handling of special cases:

    • Some French sounds (like “gn” or “ill”) are not explicitly handled. For example, “champignon” and “champinon” might not have the same key, even though they are pronounced almost the same.
    • The letters “y” and “i” are treated the same way, but their pronunciation can vary depending on context.
  2. Performance:

    • The code uses regular expressions and loops, which is fine for occasional use. For large-scale processing (e.g., a genealogical database), optimization could be useful.

If I understand correctly, you don’t need a large AI to transcribe names, first names, and places.

Not necessarily lighter, as it requires several Python libraries locally (cv2, pytesseract). Here’s a trial of a version moving in an alternative direction:

import os
import cv2
import pytesseract
import re
from PIL import Image

# Make sure Tesseract is installed and the path is correctly configured
# pytesseract.pytesseract.tesseract_cmd = r'<path_to_tesseract_executable>'

def preprocess_image(image_path):
    """Preprocess an image to improve OCR accuracy."""
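    img = cv2.imread(image_path)
    if img is None:
        # cv2.imread returns None (it does not raise) when the file cannot be read
        raise ValueError(f"Could not read image: {image_path}")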
    gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    denoised_img = cv2.GaussianBlur(gray_img, (5, 5), 0)
    enhanced_img = cv2.equalizeHist(denoised_img)
    binary_img = cv2.adaptiveThreshold(enhanced_img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2)
    return binary_img

def extract_text(image_path):
    """Extract text from an image using Tesseract OCR."""
    processed_img = preprocess_image(image_path)
    config = '--psm 6'
    text = pytesseract.image_to_string(processed_img, config=config)
    return text

def parse_second_column(text):
    """Parse the extracted text to get values from the second column."""
    lines = text.split('\n')
    second_column_values = []
    for line in lines:
        if line.strip():
            columns = line.split()
            if len(columns) >= 2:
                second_column_values.append(columns[1])
    return second_column_values

def find_target_number(second_column_values, target_number):
    """Search for a target number in the list of second column values."""
    matches = [value for value in second_column_values if value == str(target_number)]
    return matches

def process_folder(folder_path, target_number):
    """Process all TIFF and JPEG files in a folder to search for a target number."""
    results = {}
    for filename in os.listdir(folder_path):
        if filename.lower().endswith(('.tif', '.tiff', '.jpeg', '.jpg')):
            image_path = os.path.join(folder_path, filename)
            try:
                text = extract_text(image_path)
                second_column_values = parse_second_column(text)
                matches = find_target_number(second_column_values, target_number)
                if matches:
                    results[filename] = matches
                else:
                    results[filename] = "No match found"
            except Exception as e:
                results[filename] = f"Error: {str(e)}"
    return results

# Example usage
folder_path = "path/to/your/images"
target_number = "75015"
results = process_folder(folder_path, target_number)
for filename, matches in results.items():
    print(f"File: {filename} → Matches: {matches}")

https://raw.githubusercontent.com/romjeromealt/AW2gramps/refs/heads/main/trouver_numéro.py

It’s limited. Even with the latest version of the Francophone dictionary (training), tesseract does not seem to provide a good character detection rate. During tests, I had to settle for between 10% and 15% on handwritten text. The advantage remains its quick usage (via command line) or through Python.

I admit that the blame is shared!
Asking a machine to find a number with a French dictionary is not a very good idea… There must be a numbers and digits dialect dictionary?
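
One thing that might be worth trying for number-only columns is Tesseract's `tessedit_char_whitelist` configuration variable, which restricts the characters the engine may output. A minimal sketch, assuming `pytesseract` and Tesseract are installed; the helper names are hypothetical, and whether the whitelist is honoured depends on the Tesseract version and engine mode in use:

```python
def digits_only_config(psm=6):
    """Build a Tesseract config string that restricts output to digits."""
    return f"--psm {psm} -c tessedit_char_whitelist=0123456789"


def extract_digits(image):
    """Run OCR on an image (path, PIL image or NumPy array) with the digit whitelist."""
    import pytesseract  # imported lazily so the config helper works without it
    return pytesseract.image_to_string(image, config=digits_only_config())


print(digits_only_config())
```

`extract_digits` could replace the `pytesseract.image_to_string` call in `extract_text` above when only the numeric column matters; I have not measured whether it improves the detection rate on handwriting.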

The new link seems to be this one.

In writing, but also in listening…

It seems we are already facing these types of tools, with their ‘machine-like’ phrasing and rhythm (for example, audio contact at Allianz insurance).

We could thus transcribe documents with fewer errors by “conversing” with these tools. A pre-transcription OCR, disseminated via audio and corrected by the human eye and brain.

The French version (male voice) in the model’s videos still has a nice British accent… For transcription and listening, this is not an issue (more composed and monotone); in fact, it’s more “general” than a Provençal or South-Western accent! :face_exhaling:

When we convert a French original document (with its errors) into text (OCR, AI), then into English audio, we almost correct in real time what our eyes read! Ultimately, it’s our brain via our ears that makes the connection, and in Gramps, we can link… an audio file as an alternative source to the paper document (photograph).

I should be able to understand the value of the massive census index between 1836 and 1936 (Industrial Revolution, economic statistics, etc.).

However, for genealogy and history, data from the late 18th century and the first quarter of the 19th century (post-Revolution and the Napoleonic eras) are missing. For Alsace-Moselle, there are also gaps and omissions between 1870 and 1920. Indeed, I imagine no provisions were made to integrate data written in German or names that were “Germanized.”

Other engine (toolset)

On the handwritten text, it is still not usable (OCR).

{
  "execution_time": "PT2.979285S",
  "results": [
    {
      "text": "NOMS",
      "confidence": 0.943,
      "polygon": [
        [
          453,
          20
        ],
        [
          481,
          19
        ],
        [
          481,
          28
        ],
        [
          453,
          28
        ]
      ]
    },
    {
      "text": "FOLIOS",
      "confidence": 0.758,
      "polygon": [
        [
          552,
          14
        ],
        [
          585,
          13
        ],
        [
          586,
          22
        ],
        [
          553,
          24
        ]
      ]
    },
   ...,
    {
      "text": "PRENOMS ET DEMEURES",
      "confidence": 0.626,
      "polygon": [
        [
          35,
          28
        ],
        [
          125,
          29
        ],
        [
          125,
          39
        ],
        [
          35,
          37
        ]
      ]
    },
    {
      "text": "FALNOMS ET DEMELHES",
      "confidence": 0.6946666666666667,
      "polygon": [
        [
          230,
          32
        ],
        [
          318,
          33
        ],
        [
          318,
          42
        ],
        [
          230,
          42
        ]
      ]
    },
    {
      "text": "FARNOMS KT DEMEURES",
      "confidence": 0.4036666666666667,
      "polygon": [
        [
          420,
          33
        ],
        [
          515,
          29
        ],
        [
          516,
          38
        ],
        [
          420,
          42
        ]
      ]
    },
    {
      "text": "MATRICE",
      "confidence": 0.568,
      "polygon": [
        [
          552,
          41
        ],
        [
          583,
          40
        ],
        [
          583,
          48
        ],
        [
          553,
          49
        ]
      ]
    },
    {
      "text": "HATAICE",
      "confidence": 0.607,
      "polygon": [
        [
          354,
          47
        ],
        [
          384,
          47
        ],
        [
          384,
          54
        ],
        [
          354,
          55
        ]
      ]
    },
    {
      "text": "LE PROPRIETAIKIA.",
      "confidence": 0.1705,
      "polygon": [
        [
          434,
          47
        ],
        [
          500,
          44
        ],
        [
          500,
          52
        ],
        [
          434,
          54
        ]
      ]
    },
    {
      "text": "Artegala yetin",
      "confidence": 0.129,
      "polygon": [
        [
          202,
          82
        ],
        [
          352,
          82
        ],
        [
          352,
          104
        ],
        [
          202,
          106
        ]
      ]
    },
    {
      "text": "Arbogail Joual Autorfurent",
      "confidence": 0.15866666666666665,
      "polygon": [
        [
          204,
          111
        ],
        [
          351,
          116
        ],
        [
          350,
          130
        ],
        [
          203,
          127
        ]
      ]
    },
   ...

OK, my example is not in a “lab” environment (resolution, drop shadow, strikethrough text, degraded format, multiple scripts and authors, etc.), yet there are still nearly 50% errors on the printed French text.

On the handwritten text (text and numbers), it’s closer to 99% errors… Most humans capable of reading this kind of text could achieve better results!
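
For what it is worth, the JSON above can at least be sorted into "probably usable" and "needs a human" by its `confidence` field. A minimal sketch using a few entries copied from the excerpt (polygons omitted); the function name `split_by_confidence` and the 0.5 threshold are arbitrary assumptions:

```python
import json

# A few entries copied from the output above, without the polygons.
SAMPLE = """
{
  "results": [
    {"text": "NOMS", "confidence": 0.943},
    {"text": "FOLIOS", "confidence": 0.758},
    {"text": "FALNOMS ET DEMELHES", "confidence": 0.6946666666666667},
    {"text": "LE PROPRIETAIKIA.", "confidence": 0.1705},
    {"text": "Artegala yetin", "confidence": 0.129}
  ]
}
"""


def split_by_confidence(payload, threshold=0.5):
    """Return (kept, flagged) lists of texts according to the confidence score."""
    kept, flagged = [], []
    for item in payload.get("results", []):
        target = kept if item.get("confidence", 0.0) >= threshold else flagged
        target.append(item["text"])
    return kept, flagged


kept, flagged = split_by_confidence(json.loads(SAMPLE))
print("kept:", kept)
print("flagged:", flagged)
```

Note that "FALNOMS ET DEMELHES" passes a 0.5 threshold although it is clearly misread, so the score alone is not enough to decide what can be trusted.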

The circle is complete: American archives help us trace certain difficult periods in European history[1]… For example, I better understand the scattering of the branches in Mulhouse, thanks to sources digitized thousands of kilometers away!

[1] The common ancestors (great-great-great-grandparents!) date back to the mid-19th century and are from another department (sector). Nevertheless, I admit that entering this type of source into our family tree, even without a known direct or close link, is not simple.

Ah, well, here’s something that aligns with my needs, without reinventing the wheel!

In recent years, projects such as EXO-POPP and SOCFACE have demonstrated the effectiveness of deep learning models for processing census records from the period 1836–1936. However, applying these methods to older sources presents specific challenges.

Their automated exploitation faces three major obstacles:

  • Evolution of script: several centuries of variations in handwritten writing styles;
  • Orthographic instability: a lack of standardization in surnames and place names;
  • Condition of preservation: physical degradation (stains, bleed-through ink) linked to the age of the documents.

Current automatic text recognition (ATR) models, often trained on more recent and standardized corpora, struggle to adapt to the high variability of these materials.

The DAI-CReTDHI-Record-ATR dataset was created to address these challenges. It comprises 7,720 handwritten parish and civil registration records from three departmental archive collections: Ardennes, Indre-et-Loire, and Charente-Maritime. The corpus covers a period spanning from the 16th to the 19th century.

Exploiting this dataset serves several strategic objectives for research in digital humanities.

  • Benchmarking: evaluating the performance of current ATR models and identifying their limitations on historical sources;
  • Model adaptation: training and optimizing new models capable of handling the complex spellings of the early modern period;
  • Dissemination: publishing these models as open source to provide the scientific community and genealogists with powerful tools for automated indexing.

At present, it is rather Claude (Anthropic) that seems “very comfortable with long documents [transcripts, corpora of records, research notes]”.

That doesn’t mean I’ll trust Claude (Anthropic) and train it on the DAI-CReTDHI-Record-ATR dataset, which, by the way, uses examples from Western, Central, and Eastern France (from the North). Indeed, even though Alsace/Moselle in the 19th century more or less used the codes, standards, and conventions of “the France of the interior” (i.e., beyond the Vosges), the handling of names, places, and other local dialect variations calls for disabling the filters built into commercial AI tools…

Okay, for formatting (automation, batch process & co.), it delivers error-free output and an impression of intelligence. However, regarding analysis, transcription, and transliteration, the recurring deviations are real problems in genealogy and for generalized use in Gramps.

Okay, the Mistral OCR 4 model provides something usable… at least for this test, since there is no proof that it wasn’t trained on this example during the past weeks!

I like to believe they did a good job, even though their confidence score should be based on the total number of characters present in the document, not just those the model attempted to transcribe. :wink:

response time: 15.20s
characters extracted: 6883
confidence score: 97.86% (minimum 18.08%)

NAMES, FIRST NAMES AND DOMAINS OF THE OWNERS. FOLIOS OF THE MATRIX. NAMES, FIRST NAMES AND DOMAINS OF THE OWNERS. FOLIOS OF THE MATRIX. NAMES, FIRST NAMES AND DOMAINS OF THE OWNERS. FOLIOS OF THE MATRIX.
Odiam, Gaspard and B. de 19 Arbogart, B. de 95 Bangraty, a. 122
Odiam, B. de 7 Arbogart, J. de 7 Bangraty, a. 14
Oimann, J. de 1 Acker, P. de 147 Bangraty, a. 27
Oimann, J. de 6 Acker, J. de 1149 Bangraty, a. 28
Orbogart 2 Acker, J. de 1161 Bangraty, a. 31
Orbogart, a. 11 Arbogart, M. de 1327 Bangraty, a. 69
Orbogart, a. 3 Arbogart, M. de 1328 Bangraty, a. 100
Orbogart, a. 4 Acker, M. de 1331 Bangraty, a. 149
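
To illustrate the remark about the confidence score: if the mean confidence only covers the characters the model attempted, it could be weighted by coverage. A minimal sketch, where the function name and the total of 8000 characters on the page are purely hypothetical (the real total is unknown); only 97.86% and 6883 come from the figures reported above.

```python
def coverage_adjusted_confidence(mean_confidence, chars_extracted, chars_in_document):
    """Scale the mean confidence by the share of the document actually transcribed."""
    if chars_in_document <= 0:
        raise ValueError("chars_in_document must be positive")
    coverage = min(chars_extracted / chars_in_document, 1.0)
    return mean_confidence * coverage


# Reported: 97.86% over 6883 extracted characters.
# Hypothetical: the page really contains 8000 characters.
adjusted = coverage_adjusted_confidence(0.9786, 6883, 8000)
print(f"{adjusted:.2%}")
```

Under that made-up total, the score drops to roughly 84%, which is the kind of gap the raw figure hides.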

Well, I smiled (‘‘almost laughed!’’) when I saw the fusion (‘‘mixup / mashup’’) between Mediterranean flexibility (Latin) and marketing

I admit I don’t know how they achieved that flexibility :thinking: but it seems “decent” in Eastern France (and Western Germany).

For the technicians, this video is more “raw” (‘‘even honest’’):