[Batch Import] csv2sql csv2json, or csv2xml

Hello.
I want to try to import a data set related to architecture (buildings) with some common records like a genealogical search (events, dates, places, sources, repositories, citations, notes, etc.).

Here some samples of the data set:

As the database is SQL, I thought it was possible to generate a simple “bridge”. But I do not have access to this SQL DB.
So, either a translator from csv format to xml, to json or to sql. This project is based on some core wikipedia modules, but I did not look if a convertor already exists and might be adapted.

Looking at current snapshots, the csv fields looks like:
Titre: is an address or a place
Adresse complète: complete (full) address
Adress: Address
Numéro de rue:Street number
Ville:City
Pays:Country
Coordonnées:geographical coordinates
Image principale:main image
Événement:Events with (date and type)
Personne:Individual or Company
Inscription:a specific mark like a classed (classified ?) monument
Langue:Lang
Source:Source or Repository
The Description will be the content with HTML marks or wikipedia syntax, including Source via ref tag with or without attribute.
e.g.,
<ref name="seyboth">{{source|Seyboth Das Alte Strassburg (Livre)}} - Seyboth\, Adolphe\, ''Das alte Strassburg\, vom 13. Jahrhundert bis zum Jahre 1870\; geschichtliche Topographie nach den Urkunden und Chroniken\, bearb. von Adolph Seyboth''\, Strasbourg\, J.H.E. Heitz (Heitz & Mündel)\,1890\, p.11</ref>

<gallery>
Fichier:....jpg
Fichier:...pdf
</gallery>

or

[[Fichier:...]]
[[Media:...]]

should be the image and media stuff into the new description field.

Currently, I just wonder which will be the simpliest (or more direct) way for importing such dataset into a Gramps DB?

Best regards,
Jérôme

As this might help some people (or AI) ! Here a working (basic, few parsing) script, which will generate a Gramps XML from the above data set into CSV file format. Based on 5.2 branch (sorry !). Once shared, I hope that OpenAI, Claude or whatever AI will generate many Gramps XML at a glance. :wink:

import csv
import re
from xml.dom.minidom import parseString
from xml.etree import ElementTree as ET
from datetime import datetime

def dms_to_decimal(dms):
    """Convertit les coordonnées DMS (ex. 48° 34' 34.00" N) en décimal."""
    match = re.match(r"(\d+)° (\d+)' ([\d\.]+)\" ([NSEW])", dms.strip())
    if match:
        degrees, minutes, seconds, direction = match.groups()
        decimal = float(degrees) + float(minutes)/60 + float(seconds)/3600
        if direction in ('S', 'W'):
            decimal *= -1
        return decimal
    return None

def parse_coords(coords_str):
    """Extrait et convertit les coordonnées DMS en décimal."""
    if not coords_str:
        return None, None
    parts = coords_str.split(',')
    if len(parts) == 2:
        lat = dms_to_decimal(parts[0].strip())
        lon = dms_to_decimal(parts[1].strip())
        return lat, lon
    return None, None

def csv_to_gramps_xml(csv_file_path, output_xml_file_path):
    with open(csv_file_path, mode='r', encoding='utf-8') as csv_file:
        csv_reader = csv.DictReader(csv_file)
        data = [row for row in csv_reader]

    # Créer la structure XML de base pour Gramps 5.2
    database = ET.Element('database', xmlns="http://gramps-project.org/xml/1.7.1/")

    # En-tête avec la version de Gramps
    header = ET.SubElement(database, 'header')
    ET.SubElement(
        header, 'created',
        date=datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        version="5.2.5"
    )
    ET.SubElement(header, 'researcher', name="Generated by script")

    # Sections principales (comme dans example.gramps)
    objects = ET.SubElement(database, 'objects')
    places = ET.SubElement(objects, 'places')
    notes = ET.SubElement(objects, 'notes')

    # Compteurs pour les handles (format Gramps : _xxxxxxxx)
    place_handle = 100000000
    note_handle = 400000000

    # Dictionnaire pour stocker les handles des notes
    note_handles = {}

    # Ajouter d'abord toutes les notes
    for row in data:
        if 'Description' in row and row['Description'].strip():
            note = ET.SubElement(
                notes, 'note',
                handle=f"_{note_handle}",
                change=str(int(datetime.now().timestamp())),
                id=f"N{note_handle}",
                type="Note"
            )
            text = ET.SubElement(note, 'text')
            text.text = row['Description'].strip()
            note_handles[row['Titre']] = f"_{note_handle}"
            note_handle += 1

    # Ajouter les lieux et référencer les notes
    for row in data:
        place = ET.SubElement(
            places, 'placeobj',
            handle=f"_{place_handle}",
            change=str(int(datetime.now().timestamp())),
            id=f"P{place_handle}",
            type="Place"
        )
        ptitle = ET.SubElement(place, 'ptitle')
        ptitle.text = row.get('Titre', 'Inconnu')

        # Nom du lieu
        pname = ET.SubElement(place, 'pname')
        pname.set('value', row.get('Titre', 'Inconnu'))

        # Coordonnées
        lat, lon = parse_coords(row.get('Coordonnées', ''))
        if lat and lon:
            coord = ET.SubElement(place, 'coord')
            coord.set('lat', f"{lat:.6f}")
            coord.set('long', f"{lon:.6f}")

        # Référence à la note si elle existe
        if row['Titre'] in note_handles:
            noteref = ET.SubElement(place, 'noteref')
            noteref.set('hlink', note_handles[row['Titre']])

        place_handle += 1

    # Convertir en chaîne XML
    xml_str = ET.tostring(database, encoding='utf-8').decode('utf-8')

    # Ajouter la déclaration DOCTYPE pour Gramps 5.2
    doctype_declaration = '''<!DOCTYPE database PUBLIC "-//Gramps//DTD Gramps XML 1.7.1//EN"
"http://gramps-project.org/xml/1.7.1/grampsxml.dtd">'''
    xml_str = f"{doctype_declaration}\n{xml_str}"

    # Formater le XML avec minidom
    dom = parseString(xml_str)

    # Écrire le fichier
    with open(output_xml_file_path, 'w', encoding='utf-8') as f:
        dom.writexml(f, indent='  ', addindent='  ', newl='\n', encoding='utf-8')

# Exemple d'utilisation
csv_to_gramps_xml('export.csv', 'output.gramps')

Note, Mistral AI, maybe with only few features, does not block any coding help or devel stuff. Feel free to improve, modify this first draft.

Maybe I will just add a wikipedia syntax support on Note (or at least set HTML type). Parsing might be generated from Note object. To enable internal handle links into Note (Media, Events [dates and places], Individuals/Companies) might be an other quick improvement.

So, as Note object becomes the top header into the objects hierarchy, managing records could be robust, safe and more easily checked on import process.

I did some experimentations in the past around custom Gramps XML file format (events and places databases). Tiny “light” squeleton for a valid Gramps XML is on the above script. Maybe just set 1.7.2 and (6.0.0) if you want a branch 6.0.X support only.
To set 5.2.0 will only generate a minor upgrade, as it will provide a more large support (no downgrade).

Great, there is a python module !

Note, I cannot export filtered records to Gramps XML file format anymore without at least one individual. It sounds like gedcom export and I cannot do something like a pure 1:1 events or places manager for objects unrelated to at least one person. Same type of limitations for reports (e.g., Narrative Web).

This is anyway a cosmetic issue once import becomes a success. Just need a new way for a file format and filtering (to export all records should be done via a Gramps XML backup).

I made a github repository for the script

ok, AW (for Archi Wiki) sounds like Automatic Weapon for some native english readers. Maybe think on an other automatic stuff. Automatic Way to Gramps sounds good for me.

A little bit experimental as wikipedia wrapper (parser) will generate a very slow process on the last version.

$ python3 AW2gramps.py export.csv output.gramps
Conversion de 13839 entrées...
Traitement des notes : 13839/13839 (100%)
Traitement des lieux : 13839/13839 (100%)

Very slow

$ python3 AW2gramps.py export.csv output.gramps --wikipedia
Conversion de 13839 entrées...
Traitement des notes : 677/13839 (4%)
...

I guess this might be used for displaying some contents or sections from gramps manuals too?

Still some minor issues around backreferences and objectreferences with a Note object as the top hierarchical object, also an other one around batch (CPU versus memory with python3) methods for gramps XML generation. It may be very fast on the first pass/check, then could be very slow[1] on data write for very large dataset (csv file).

For the conversion process, both are cosmetic issues.

$ wget https://www.archi-wiki.org/assets/export.csv
$ python3 AW2gramps.py export.csv output.gramps

Date handling might be improved, but I guess it is rather related to the MediaWiki (ArchiWiki) stuff. Same problem (format) on names for architects.

Currently, only Sources and Media are really linked (and backlinked) via the top Note (head/lead object). Events, Places and Persons might be also linked, but it becomes too complex for advanced dataset and might generate infinite loops on check process.

The Person-Event-Place group will be different than Source/Citation-Media-Note. Place and Events are linked, that was the most important for the data navigation into gramps.

[1] e.g., 8 minutes

Before going too far into customization (customisation) for Gramps XML output, some ‘last’ fixes are now making the data handling (conversion) more “Gramps” compliant, and more flexible for data navigation or any re-use.

Not able to secure full or complete internal uri set for some objects on Note, these lines have been removed, as we have references and backreferences anyway on each Note object.

By looking at HTML Note cleanup addon (copy some code), it sounded a little bit better (thank you Paul and Nick), but some issues on large notes seem to have an other source (no investigations, testcases or unit tests). I thought on some specific mistakes with len() and some characters or tags [e.g., ’{’ or ‘}’] and for avoiding too many extra hacking on this script, setting internal uri references for object handles has been ignored, despite working (but only sometimes without location issue on characters range).

Maybe last remaining issue is around Media handling and on Citations set. Both will be either created or listed during the Gramps XML import.
So, generated file will still have a non-blocking issue, fixed by Gramps on import. These sections are more flexibles (less strict), so I will not try to use a specific function or an extra check for them.

This project was related to one dataset (model). Maybe some generic code might be used for wrapping/parsing only HTML page or lists. The ‘XML’ step was, for me, a way to generate a flat and robust self-database with internal check before import. Sure, to only import csv data into gramps might be easier but as it was a large dataset with multiple tables, to build an alternate database (.gramps file) was my simpliest solution.

Ah, maybe I should also force to ignore bce stuff… Got some issues with characters like “-”.

I was able to get the history of this section:

but not certain that my traceback (5.2.x branch):

File "/usr/lib/python3/dist-packages/gramps/plugins/importer/importxml.py", line 1062, in parse
    self.p.ParseFile(ifile)
  File "../Modules/pyexpat.c", line 419, in StartElement
  File "/usr/lib/python3/dist-packages/gramps/plugins/importer/importxml.py", line 3352, in startElement
    f(attrs)
  File "/usr/lib/python3/dist-packages/gramps/plugins/importer/importxml.py", line 2646, in start_dateval
    if val[0] == "-":
IndexError: string index out of range

is a “real” (as normal usage) issue or only a wrong value set as a date on my custom Gramps XML (51,3Mo). I should be able to add some print statements or a fallback as workaround. Unit tests might be too serious (i.e., complicated) for me.

Does it just mean that this list might sometimes be empty?

val = []
$ grep '<dateval type="Span" val=""' output.gramps
      <dateval type="Span" val=""/>
      <dateval type="Span" val=""/>
      <dateval type="Span" val=""/>
$ gramps -d ""
2026-03-25 17:11:46.273: DEBUG: sqlite.py: line 135: ('SELECT blob_data FROM event WHERE handle = ?', ['700000010'])
2026-03-25 17:11:46.273: DEBUG: sqlite.py: line 135: ('UPDATE event SET blob_data = ? WHERE handle = ?', [b'\x80\x03(X\t\x00\x00\x00700000010q\x00X\x05\x00\x00\x00E0034q\x01K\x00X\x0c\x00\x00\x00constructionq\x02\x86q\x03(K\x00K\x06K\x00(K\x00K\x00K\x00\x89tq\x04X\'\x00\x00\x00<dateval type="Span" val="0000-00-00"/>q\x05K\x00K\x00tq\x06X\x11\x00\x00\x00Inconnu (Inconnu)q\x07X\x00\x00\x00\x00q\x08]q\t]q\n]q\x0b]q\x0cJ\xc7\xfe\xc3i]q\r\x89tq\x0e.', '700000010'])
2026-03-25 17:11:46.273: DEBUG: sqlite.py: line 135: ('UPDATE event SET handle = ?, gramps_id = ?, description = ?, place = ?, change = ?, private = ? where handle = ?', ['700000010', 'E0034', 'Inconnu (Inconnu)', '', 1774452423, 0, '700000010'])
2026-03-25 17:11:46.273: DEBUG: sqlite.py: line 135: ('DELETE FROM reference WHERE obj_handle = ?', ['700000010'])
2026-03-25 17:11:46.274: DEBUG: sqlite.py: line 135: ('SELECT 1 FROM event WHERE handle = ?', ['700000011'])
2026-03-25 17:11:46.274: DEBUG: sqlite.py: line 135: ('SELECT 1 FROM event WHERE handle = ?', ['700000011'])
2026-03-25 17:11:46.274: DEBUG: sqlite.py: line 135: ('INSERT INTO event (handle, blob_data) VALUES (?, ?)', ['700000011', b'\x80\x03(X\t\x00\x00\x00700000011q\x00X\x12\x00\x00\x000.9054461098245229q\x01K\x0cX\x00\x00\x00\x00q\x02\x86q\x03Nh\x02h\x02]q\x04]q\x05]q\x06]q\x07JB\t\xc4i]q\x08\x89tq\t.'])
2026-03-25 17:11:46.274: DEBUG: sqlite.py: line 135: ('UPDATE event SET handle = ?, gramps_id = ?, description = ?, place = ?, change = ?, private = ? where handle = ?', ['700000011', '0.9054461098245229', '', '', 1774455106, 0, '700000011'])
2026-03-25 17:11:46.274: DEBUG: sqlite.py: line 135: ('DELETE FROM reference WHERE obj_handle = ?', ['700000011'])
2026-03-25 17:11:46.274: DEBUG: sqlite.py: line 135: ('SELECT 1 FROM event WHERE gramps_id = ?', ['E0035'])
2026-03-25 17:11:46.274: DEBUG: sqlite.py: line 171: ROLLBACK;
2026-03-25 17:11:46.274: ERROR: dbloader.py: line 548: Failed to import database.
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/gramps/gui/dbloader.py", line 531, in do_import
    dbstate=self.dbstate,
$ grep -n5 "700000011" output.gramps
66-    <event change="1774452423" handle="_700000010">
67-      <type>construction</type>
68-      <dateval type="Span" val="0000-00-00"/>
69-      <description>Inconnu (Inconnu)</description>
70-    </event>
71:    <event change="1774452423" handle="_700000011">
72-      <type>Construction</type>
73-      <dateval type="Span" val=""/>
74-      <description>Inconnu (Inconnu)</description>
75-      <place hlink="_100000005"/>
76-    </event>

Ok, I see one unit test on master branch for the span date.

but I do not know if it is possible to set an empty date with span attribute on date edition into gramps (or gramps-web).