[Batch Import] csv2sql, csv2json, or csv2xml

Hello.
I want to import a data set related to architecture (buildings) that contains records similar to those in genealogical research (events, dates, places, sources, repositories, citations, notes, etc.).

Here are some samples of the data set:

As the database is SQL, I thought it was possible to generate a simple “bridge”, but I do not have access to this SQL DB.
So the alternative is a translator from CSV to XML, JSON, or SQL. This project is based on some core wikipedia modules, but I did not check whether a converter already exists that might be adapted.

Looking at the current snapshots, the CSV fields look like:
Titre: an address or a place
Adresse complète: complete (full) address
Adress: address
Numéro de rue: street number
Ville: city
Pays: country
Coordonnées: geographical coordinates
Image principale: main image
Événement: events (with date and type)
Personne: individual or company
Inscription: a specific mark, like a classed (classified?) monument
Langue: language
Source: source or repository
The Description field holds the content, with HTML markup or wikipedia syntax, including sources via a ref tag with or without attributes.
e.g.,
<ref name="seyboth">{{source|Seyboth Das Alte Strassburg (Livre)}} - Seyboth\, Adolphe\, ''Das alte Strassburg\, vom 13. Jahrhundert bis zum Jahre 1870\; geschichtliche Topographie nach den Urkunden und Chroniken\, bearb. von Adolph Seyboth''\, Strasbourg\, J.H.E. Heitz (Heitz & Mündel)\,1890\, p.11</ref>

<gallery>
Fichier:....jpg
Fichier:...pdf
</gallery>

or

[[Fichier:...]]
[[Media:...]]

The image and media references should go into the new description field.
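For illustration, here is a hedged sketch of how the `{{source|...}}` template and the `ref` name could be extracted from a Description value with regular expressions (the patterns and the sample string are my assumptions, not part of the dataset spec):

```python
import re

# Hypothetical Description fragment in the style of the ref example above.
description = (
    '<ref name="seyboth">{{source|Seyboth Das Alte Strassburg (Livre)}}'
    " - Seyboth, Adolphe, ''Das alte Strassburg'', Strasbourg, 1890, p.11</ref>"
)

# Capture the ref name attribute (if present) and the source template value.
ref_name = re.search(r'<ref\s+name="([^"]+)"', description)
source = re.search(r'\{\{source\|([^}|]+)\}\}', description)

print(ref_name.group(1))  # seyboth
print(source.group(1))    # Seyboth Das Alte Strassburg (Livre)
```

A real parser would have to cope with unnamed refs and escaped separators, but this shows the general shape of the extraction.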

Currently, I just wonder what would be the simplest (or most direct) way to import such a dataset into a Gramps DB?

Best regards,
Jérôme

As this might help some people (or AIs)! Here is a working (basic, with minimal parsing) script that generates Gramps XML from the above data set in CSV file format. It is based on the 5.2 branch (sorry!). Once shared, I hope that OpenAI, Claude, or whatever AI will generate many Gramps XML files at a glance. :wink:

import csv
import re
from xml.dom.minidom import parseString
from xml.etree import ElementTree as ET
from datetime import datetime

def dms_to_decimal(dms):
    """Convertit les coordonnées DMS (ex. 48° 34' 34.00" N) en décimal."""
    match = re.match(r"(\d+)° (\d+)' ([\d\.]+)\" ([NSEW])", dms.strip())
    if match:
        degrees, minutes, seconds, direction = match.groups()
        decimal = float(degrees) + float(minutes)/60 + float(seconds)/3600
        if direction in ('S', 'W'):
            decimal *= -1
        return decimal
    return None

def parse_coords(coords_str):
    """Extrait et convertit les coordonnées DMS en décimal."""
    if not coords_str:
        return None, None
    parts = coords_str.split(',')
    if len(parts) == 2:
        lat = dms_to_decimal(parts[0].strip())
        lon = dms_to_decimal(parts[1].strip())
        return lat, lon
    return None, None

def csv_to_gramps_xml(csv_file_path, output_xml_file_path):
    with open(csv_file_path, mode='r', encoding='utf-8') as csv_file:
        csv_reader = csv.DictReader(csv_file)
        data = [row for row in csv_reader]

    # Create the basic XML structure for Gramps 5.2
    database = ET.Element('database', xmlns="http://gramps-project.org/xml/1.7.1/")

    # Header with the Gramps version
    header = ET.SubElement(database, 'header')
    ET.SubElement(
        header, 'created',
        date=datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        version="5.2.5"
    )
    ET.SubElement(header, 'researcher', name="Generated by script")

    # Main sections (as in example.gramps)
    objects = ET.SubElement(database, 'objects')
    places = ET.SubElement(objects, 'places')
    notes = ET.SubElement(objects, 'notes')

    # Counters for handles (Gramps format: _xxxxxxxx)
    place_handle = 100000000
    note_handle = 400000000

    # Dictionary to store the note handles
    note_handles = {}

    # Add all notes first
    for row in data:
        if 'Description' in row and row['Description'].strip():
            note = ET.SubElement(
                notes, 'note',
                handle=f"_{note_handle}",
                change=str(int(datetime.now().timestamp())),
                id=f"N{note_handle}",
                type="Note"
            )
            text = ET.SubElement(note, 'text')
            text.text = row['Description'].strip()
            note_handles[row['Titre']] = f"_{note_handle}"
            note_handle += 1

    # Add the places and reference the notes
    for row in data:
        place = ET.SubElement(
            places, 'placeobj',
            handle=f"_{place_handle}",
            change=str(int(datetime.now().timestamp())),
            id=f"P{place_handle}",
            type="Place"
        )
        ptitle = ET.SubElement(place, 'ptitle')
        ptitle.text = row.get('Titre', 'Inconnu')

        # Place name
        pname = ET.SubElement(place, 'pname')
        pname.set('value', row.get('Titre', 'Inconnu'))

        # Coordinates (compare against None so that 0.0 is kept)
        lat, lon = parse_coords(row.get('Coordonnées', ''))
        if lat is not None and lon is not None:
            coord = ET.SubElement(place, 'coord')
            coord.set('lat', f"{lat:.6f}")
            coord.set('long', f"{lon:.6f}")

        # Reference the note if one exists for this title
        if row.get('Titre') in note_handles:
            noteref = ET.SubElement(place, 'noteref')
            noteref.set('hlink', note_handles[row['Titre']])

        place_handle += 1

    # Serialize to an XML string without the XML declaration, so that
    # the DOCTYPE can be prepended and the result stays parseable
    xml_str = ET.tostring(database, encoding='unicode')

    # Prepend the DOCTYPE declaration for Gramps 5.2
    doctype_declaration = '''<!DOCTYPE database PUBLIC "-//Gramps//DTD Gramps XML 1.7.1//EN"
"http://gramps-project.org/xml/1.7.1/grampsxml.dtd">'''
    xml_str = f"{doctype_declaration}\n{xml_str}"

    # Pretty-print the XML with minidom
    dom = parseString(xml_str)

    # Write the output file
    with open(output_xml_file_path, 'w', encoding='utf-8') as f:
        dom.writexml(f, indent='  ', addindent='  ', newl='\n', encoding='utf-8')

# Example usage
csv_to_gramps_xml('export.csv', 'output.gramps')
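As a quick sanity check, the DMS helper can be exercised on a Strasbourg-style coordinate (the function is re-stated here so the snippet is self-contained):

```python
import re

def dms_to_decimal(dms):
    """Convert DMS coordinates (e.g. 48° 34' 34.00" N) to decimal degrees."""
    match = re.match(r"(\d+)° (\d+)' ([\d\.]+)\" ([NSEW])", dms.strip())
    if match:
        degrees, minutes, seconds, direction = match.groups()
        decimal = float(degrees) + float(minutes) / 60 + float(seconds) / 3600
        if direction in ('S', 'W'):
            decimal *= -1
        return decimal
    return None

lat = dms_to_decimal('48° 34\' 34.00" N')
print(round(lat, 6))  # 48.576111
```

Values that do not match the pattern (empty cells, already-decimal coordinates) simply return `None` and are skipped by the caller.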

Note: Mistral AI, maybe with only a few features, does not block any coding help or development stuff. Feel free to improve or modify this first draft.


Maybe I will just add wikipedia syntax support on Note (or at least set the HTML type). Parsing might be driven from the Note object. Enabling internal handle links inside Notes (Media, Events [dates and places], Individuals/Companies) might be another quick improvement.

So, with the Note object as the top header of the objects hierarchy, managing records could be robust, safe, and more easily checked during the import process.
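If I read the Gramps XML format correctly, internal links inside a note are expressed as a `style` element named `link` whose value is a `gramps://` URL, with `range` children marking the linked characters. A minimal sketch (the media handle and text are made up):

```python
from xml.etree import ElementTree as ET

# A note whose text links to a (hypothetical) media object.
note = ET.Element('note', handle='_400000000', id='N0001', type='Note')
text = ET.SubElement(note, 'text')
text.text = 'See the main image'

# Link characters 8..18 ("main image") to the media object's handle.
style = ET.SubElement(note, 'style', name='link',
                      value='gramps://Media/handle/_abcdef123456')
ET.SubElement(style, 'range', start='8', end='18')

print(ET.tostring(note, encoding='unicode'))
```

Treat this as a sketch to be checked against example.gramps rather than a definitive reading of the DTD.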

I did some experiments in the past with custom Gramps XML file formats (events and places databases). A tiny “light” skeleton for a valid Gramps XML file is in the above script. Maybe just set 1.7.2 (and 6.0.0) if you want to support the 6.0.x branch only.
Setting 5.2.0 will only trigger a minor upgrade, as it provides broader support (no downgrade).
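The version switch mentioned above could be kept in one place; a sketch, assuming the 1.7.1/5.2 pairing used in the script and the 1.7.2/6.0.0 pairing from this discussion:

```python
# Gramps XML versions per target branch (per the discussion above:
# 1.7.1 targets the 5.2 branch, 1.7.2 targets the 6.0.x branch).
XML_VERSIONS = {
    '5.2.0': '1.7.1',
    '6.0.0': '1.7.2',
}

def header_strings(gramps_version):
    """Return the xmlns URL and DOCTYPE line for a target Gramps version."""
    xml_version = XML_VERSIONS[gramps_version]
    xmlns = f'http://gramps-project.org/xml/{xml_version}/'
    doctype = (f'<!DOCTYPE database PUBLIC '
               f'"-//Gramps//DTD Gramps XML {xml_version}//EN"\n'
               f'"{xmlns}grampsxml.dtd">')
    return xmlns, doctype

xmlns, doctype = header_strings('5.2.0')
print(xmlns)  # http://gramps-project.org/xml/1.7.1/
```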

Great, there is a Python module!


Note, I can no longer export filtered records to the Gramps XML file format without at least one individual. It sounds like the GEDCOM export: I cannot build something like a pure 1:1 events or places manager for objects unrelated to at least one person. The same type of limitation applies to reports (e.g., Narrative Web).

This is anyway a cosmetic issue once the import succeeds. It just needs a new approach to the file format and filtering (exporting all records should be done via a Gramps XML backup).

I made a GitHub repository for the script.

OK, AW (for Archi Wiki) sounds like “automatic weapon” to some native English readers. Maybe think of another “automatic” expansion. “Automatic Way to Gramps” sounds good to me.


A little bit experimental: the wikipedia wrapper (parser) makes the process very slow in the latest version.

$ python3 AW2gramps.py export.csv output.gramps
Conversion de 13839 entrées...
Traitement des notes : 13839/13839 (100%)
Traitement des lieux : 13839/13839 (100%)

Very slow

$ python3 AW2gramps.py export.csv output.gramps --wikipedia
Conversion de 13839 entrées...
Traitement des notes : 677/13839 (4%)
...

I guess this might also be used for displaying some contents or sections from the Gramps manuals?


There are still some minor issues around backreferences and object references with a Note object as the top hierarchical object, plus another one around batch methods (CPU versus memory trade-offs with python3) for Gramps XML generation. It may be very fast on the first pass/check, then very slow[1] on data writes for a very large dataset (CSV file).

For the conversion process, both are cosmetic issues.
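One way to sidestep the slow write phase might be to stream each element to the output file as it is built, instead of handing one huge tree to minidom for pretty-printing at the end. A rough sketch (structure heavily simplified; this is not the actual AW2gramps.py code):

```python
from xml.etree import ElementTree as ET

def write_streaming(rows, path):
    """Write placeobj elements one by one instead of building one big tree."""
    with open(path, 'w', encoding='utf-8') as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<database xmlns="http://gramps-project.org/xml/1.7.1/">\n')
        f.write('  <places>\n')
        for i, row in enumerate(rows):
            place = ET.Element('placeobj', handle=f'_{100000000 + i}',
                               id=f'P{100000000 + i}', type='Place')
            ET.SubElement(place, 'ptitle').text = row.get('Titre', 'Unknown')
            # Serialize and flush each element immediately; memory stays flat.
            f.write('    ' + ET.tostring(place, encoding='unicode') + '\n')
        f.write('  </places>\n')
        f.write('</database>\n')

write_streaming([{'Titre': 'Place Kléber'}], 'streamed.gramps')
```

The trade-off is losing minidom's pretty indentation inside each element, which Gramps does not care about on import.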

$ wget https://www.archi-wiki.org/assets/export.csv
$ python3 AW2gramps.py export.csv output.gramps

Date handling might be improved, but I guess that is rather related to the MediaWiki (ArchiWiki) side. The same (format) problem exists with the names of architects.

Currently, only Sources and Media are really linked (and backlinked) via the top Note (head/lead object). Events, Places, and Persons might also be linked, but that becomes too complex for advanced datasets and might generate infinite loops during the check process.

The Person-Event-Place group will be different from Source/Citation-Media-Note. Places and Events are linked, which was the most important thing for data navigation in Gramps.
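For the Place-Event link, if I read the Gramps DTD correctly, an event points at its place through a `place` child carrying an `hlink` attribute. A minimal sketch with made-up handles and event type:

```python
from xml.etree import ElementTree as ET

# Sketch: an event linked to a place by handle (all values hypothetical).
event = ET.Element('event', handle='_200000001', id='E0001')
ET.SubElement(event, 'type').text = 'Construction'
ET.SubElement(event, 'dateval', val='1890')
ET.SubElement(event, 'place', hlink='_100000001')

print(ET.tostring(event, encoding='unicode'))
```

Gramps then builds the backreference from the place to the event automatically on import, which is what makes the Place-Event navigation work.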

[1] e.g., 8 minutes

Before going too far into customization of the Gramps XML export, some ‘last’ fixes now make the data handling (conversion) more “Gramps” compliant, and more flexible for data navigation or any re-use.