Lowest common denominator: XML schema in .gramps files?

Would it be viable to add a feature to represent the XML at its lowest common denominator (lcd) schema?

Maybe as a post-process validator tool that can “discount” the XML schema version as appropriate?

And to make the XML importer only complain about XML schemas that are higher than it recognizes? So it starts accepting if a newer version of Gramps wrote the file… so long as the XML schema is recognizable?
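
One way to picture that version gate, as a rough Python sketch. It assumes the schema version can be read from the root namespace URI (the `http://gramps-project.org/xml/1.7.2/` form that .gramps files declare). `HIGHEST_SUPPORTED`, `schema_version` and `importer_accepts` are made-up names for illustration, not Gramps internals.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical: the newest schema version this importer understands.
HIGHEST_SUPPORTED = (1, 7, 2)

# Captures the version part of the Gramps XML namespace on the root tag.
NS_PATTERN = re.compile(r"\{http://gramps-project\.org/xml/(\d+(?:\.\d+)*)/\}")


def schema_version(xml_text):
    """Return the schema version from the root namespace as a tuple of ints."""
    root = ET.fromstring(xml_text)
    match = NS_PATTERN.match(root.tag)
    if match is None:
        raise ValueError("no Gramps XML namespace on the root element")
    return tuple(int(part) for part in match.group(1).split("."))


def importer_accepts(xml_text, highest=HIGHEST_SUPPORTED):
    """Accept any file whose schema version is not newer than `highest`."""
    return schema_version(xml_text) <= highest
```

This only decides whether to attempt the import. It says nothing about whether the content of a newer file would survive the trip.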

And perhaps offer an addon XML importer option that loads what it can but logs the unrecognized chunks? (Similar to how the GEDCOM importer gracefully sidesteps unrecognizable custom tags.)
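
A minimal sketch of the "log what you don't recognize" part, using only the Python standard library. The `KNOWN_TAGS` set is a tiny invented subset for illustration. A real tool would have to derive the list from the DTD or RNG of the target schema version.

```python
import xml.etree.ElementTree as ET

# Hypothetical subset of element names the importer knows how to handle.
KNOWN_TAGS = {"database", "header", "people", "person", "name"}


def unknown_elements(xml_text, known=KNOWN_TAGS):
    """Return the local names of unrecognized elements, in document order."""
    unknown = []
    for elem in ET.fromstring(xml_text).iter():
        # Drop the "{namespace}" part that ElementTree puts in front of tags.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name not in known:
            unknown.append(local_name)
    return unknown
```

An importer built this way could write that list to a log and skip those chunks rather than refusing the whole file.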

The XML importer should be capable of importing any previous version without data loss.

True. But say that you were to create a simple Tree in Gramps 6.0.5, with content that has been in .gramps since the beginning, such as a person and their parents with birth and death dates but no Place.

And you export that tree to share as example data. That example can only be used by Gramps 6.0 users. The Windows 32-bit users limited to using Gramps 5.1.5 cannot use that example file.

If I generate a .gramps example file for posting here on Discourse, it should be loadable by users who have delayed upgrading to the current release… or ‘loadable’ so long as the file is not illustrating a newer schema feature.

I’d have to export to GEDCOM for that to be broadly usable. Or tweak the header data in the XML text file, changing the schema version from 1.7.2 to 1.7.0 and the Gramps version to 5.

Yet the import works perfectly fine for that legacy data after tweaking the versions.
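
For anyone curious, the schema-version half of that hand edit could be scripted roughly like this. It is a sketch that assumes the file is already uncompressed XML text and that the version appears only in the DOCTYPE and the root namespace. The function name is invented, and the Gramps program version inside the header is left alone.

```python
def retarget_schema_version(xml_text, old="1.7.2", new="1.7.0"):
    """Rewrite the schema version in the prolog and root tag only."""
    # Everything before <header>: XML declaration, DOCTYPE and root start tag.
    head, marker, rest = xml_text.partition("<header")
    if not marker:
        raise ValueError("no <header> element found")
    # Public identifier in the DOCTYPE.
    head = head.replace(f"DTD Gramps XML {old}//EN", f"DTD Gramps XML {new}//EN")
    # DTD URL and xmlns URI both contain "/xml/<version>/".
    head = head.replace(f"/xml/{old}/", f"/xml/{new}/")
    # The body of the file is passed through untouched.
    return head + marker + rest
```

It only relabels the file. It does nothing to check that the content actually fits the older schema.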

Importing newer XML schema versions into old Gramps versions that were written for older XML schema versions is not supported.

Depending on the data and the versions used, it may work or data may be lost or corrupted. This is why we prevent the import of future versions.

The reply by @romjerome about some XML citations in a file converted from GEDCOM to .gramps by the @DavidMStraub conversion has helped refine the ‘ask’.

It is too easy to miss an advanced Gramps XML data element when doing hand inspections.

So I guess that I am asking about workflows and tools to validate an XML file against DTD versions.

The validations I might do are:

  1. against the DTD referenced in the XML file.
  2. against the DTDs from each Gramps release that broke backwards compatibility.
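
The second kind of check could look something like the sketch below, which tries a document against a series of DTD versions and reports the oldest one it satisfies. It assumes the third-party lxml library is installed. The two DTDs in `TOY_DTDS` are invented stand-ins, not the real Gramps DTDs, which would be loaded from files instead.

```python
from io import StringIO

from lxml import etree

# Invented stand-ins for successive DTD releases (NOT the real Gramps DTDs).
TOY_DTDS = {
    "1.0": "<!ELEMENT database (person*)> <!ELEMENT person (#PCDATA)>",
    "1.1": (
        "<!ELEMENT database (person*)> <!ELEMENT person (#PCDATA)> "
        "<!ATTLIST person nick CDATA #IMPLIED>"
    ),
}


def version_key(version):
    """Sort '1.10' after '1.9' by comparing numeric parts."""
    return tuple(int(part) for part in version.split("."))


def oldest_valid_version(xml_text, dtds=TOY_DTDS):
    """Return the oldest DTD version the document validates against, or None."""
    root = etree.fromstring(xml_text)
    for version in sorted(dtds, key=version_key):
        dtd = etree.DTD(StringIO(dtds[version]))
        if dtd.validate(root):
            return version
    return None
```

With real DTDs, the result would be the "lowest common denominator" version that a given export could honestly claim.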

I asked perplexity. And it suggested:

There are several Python tools available for validating an XML file against a DTD and identifying non-compliant chunks of data. The most notable tool is the widely used Python library lxml, which provides robust support for DTD validation. Here are key points:

  • lxml’s etree.DTD class allows loading a DTD file and then validating an XML document against it. The validation result is a boolean indicating compliance.
  • When validation fails, lxml provides an error log (dtd.error_log) that includes detailed error messages pinpointing the exact locations and reasons for non-compliance within the XML document.
  • You can validate XML files that either contain a DOCTYPE referencing the DTD or use an external DTD file passed explicitly to the lxml.DTD constructor.
  • Example usage involves parsing the XML document with lxml, loading the DTD, calling dtd.validate(xml_tree), and then analyzing dtd.error_log for compliance issues.
  • This method is practical for batch validation and programmatic inspection of XML correctness against DTDs.
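
To make those bullet points concrete, here is a small sketch of the lxml calls they describe. It assumes lxml is installed. The DTD is a toy one invented for the example and `dtd_errors` is just a helper name. With a real file you would pass the path of the Gramps DTD to `etree.DTD` instead of a `StringIO`.

```python
from io import StringIO

from lxml import etree

# Toy DTD invented for illustration (not the real Gramps DTD).
TOY_DTD = """
<!ELEMENT database (header, people?)>
<!ELEMENT header (#PCDATA)>
<!ELEMENT people (person*)>
<!ELEMENT person (#PCDATA)>
<!ATTLIST person handle CDATA #REQUIRED>
"""


def dtd_errors(xml_text, dtd_text=TOY_DTD):
    """Return (line, message) pairs; an empty list means the document is valid."""
    dtd = etree.DTD(StringIO(dtd_text))
    root = etree.fromstring(xml_text)
    if dtd.validate(root):
        return []
    # Each log entry records where and why validation failed.
    return [(entry.line, entry.message) for entry in dtd.error_log]


if __name__ == "__main__":
    bad = "<database><header>h</header><people><person>P</person></people></database>"
    for line, message in dtd_errors(bad):
        print(f"line {line}: {message}")
```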

Other Python tools or web applications also exist that leverage lxml or similar libraries, but lxml is the most mature, widely recommended, and feature-rich option for the purpose of validating XML files against a DTD with error detail reporting. [github +3]

If a more interactive or GUI approach is preferred, professional XML editors like oXygen or XMLSpy offer XML editing and DTD validation features, but for scriptable and automated workflows, lxml in Python is ideal. [linkedin]

  1. GitHub - Abdellah-belcaid/XML-Operations: The XML Validator and Transformer is a Python-based web app that validates XML, transforms XML to JSON/HTML via XSLT, and converts XSD to DTD using Perl. Requires Python, Django, lxml, and xmltodict. A powerful tool for XML data manipulation.
  2. https://stackoverflow.com/questions/15798/how-do-i-validate-xml-against-a-dtd-file-in-python
  3. Validation with lxml
  4. Use lxml and Python3 to validate XML against a DTD.The DTD can be specified on the command line, or as an optional parameter to the script. · GitHub
  5. Understanding the Power of XML and XML DTD in Data Science: A Hands-On Guide
  6. How do XML developers validate XML documents? | MoldStud
  7. Checking XML Well-Formedness - Python Cookbook [Book]
  8. https://validatexml.com
  9. XML Validator
  10. Generating DTD | PyCharm Documentation
  11. A Roadmap to XML Parsers in Python – Real Python
  12. How to Parse XML in Python? Multiple Methods Covered
  13. https://www.xmlvalidation.com

I am also wondering if there is a DTD validation test of the example.gramps XML file written by new releases of Gramps. Do we verify that Gramps is always writing valid .gramps XML?

@dsblank has previously asked for suggestions of things that should be in the TestPlan.

I use the xmllint command to test our schemas when they are updated.

For example:

xmllint --noout --relaxng grampsxml.rng example.gramps

If you try to validate our current example database using the 1.7.1 schema, it will fail with errors even if you update the namespace.


Perplexity breakdown of your reply. (objective is a wiki article)

The “update the namespace” caveat Nick mentioned refers to the importance of ensuring that the XML document’s declared namespace matches the namespace declared in the schema (such as the RelaxNG schema file) used for validation.

Namespaces in XML serve as unique identifiers to distinguish elements and attributes that may have the same name but different meanings across vocabularies. When validating an XML file against a schema, the validator checks that elements and attributes correctly belong to the namespace defined in the schema.

If the XML document uses an outdated or incorrect namespace URI that does not correspond exactly to the one expected by the schema, validation will fail even if the XML elements structurally conform to the schema’s rules. (The prefix itself does not matter, only the URI it is bound to.) Thus, updating the namespace in the XML to align it with the schema’s target namespace is essential for successful validation.

In the case of the Gramps example, matching the namespace is necessary but not sufficient. As Nick notes, validating the current example file against the 1.7.1 schema still fails after the namespace is updated, which suggests the file also contains structure that the older schema does not allow. Without the namespace update, xmllint or any validator would reject every element, because none of them would be recognized as belonging to the schema’s namespace.

In summary:

  • XML namespaces act like “addresses” for elements/attributes.
  • Validation requires the XML instance’s namespaces to match the schema’s namespaces.
  • Updating the namespace URI in the XML to match the schema’s namespace URI is necessary when schemas evolve.
  • Ignoring this step results in validation failure despite correct structural conformity.
  • This ensures that both schema and instance document share a common “language” for element names, enabling meaningful validation.
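
The namespace point can be shown with a tiny Relax NG check in Python. This is a sketch that assumes lxml is installed. The schema and the `example.org` namespace URIs are invented for the demonstration and have nothing to do with the real grampsxml.rng.

```python
from lxml import etree

# Toy Relax NG schema: a <database> holding one <header>, both in one namespace.
TOY_RNG = """
<element name="database" ns="http://example.org/toy/1.1/"
         xmlns="http://relaxng.org/ns/structure/1.0">
  <element name="header"><text/></element>
</element>
"""


def is_valid(xml_text, rng_text=TOY_RNG):
    """Validate xml_text against the Relax NG schema given as a string."""
    schema = etree.RelaxNG(etree.fromstring(rng_text))
    return schema.validate(etree.fromstring(xml_text))


# Same structure in both documents; only the namespace URI differs.
MATCHING = '<database xmlns="http://example.org/toy/1.1/"><header>h</header></database>'
OUTDATED = '<database xmlns="http://example.org/toy/1.0/"><header>h</header></database>'

if __name__ == "__main__":
    print(is_valid(MATCHING))  # namespace matches the schema
    print(is_valid(OUTDATED))  # identical structure, different namespace URI
```

The second document fails purely because of its namespace URI, which is the situation the "update the namespace" caveat is about.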

The parameters in the command xmllint --noout --relaxng grampsxml.rng example.gramps mean the following:

  • --noout: This option tells xmllint not to output the XML content to the standard output after parsing or validation. It suppresses printing the XML document itself and only reports validation errors or messages.
  • --relaxng grampsxml.rng: This specifies that xmllint should validate the XML file against a Relax NG schema instead of a DTD or XML Schema. The grampsxml.rng is the path to the Relax NG schema file used for validation. --relaxng instructs xmllint which schema file to use.
  • example.gramps: This is the input XML file to be parsed and validated. xmllint reads this file to check its conformance with the schema specified via --relaxng.

In summary, the command runs the xmllint XML tool to validate the XML file example.gramps against the Relax NG schema in grampsxml.rng and only outputs validation errors, suppressing the full XML output.

xmllint DTD validation alternative to the Relax NG schema

Nick notes that this currently gives errors even if the namespace is updated.

To validate an XML file against a DTD using xmllint (instead of a Relax NG schema), you would typically use the --valid or --dtdvalid options.

Given the XML snippet specifying the DTD in the DOCTYPE declaration:


<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE database PUBLIC "-//Gramps//DTD Gramps XML 1.7.2//EN"
"http://gramps-project.org/xml/1.7.2/grampsxml.dtd">
<database xmlns="http://gramps-project.org/xml/1.7.2/">
  <header>...

You can validate it with xmllint as:


xmllint --noout --valid example.gramps

Here:

  • --valid tells xmllint to validate the XML file against the DTD referenced inside the file itself (in this case, in the DOCTYPE declaration).
  • --noout suppresses output of the XML content, showing only errors or validation messages.
  • example.gramps is the XML file to validate.

If the XML file did not include a DOCTYPE or you want to specify a DTD file explicitly, you can use:


xmllint --noout --dtdvalid grampsxml.dtd example.gramps

Where --dtdvalid grampsxml.dtd points to the external DTD file explicitly for validation.

In summary, to do the equivalent of validating with a Relax NG schema using a DTD, replace:

  • --relaxng grampsxml.rng with either --valid if the file’s own DOCTYPE declaration references the DTD, or
  • --dtdvalid grampsxml.dtd if you want to name the DTD file explicitly.

This will validate example.gramps against the defined DTD and report any non-compliance errors.

Depends on what you want to get out of your ‘validation’?
Suggest you define that first.
Validate against the schema you have, note variations.
Tweak either the schema for the instance or vice versa until you get a validation. Then edit the instance to move the ‘invalid’ content from the (old?) schema to the new one manually. I can’t see any automation working 100%.
HTH


My main goal would be to downgrade the XML version as much as possible. That means identifying which data would be lost and choosing what can be safely discarded. (Maybe along similar lines to the strategic Note markdown trimmer that @kku did in SuperTool.)

The thought is that more people would be willing to share data in Gramps XML if it was easier to assure broad usability.

So a county’s Historical Society might be willing to maintain a Place database (or maybe just a cemeteries list), a local history reference book list, or a local person list on an instance of @DavidMStraub’s Gramps Web. And visitors could pick and choose items to Export/Sync to their local Gramps.

But it wouldn’t fly if everyone had to be on the highest release of Gramps all the time. That would be as aggravating as Microsoft constantly pushing upgrades before you can do a particular task.

If a newer schema version only adds attributes, you can hope to produce a version where you can just “strip” the new attributes. But any less trivial change makes it impossible to produce a kind of backward compatible schema. What would be needed would be a tool to “downgrade” XMLs to older schema versions by making the necessary transformations (XSLT)?
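
For the trivial "attributes only" case, the stripping step might be sketched like this in plain Python (standard library only, no XSLT). The mapping of element names to newly added attributes is hypothetical and would have to come from comparing two real schema versions.

```python
import xml.etree.ElementTree as ET


def strip_new_attributes(xml_text, new_attrs):
    """Drop attributes the older schema lacks; return (xml, removed list).

    new_attrs maps an element's local name to the set of attribute names
    that were added in the newer schema (hypothetical input).
    """
    root = ET.fromstring(xml_text)
    removed = []
    for elem in root.iter():
        # Ignore the "{namespace}" part that ElementTree puts on tags.
        local_name = elem.tag.rsplit("}", 1)[-1]
        for attr in sorted(new_attrs.get(local_name, ())):
            if attr in elem.attrib:
                # Record what is discarded so the data loss is visible.
                removed.append((local_name, attr, elem.attrib.pop(attr)))
    return ET.tostring(root, encoding="unicode"), removed
```

Returning the removed values alongside the output lets the user see exactly what the downgrade would throw away before committing to it. On a namespaced file, ElementTree would write generated prefixes unless the namespace is registered first with `ET.register_namespace`.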

Disclaimer: haven’t read all the perplexity stuff above (TL;DR).


Have you asked Perplexity to help you write a Python script that does the full job?
– Check the schema version
– Remove non-compatible data
– Write to the target version
– Verify data integrity

Or try Cursor – Copilot recommends it.
It can scaffold the entire project setup from your description.