I was recently faced with the ever-moving nature of the internet.
My tree is annotated with “proofs” of records in the form of citations (along with source and repository). The URL for the data is stored in a Note with type LINK. There are thousands of such URLs.
This summer, I crazily (who would ever do this?) undertook checking the validity of these URLs, only to discover that most repositories had changed their archival software – and sometimes even their hostname – which invalidated the references. Where permalinks were used, the permalinks were invalidated too. I naively believed that “permanent” links were guaranteed to be strongly attached to the data, while it appears that the management software plays an important role (indeed, the ARK definition never specifies how such “permanent” links are created).
I have already corrected 3k links and I am not at the end of it.
So, I tried to specify how I could solve this maintenance problem by slicing the URLs into fragments which could be dynamically combined to provide the effective link to data.
The idea is to divide maintenance into smaller steps:
- repository for hostname and ARK authority
- source for document designation
- citation for intra-document “pointer”
The burden thus reduces to managing tens of repositories and hundreds of sources. The intra-document part is less subject to change, according to my observations, but this depends on the technological choices for access to the “final” data.
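As a concrete illustration of the slicing, here is a minimal sketch of the synthesis, assuming hypothetical field names and a simplified URL shape; it is not actual Gramps API code, only an illustration of the idea:

```python
# Minimal sketch of "dynamically synthesised URLs": each fragment is
# maintained independently and combined on demand. Field names are
# hypothetical, not Gramps objects.
from dataclasses import dataclass

@dataclass
class Repository:
    hostname: str        # maintained once per repository
    ark_authority: str   # ARK Name Assigning Authority Number (NAAN), or ""

@dataclass
class Source:
    document: str        # document designation within the repository

@dataclass
class Citation:
    pointer: str         # intra-document "pointer" (page, image, qualifier)

def synthesise_url(repo: Repository, source: Source, citation: Citation) -> str:
    """Combine the three maintained fragments into the effective link."""
    if repo.ark_authority:
        return f"https://{repo.hostname}/ark:/{repo.ark_authority}/{source.document}{citation.pointer}"
    return f"https://{repo.hostname}/{source.document}{citation.pointer}"

# Example: when a repository changes hostname, only the Repository record
# is edited; every citation pointing through it is re-synthesised.
repo = Repository("archives.example.fr", "12345")   # hypothetical values
src = Source("cb123456789")
cit = Citation("/f42.item")
print(synthesise_url(repo, src, cit))
```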
Nevertheless, this is a considerable improvement.
I have schematically described my idea in the attached proposal. I’d be glad to receive comments about it. I intend to implement it anyway, at least for my personal usage (GRAMPS is FOSS), but I presently have too little time for it.
Hmm. This problem bears a vague resemblance to what the Web Search add-on gramplet addresses.
In the original 2010 WebConnect pack add-on Quick Reports, @dsblank composed URLs from Gramps records to pass parameters to online databases. It suffered from a high rate of link rot because the online databases constantly evolved their patterns for parameter passing.
Because the patterns were buried in the code, it was very scary for non-coding websurfers to add or maintain the patterns. Perhaps the most important reason it was underutilized: the innovative interface used the QuickReports context menu, which made it too invisible and laborious to navigate.
In WebSearch, @Urchello took a different swing at the problem. Firstly, it made the pattern formulation much easier to access and maintain for non-coding websurfers who recognized URL-based query patterns. Secondly, it allowed custom Attributes to be used to store Uniform Resource Identifiers (URIs) and to feed those custom Attribute values as parameters in the URL composition. Finally, by moving the interface to a (configurable) Gramplet, it made the online lookup opportunities more visible and accessible.
However, it adds enormous “clutter” to the Attribute list for a single add-on. So much that it might require layering on some sort of Hierarchy or Filtering for the Attribute menus. And the labeling of the links in the UI is so highly abbreviated that they become obscure.
If these other concepts could be integrated (evolved) into the proposed “dynamically synthesised URLs” approach using a Citation (CITE) plug-in type, it could be a revolutionary refinement.
Due to the high rate of link rot, it might also need a suite of complementary tools, such as:
a validation report that generates a sample of each dynamically synthesised URL currently in use (or, more rigorously, all in use) as a webpage. Link rot could be quickly identified by running that webpage through a link-checker tool, like Xenu’s Link Sleuth (a minimal version of such a check is sketched after this list).
a harvester that archives dated Webpage (“Complete” or “Single File”) snapshots of the pages behind those synthesised URLs. (But makes them part of a local “Repository” with the URI as the “Call Number”, to avoid polluting/overloading the Media category.)
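As a minimal sketch of the bulk validity check mentioned above, assuming a plain list of synthesised URLs as input (a dedicated tool like Xenu’s Link Sleuth would do the same job on a generated HTML page):

```python
# Rough sketch of a bulk link-rot check over synthesised URLs.
# Uses only the standard library; real repositories may need retries,
# rate limiting, or a GET fallback when HEAD is rejected.
import urllib.request
import urllib.error

def check(url: str, timeout: float = 10.0) -> str:
    """Return a short status string for one URL."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return f"{resp.status} OK"
    except urllib.error.HTTPError as e:
        return f"{e.code} {e.reason}"       # e.g. 404: probable link rot
    except (urllib.error.URLError, TimeoutError) as e:
        return f"unreachable ({e})"         # DNS change, dead host, ...

# Hypothetical input: in practice this list would come from the
# re-synthesised citation links.
urls = ["https://archives.example.fr/ark:/12345/cb123456789/f42.item"]
for u in urls:
    print(u, "->", check(u))
```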
If you’re considering only ark:/ links, you could go one step further and let Gramps resolve the domain name. There is a registry (https://cdluc3.github.io/naan_reg_priv/naan_registry.txt) that allows that. When an archive changes domain name, they should update that file.
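As a rough sketch of that resolution step, assuming the registry keeps ANVL-style entries in which a `what:` line carries the NAAN and a `where:` line carries the institution’s current URL (the exact field layout should be verified against the file itself):

```python
# Rough sketch: build a NAAN -> base-URL map from the NAAN registry,
# so an ark:/NAAN/... link can be pointed at the currently registered host.
# The "what:"/"where:" field names are an assumption about the file format.
import urllib.request

REGISTRY_URL = "https://cdluc3.github.io/naan_reg_priv/naan_registry.txt"

def load_naan_registry(url: str = REGISTRY_URL) -> dict[str, str]:
    """Map each NAAN to its registered base URL."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    mapping, naan = {}, None
    for line in text.splitlines():
        if line.startswith("what:"):
            naan = line.split(":", 1)[1].strip()
        elif line.startswith("where:") and naan:
            mapping[naan] = line.split(":", 1)[1].strip()
            naan = None
    return mapping

registry = load_naan_registry()
print(registry.get("12345"))   # look up one (hypothetical) NAAN
```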
Unfortunately, not all archive sites use ARK. The synthesis must cope with that.
I had a look at the link. Unfortunately again, it is not up-to-date. Many hostnames are obsolete. Only a DNS-like registry could keep abreast of the changes.