I can understand that, but even then, I don’t think it makes sense to work with user-defined forms, or data structures that vary by form, where each evidence person has a different structure depending on the form the user defined. No professional software engineer would ever do that, and I know that because I can see the uniformity of the data structures showing through in the results on those sites. I’ve also done a few transcriptions for FamilySearch, working from home, and their software is also based on a universal design, which is flexible enough to work with data found in all sorts of sources.
Do you have a specification for these universal data structures? If so, I’m sure that we could write an importer.
Before I wrote the original census addon I tried to find census information to import, but unfortunately I had no luck. Perhaps more data is available now?
As far as the US goes, while a lot of our records are indexed with basic details like names, dates of birth, etc., many details are not included in those indexes. This applies to everything from censuses to birth and death certificates, local church records, and even old military records.
Germany also has a lot of records that are not fully indexed. Hamburg has very good indexing, though. Most notably, most indexes for German records do not include witnesses, which can be very important. Further, I live near the Library of Congress, where I get some of my German records from; however, you are not allowed to leave the Library with books, so I have to photograph all the pages with my phone and type the information in manually at home.
Poland is also lacking in sufficient indexing, especially when it comes to records from areas that used to be considered Germany.
England has indexes that are quite robust for some regions, like Cornwall, but I have found that indexes for other regions, such as Devon, can be incomplete.
Ireland had a lot of record destruction and loss, but free records for many parishes are available online. Those records are not indexed, however.
Sweden has some indexing done via paid sites, but it has a huge amount of records digitized and available for free on Riksarkivet. Also, the indexes that do exist come nowhere close to covering the body of records available, especially censuses.
France I cannot speak to as, despite my name, I am not French haha.
Czechia / Slovakia have almost no records indexed but a huge amount of their records are available online for free.
Switzerland is pretty good with indexes.
Hungary is lacking in digitization for my family so I would have to work from photographed records or pay to have scans done. These scans would then not have indexes.
German Jewish records are fairly well indexed; however, I currently have over 1000 images that a relative ordered, with no index, which are not available online.
There are other countries I am forgetting about but that is a fair amount still.
I actually have an answer for this one! Check out the USGenWeb project.
I don’t, but you can easily derive those structures from the search results that you can see on those sites. They use a single format that is independent of the type of source. And sometimes you can also find traces of those structures in the page source, like in parts of the JavaScript, or in metadata.
Except for the Dutch open archives site, which has a public API, none of these sites actually produce source data that you can import, because that would be very bad for their business, which is largely based on selling data to be attached to persons in user trees on their sites. And on all commercial sites, part of that data is not source data at all, but other trees.
What I mean to say is that for me, as a software engineer, it is quite easy to derive a universal data structure from what I see on sites like Ancestry and FamilySearch, where users can see source and tree data side by side, ready to be matched, copied, or attached. The data is grouped by person and preceded by a universal header that presents whatever data they have about the related event, plus the citation. That is very obvious when I look at these sites with a software engineer’s eye.
Reverse engineering is what I mean.
This is not entirely true. The fields you see for the transcriptions are based on the headers of whatever source you are looking at.
It’s also worth noting that the forms used by Gramps do have a standard format; the same code is used to load every form from its XML file. Because of this, users are able to create their own form files and load them into Gramps.
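To give an impression of what such a loader could look like, here is a minimal sketch in Python, assuming a hypothetical form-definition file; the element and attribute names are illustrative only, not the actual schema used by the Forms addon:

```python
# A minimal sketch of loading user-defined form definitions from XML.
# The element and attribute names here are illustrative, not the
# actual schema used by the Gramps Forms addon.
import xml.etree.ElementTree as ET

FORM_XML = """
<forms>
  <form id="uk1901" title="1901 UK Census">
    <column name="Name"/>
    <column name="Age"/>
    <column name="Occupation"/>
  </form>
</forms>
"""

def load_forms(xml_text):
    """Return a dict mapping form id -> list of column names."""
    root = ET.fromstring(xml_text)
    forms = {}
    for form in root.findall("form"):
        columns = [col.get("name") for col in form.findall("column")]
        forms[form.get("id")] = columns
    return forms

print(load_forms(FORM_XML))
# {'uk1901': ['Name', 'Age', 'Occupation']}
```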
What I mean is that the person data that I can see in the left column on FamilySearch have a universal structure from which elements can be easily copied to persons in the shared tree. And I see that on Ancestry too.
My main problem with the forms in Gramps is that, because forms are designed by users, I can’t count on any field being universally available, the way I can count on the fields in person objects (also on the sites), which may or may not have data in them. I just can’t count on form fields like I can count on the data in the tree.
To me forms are like spreadsheets, where each user creates his or her own mess.
I think this is the result of a difference in needs of users between online sites with massive user bases and individual local family tree software.
While it is important that a massive online genealogy program with many users maintains strict standardization, it is not much of a concern whether (for instance) all users of Gramps are using transcriptions that have the same format for a given source. Local programs allow the opportunity for user customizability which online record-repository-like programs do not.
The original code was written for my own personal purposes. I wanted something to speed up data entry and give me some reports and gramplets to show any inconsistencies between census returns.
I published the code thinking that it might be useful to others. There have been many definitions submitted, but I don’t review them (except for the English ones). Each user is free to use a submitted form or one of their own. If you use your own definitions then you have complete control over the attribute names.
We could introduce an information type. For example, attributes named “Occupation”, “Profession” and “Trade” could all have the type “occupation”. If we wanted to transfer data between users in a standardised format then this would be useful. If sites provided data in a standardised XML or JSON format for import then we could use that. Unfortunately I don’t see a common standard; the site that @RenTheRoot linked uses a different format from FreeCen, for example.
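As a rough illustration of that idea (the attribute names and type keys below are only examples, not an agreed standard), a simple mapping could normalise form attributes into information types:

```python
# A rough sketch of normalising form attribute names into information types.
# The attribute names and type keys are examples only, not an agreed standard.
ATTRIBUTE_TYPES = {
    "Occupation": "occupation",
    "Profession": "occupation",
    "Trade": "occupation",
    "Age": "age",
    "Where Born": "birthplace",
}

def normalise(attributes):
    """Convert {attribute_name: value} into a list of typed facts."""
    return [
        {"type": ATTRIBUTE_TYPES.get(name, "unknown"), "name": name, "value": value}
        for name, value in attributes.items()
    ]

print(normalise({"Trade": "Blacksmith", "Age": "42"}))
# [{'type': 'occupation', 'name': 'Trade', 'value': 'Blacksmith'},
#  {'type': 'age', 'name': 'Age', 'value': '42'}]
```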
In the UK we have FreeCen. Unfortunately there isn’t a download in XML, JSON or CSV format.
The USGenWeb site does have text file downloads which is better than nothing. However, there doesn’t appear to be any cross-site standard being used.
Ah yes, this is very true. In fact, I think there are various free census and other record websites floating around, none of which adhere to a standard format. Honestly, I would say that the standard for transferring such data is GEDCOM, which Gramps does indeed support haha. This is really data entry vs data import. The Forms addon is a data entry interface; data import interfaces would be something entirely different.
Absolutely correct
Actually, very few, if any, historical social research labs, archives and universities researching human social historical data (like census data etc.) use GEDCOM; every single one I have found uses CSV or JSON (often JSON-LD). I have actually never seen GEDCOM used in any research project outside lineage-linked niche sites like FamilySearch, MyHeritage, Ancestry, etc.
Even Oxford and other large universities use CSV or JSON for their open-data sources, even those that contain social humanities research.
As far as I can see, the default for sharing open data like census data or other social humanities research data, whether as transfers or as downloads from censuses etc., is actually CSV or JSON, or in some cases XML, not GEDCOM. I actually don’t think GEDCOM is used at all outside sites like MyHeritage, FamilySearch, Ancestry and similar, or outside software from developers who have designed their software around the ideology of the genealogy research of the LDS (the lineage-linked approach).
EDIT: I forgot to write that multiple sites also use some XML format in addition to CSV and JSON.
I wonder what format was used for exchanging contact tracking information between agencies during Covid?
In danger of being marked as OT again!
I am not sure regarding government offices, but nearly all the open data shared for Covid research that I have seen has been either in CSV or in a linked-data format like JSON-LD, which in practice means some form of network graph or knowledge graph format.
A lot of the research seems to have been done in software similar to Gephi and Cytoscape and then published with some online tool created for the purpose; I have seen sites using Plotly and similar JavaScript libraries.
I have not seen a single project that says it used genealogy-based or GEDCOM-based software for its research.
Just as one example, the data from the NY Times (on GitHub), from before they started to sync their data with the government data, is made accessible as raw CSV, i.e. just a CSV table with dates and numbers.
This project shared data in CSV and JSON: GitHub - opencovid19-fr/data: consolidation of data from official sources concerning the COVID-19 epidemic
We can find a lot more of this by searching github for “covid data” or similar strings.
I also tested a search with “census data”, just for the fun of it:
One thing I found that most of you most likely know about is this page:
https://www2.census.gov
Looks like their “Open Data” is presented as zipped text files, but they are so big that I don’t want to take the time to open one of them now.
Possibly. But part of my curiosity is about how relationships or connectivity is defined and stored between these nodes. This could be applied to data described by Elizabeth Shown Mills in Identity Problems & the FAN (Friends, Associates, and Neighbors) Principle. Neighbors can commonly be found through Census data.
Likewise, the way they specify unknown persons and do person matching is of interest.
Rather than come up with our own format, it seems like it would be better to use data structures that could be used with emerging research tools.
That can best be done with linked data, e.g. network graphs.
I did this with a dataset from a Norwegian national census from 1801, approx. 800k–900k people. I linked all the people based on addresses, and it was really simple to do. I used Cytoscape for that, because it can define networks based on columns from one single table, so I could import the whole census dataset and then create networks based on residence data, place or city, etc.
But the datasets are so huge, especially if you start to add data from multiple censuses, that I think the only smart way to go is to use a graph database or a multi-model database for all the data, e.g. from multiple census years, and then extract smaller datasets if you want to work with the data in, e.g., Gephi or yEd. Not because that software can’t handle a few million records, but because when you work with this type of record, it needs to be readable.
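For what it’s worth, the same linking idea looks roughly like this in Python with networkx; the column names and sample rows are made up for illustration, and people who share an address simply get an edge between them:

```python
# A rough sketch of linking census entries that share an address into a graph.
# Column names and sample rows are made up for illustration.
import networkx as nx

census_rows = [
    {"id": "p1", "name": "Ola Hansen",  "address": "Storgata 4"},
    {"id": "p2", "name": "Kari Hansen", "address": "Storgata 4"},
    {"id": "p3", "name": "Per Olsen",   "address": "Kirkeveien 2"},
]

graph = nx.Graph()
by_address = {}
for row in census_rows:
    graph.add_node(row["id"], name=row["name"], address=row["address"])
    by_address.setdefault(row["address"], []).append(row["id"])

# Connect everyone recorded at the same address (household/neighbour links).
for address, people in by_address.items():
    for i, a in enumerate(people):
        for b in people[i + 1:]:
            graph.add_edge(a, b, relation="same address", address=address)

print(list(graph.edges(data=True)))
# [('p1', 'p2', {'relation': 'same address', 'address': 'Storgata 4'})]
```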
Just a digression coming from a little hurricane in my brain…:
If Gramps used a multi-model database backend, let’s say ArangoDB just for fun (it could be anything: Neo4j, MongoDB, Cassandra, etc.), then a “census” gramplet could import a complete census dataset for a region of interest into the database. You could search the census data with graph algorithms for whatever you like, to find people in or close to your family, and then find information in that dataset that may not be provided by the online genealogy sites, e.g. a previous residence or when they moved to a new place, and based on that you could download and search a new dataset, etc.
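Just to make the digression a little more concrete, here is a sketch of what such a gramplet’s neighbourhood query might look like with python-arango; the database name, graph name, collection, and credentials are all made up for the example:

```python
# A sketch only: assumes a hypothetical ArangoDB database "genealogy" with a
# named graph "census_graph" and a "persons" vertex collection already loaded
# from a census dataset. All names and credentials are illustrative.
from arango import ArangoClient

client = ArangoClient()  # defaults to http://localhost:8529
db = client.db("genealogy", username="root", password="secret")

# Find everyone within two hops (household, residence, etc.) of a person of interest.
aql = """
FOR v, e, p IN 1..2 ANY @start GRAPH 'census_graph'
    RETURN DISTINCT { name: v.name, residence: v.residence }
"""
for neighbour in db.aql.execute(aql, bind_vars={"start": "persons/ooi-1801-0042"}):
    print(neighbour)
```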
All of this would of course be done in the “Research Project” feature that has been discussed before in another post.
When all of the data is added in the project for the Object of Interest, the data could be submitted to the main Gramps database as confirmed or proven data.
Then you could go to the next OoI and do the work over again, but this time you might also need a dataset for the neighboring county, so you would just download the complete dataset into the “gramplet” and start the new search.
If a user didn’t need this advanced form of data processing, they just wouldn’t have to use the gramplet and could add data manually.
The important thing about a feature like this is the possibility of customizing the forms/tables so that they fit the raw datasets and no attribute data is lost, as it often is in genealogy databases, which frequently don’t even list witnesses etc. in the indexed views.
When it comes to how the data/information is stored, I think the best format is a linked-data format or some form of a simple dated and cited “triplet”, e.g., “node 1, node 2, type, date of data, citation”.
Something similar to the CSV for the hierarchical place data, or just store a JSON-LD object in a JSON table…
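As a sketch of what such a dated and cited “triplet” (really a five-part record) could look like, again with made-up field names, it could be held in a small structure and written out as either CSV or JSON:

```python
# A sketch of a dated, cited relation record ("node 1, node 2, type, date, citation").
# Field names and sample values are made up for illustration.
from dataclasses import dataclass, asdict
import csv, io, json

@dataclass
class Relation:
    node1: str       # e.g. a Gramps person handle or external ID
    node2: str
    rel_type: str    # e.g. "neighbour", "witness", "same household"
    date: str        # date the relation is documented, ISO 8601
    citation: str    # pointer to the source record

rel = Relation("I0001", "I0042", "witness", "1894-05-12", "Parish register, p. 31")

# The same record as a CSV row or a JSON object:
buf = io.StringIO()
csv.writer(buf).writerow(asdict(rel).values())
print(buf.getvalue().strip())   # I0001,I0042,witness,1894-05-12,"Parish register, p. 31"
print(json.dumps(asdict(rel)))  # {"node1": "I0001", ...}
```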
I have not thought enough about that, since I stopped thinking about this feature in Gramps a long time ago; I found it more important to be able to get my data out in a format that could be used in other software without too difficult a workaround.
But if a feature like this were created in the spirit of Gramps, and the relation were given attributes/properties à la properties on edges in network graph software, then we could add additional values to those relations as we found them… and those attributes could be used for searches, network graph design, or reports for those who need them.
Maybe you could create a new discussion that has this as a topic?
It could be a wide one, with ideas on how to store linked data and how an export/import of linked data should be done…
e.g., should linked data only be for the top-level Gramps objects, should a relation (edge) contain custom attributes, etc., etc.
Maybe not that many would participate, but maybe it would give some good suggestions?
PS. I think you will need to clearly mark it as a “Discussion” so that no one takes it as a demand for changes.
That would be nice indeed, but in practice none of these seem to have the following to become a standard outside their own realm, and organizations like FHISO don’t understand how markets work, so they can’t change much either.
An example is that outside the genealogical world, there is a standard based on a principle that looks a bit like the FAN principle mentioned by Elizabeth Shown Mills. It’s called FOAF, which means Friend Of A Friend, and it’s described on Wikipedia.
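For anyone curious what FOAF data looks like in practice, here is a minimal sketch built with rdflib; the names and URIs are invented for the example:

```python
# A minimal FOAF example built with rdflib; names and URIs are invented.
from rdflib import Graph, URIRef, Literal
from rdflib.namespace import FOAF, RDF

g = Graph()
alice = URIRef("http://example.org/alice")
bob = URIRef("http://example.org/bob")

g.add((alice, RDF.type, FOAF.Person))
g.add((alice, FOAF.name, Literal("Alice Example")))
g.add((alice, FOAF.knows, bob))
g.add((bob, RDF.type, FOAF.Person))
g.add((bob, FOAF.name, Literal("Bob Example")))

print(g.serialize(format="turtle"))
```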
But standards like these can only take off when there are enough big guys like Google that want to invest in it, and that only happens if it gives them some sort of advantage. This is also why a standard like GEDCOM X never took off, and why the acceptance of GEDCOM 7 is slow too. Better exchange of genealogical data has no value for authors of family tree software, because it costs a lot, and it only makes it easier for users to move their data to a competing program.
GEDCOM X is accepted by vendors when it’s their only route to integration with FamilySearch, meaning that programs like Ancestral Quest, Legacy, and RootsMagic probably all speak GEDCOM X, on an encrypted channel to FamilySearch. I tried to eavesdrop on that, but failed; still, I bet that they use it, either with JSON or XML, whichever works best for their developers. FamilySearch supports both.
And like GEDCOM X, which is indeed mostly lineage-linked, other protocols are also mostly hidden from most of us users. I know that they exist, between archives and the big sites, but very few are actually visible.
One that is somewhat visible is the Dutch A2A protocol, where A2A means Archive To Archive. It was created by a student in 2009, and it looks like FamilySearch accepted it as an import format. And I guess that it did, because the signs are visible in this marriage record of my grandparents:
You can follow the signs by clicking the image on the left of that page, and when I do that, and switch the site to English, I get this:
And when you look at the page source for that (word wrap on), you can find the actual XML data near the bottom, marked as ‘a2a’. That piece of XML has all 6 persons in it, with their names, ages, and roles or relations. It also has a full source reference that you can use to create a citation like the one that is visible on the site, and a reference to the original dataset, which comes from this database, operated by the Dutch archives as a collective:
https://opendata.archieven.nl/en/
The site has all sorts of datasets, and with the right search words, I might find the whole set that includes this marriage, but that seems to be a little harder than I thought.
I’m just writing this to illustrate that the data are there, as downloadable XML files, and can be accessed by other means too. The standards are local ones though.
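To give an impression of what importing such a record could look like, here is a sketch that pulls persons and a citation out of an A2A-like XML fragment; the element names and sample data below are simplified illustrations, not the exact A2A schema:

```python
# A sketch of pulling persons and their roles out of an A2A-like XML record.
# The element names and sample data are simplified illustrations, not the
# exact A2A schema.
import xml.etree.ElementTree as ET

RECORD = """
<record>
  <person id="p1"><name>Jan Jansen</name><age>26</age><role>Bruidegom</role></person>
  <person id="p2"><name>Maria de Vries</name><age>24</age><role>Bruid</role></person>
  <source>Marriage register, Zuid-Holland, 1923, act 112</source>
</record>
"""

root = ET.fromstring(RECORD)
persons = [
    {
        "id": p.get("id"),
        "name": p.findtext("name"),
        "age": p.findtext("age"),
        "role": p.findtext("role"),
    }
    for p in root.findall("person")
]
citation = root.findtext("source")
print(persons)
print(citation)
```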
I don’t see one either, and to me that means that if we want one, we probably need to agree on some sort of persona object, like we discussed earlier, that can store the data in a universal way that is independent of the form used for entry, if any. And like the A2A standard that I mentioned in another message, it needs to be flexible enough to deal with a variety of source types, just like we do here in the Netherlands. That implies, for instance, that it should be able to store person ages when they are shown in one source type, and birth dates for another, to stay as close to the original as reasonably possible.
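A very rough sketch of such a persona object, just to make the idea concrete (the field names are mine, not a proposed standard): every field stays optional, so an age can be stored for one source type and a birth date for another, preserving whatever the original record actually says.

```python
# A rough sketch of a "persona" / evidence-person object. Field names and
# sample values are illustrative, not a proposed standard. Every field is
# optional so the object can mirror whatever the source actually records
# (an age in a census, a birth date in a civil registration, and so on).
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Persona:
    name: str
    role: Optional[str] = None          # e.g. "bride", "witness", "head of household"
    age: Optional[str] = None           # as written in the source, e.g. "42"
    birth_date: Optional[str] = None    # only if the source itself gives it
    residence: Optional[str] = None
    citation: Optional[str] = None      # full source reference
    extra: dict = field(default_factory=dict)  # anything else the source provides

census_entry = Persona(name="John Smith", role="head", age="42",
                       residence="12 High Street",
                       citation="1901 census (illustrative citation)")
birth_entry = Persona(name="John Smith", role="child", birth_date="1859-03-02",
                      citation="GRO birth index, 1859 Q1 (illustrative citation)")
print(census_entry, birth_entry, sep="\n")
```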
There may be similar standards in use between your National Archives and sites like Ancestry and Find-My-Past. And I guess that they exist, because scans for some of my English ancestors are available on both, and they are not the only ones.
This page shows another example of such a cooperation: