Working with evidence

Somehow, Discourse thinks I’m too pushy (am I?), so I can’t send another reply to the XML thread. That’s why I’m pasting what I wrote into a new topic:

The best solution, I think, is to work with a concept that I learned from fellow developer Thomas Wetmore back in 2010, in the days of the Better GEDCOM community. He referred to it as a persona, and although that was the first time I read about it, the term was most probably not invented by him. You can find his paper about it on this page, at #72:

https://tech.fhiso.org/cfps/papers

Simply put, you can see a persona as an extracted person, which can be stored like any other person in the database. When you work that way, every record results in a group of personae, each of which is linked to a single source, in which the person has a role, like parent, or child, or witness, whatever, just like we already have in the attributes generated by the Forms Gramplet.

Personae exist in lots of places, like in the left-hand column when you match sources on FamilySearch, and they also appear on Ancestry whenever you try to match a person in another tree, or a source, with your own.

Model-wise, the easiest way to deal with personae is to import them as normal persons and tag them as extracted, or something like that. And where software like Clooz or Evidentia breaks the concept by merging personae into a single person, losing information in the process, the proper way to deal with them is to create a link between the personae and the person that you think is the true individual (or conclusion person). This is what the developer of the Dutch program Centurial calls correlation.

Correlation works like merging, but it’s not destructive. It means that when I tell the software that I think that Johann Herman Borgstette, born in Spenge, and Jan Harmen Borgsteede, married in Amsterdam, are the same person, the software creates a new (conclusion) person that links to these two personae. And because it creates links, this virtual merge can easily be undone, so you will not end up polluting your tree.
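
To make that concrete, here is a minimal sketch in Python of what non-destructive correlation could look like. The Persona and ConclusionPerson names are hypothetical, not existing Gramps objects; the conclusion person only stores links, so undoing a correlation just removes a link:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical types for illustration only; these are not existing Gramps objects.

@dataclass
class Persona:
    handle: str            # internal ID, similar to a Gramps handle
    name: str
    source_handle: str     # the single source this persona was extracted from
    role: str = ""         # e.g. "child", "groom", "witness"

@dataclass
class ConclusionPerson:
    handle: str
    preferred_name: str
    persona_handles: List[str] = field(default_factory=list)  # links, not merged data

def correlate(conclusion: ConclusionPerson, persona: Persona) -> None:
    """Link a persona to a conclusion person without copying or merging anything."""
    if persona.handle not in conclusion.persona_handles:
        conclusion.persona_handles.append(persona.handle)

def uncorrelate(conclusion: ConclusionPerson, persona: Persona) -> None:
    """Undo the link; the persona itself is untouched, so nothing is lost."""
    if persona.handle in conclusion.persona_handles:
        conclusion.persona_handles.remove(persona.handle)

# The two personae from the example, correlated into one conclusion person.
p1 = Persona("P1", "Johann Herman Borgstette", "S-spenge-baptism", role="child")
p2 = Persona("P2", "Jan Harmen Borgsteede", "S-amsterdam-marriage", role="groom")
jan = ConclusionPerson("C1", "Jan Harmen Borgsteede")
correlate(jan, p1)
correlate(jan, p2)
```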

In a way, this is quite close to an association, where you can already create a link between persons, meaning that we don’t even need to change the data model much, except that we need a label other than ASSO to mark this link as evidence, or something.

When you visit the openarch site that I mentioned earlier, you can see that you can already download a GEDCOM for each source, so for that site you don’t even need special extraction software. The only change you need is an import that tags each imported person as extracted, so that you can apply a filter if you want.
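
As an illustration only (not the actual Gramps importer code, and the dict shapes and tag name are assumptions for this sketch), a post-import step along these lines is all the tagging would take:

```python
# Not the real Gramps import code; just the idea that every person created by a
# per-source GEDCOM import gets an "Extracted" tag plus a reference back to the
# source, so that a person filter can select them later.

EXTRACTED_TAG = "Extracted"

def tag_imported_persons(imported_persons, source_id, tag=EXTRACTED_TAG):
    """Mark every person from a per-source GEDCOM import as extracted evidence."""
    for person in imported_persons:
        person.setdefault("tags", []).append(tag)
        person["extracted_from"] = source_id   # which source the persona came from
    return imported_persons

# Usage, after importing the GEDCOM that openarch offers for a single record:
persons = [{"name": "Harm Bouwman"}, {"name": "Trientje Bouwman"}]
tag_imported_persons(persons, source_id="openarch:record-1884-94")
```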

Personally, I would also need some software to deal with the citations, because I don’t want formatted citations in my database. But that’s another issue.

Have you read Robert Charles Anderson’s Elements of Genealogical Analysis?

I haven’t, but I want to add that I’m not too fond of such books, because I think they make things way more complicated than needed. That’s why I also don’t like programs like the ones I mentioned. They make me feel like a bureaucrat, and although I have worked as a civil servant, I try to keep things as simple as possible.

Thomas Wetmore once wrote a short story about working with 5 x 3 inch index cards, with source data written on them, and arranging them to find matching persons. That’s a metaphor that I really like, because it’s easy to build and explain. And I think that I can explain it in just a few pages, so I see no need for a big book.

In my opinion the process is so simple that you don’t need the GPS (Genealogical Proof Standard) either.

The persona concept came from the GENTECH data model, which Anderson happened to work on with others. See:

Scroll to the bottom and click on the + button, and a link to download the model document is available.

Anderson’s methodology, which he used for the Great Migration Study, is pretty straightforward. It involves building linkage bundles that connect information together to build up dossiers on the research subject. In that sense it is equivalent to the story you relate about Thomas Wetmore and the index cards. It is really what we all do in our heads or on paper, for the most part, when we’re doing research.

In my mind, creating a Person record flagged as a Persona is not necessary and is the wrong way to approach it.

I think what is necessary is to very clearly separate the data from the conclusions drawn from it, in a structured manner that allows you to identify subjects of interest in the data and correlate them with the subjects representing the conclusions being drawn. When those subjects are people you can think of them as Persona and Person, but there is no reason it should not be done in a way that also works for anything else, like a Group or an Artifact.

I think that to accomplish this separation you need a Document or Content construct in an M:1 relation to a Source. In my mind this is what something like the Forms Gramplet should be populating with data. If possible, each piece of data should be separately identifiable. These pieces should not be treated as facts; they are the claims.
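
A rough sketch of how such a construct could look, with illustrative Claim and Content names that don’t exist in Gramps or GENTECH; the point is only that many Content records point to one Source, and that each claim is identifiable on its own:

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative names only (Claim, Content); nothing here is an existing Gramps
# or GENTECH type. Many Content records can point at one Source (M:1), and each
# piece of data is an identifiable claim, not a fact.

@dataclass
class Claim:
    claim_id: str        # each piece of data is separately identifiable
    subject_id: str      # the extracted subject (persona, place, group, ...)
    key: str             # e.g. "given-name", "birth-date", "occupation"
    value: str           # recorded as the source claims it, not as a conclusion

@dataclass
class Content:
    content_id: str
    source_id: str       # the one Source this extracted content belongs to
    claims: List[Claim] = field(default_factory=list)

# Example: one baptism entry extracted from a (hypothetical) parish register source.
entry = Content(
    content_id="D1",
    source_id="S-spenge-baptisms",
    claims=[
        Claim("D1.1", "P1", "given-name", "Johann Herman"),
        Claim("D1.2", "P1", "surname", "Borgstette"),
        Claim("D1.3", "P1", "birth-place", "Spenge"),
    ],
)
```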

Once you have this construct and the data available, I think you have what you need to offer two different user interfaces and approaches to correlating and documenting your evidence. They would be similar but different: one built around the GPS, like the Evidentia and Centurial approach, the other around linkage bundles, following the Anderson approach.

I seem to recall reading a comment from Thomas Wetmore a year and a half or so ago, when I started researching data models, talking about the potential of this very thing: actually having all the data at hand.

Use a birth as an example. Wouldn’t it be nice, after several years, to find some new information about someone, enter it in, and pull up a correlation report where you could view all the sources with all the birth information you associated or correlated with a person? All the dates found, the places found, who the informants may have been, when the information was recorded, your notes and reasoning about the conclusion you drew from it all at the time, and so forth? Sure, you can do this manually with a note today, but wouldn’t it be nice if you had a more structured process to facilitate it?
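
As a sketch of that kind of report, reusing the ConclusionPerson and Content/Claim shapes from the earlier sketches and assuming the persona handles are used as the claims’ subject ids, something like this could group every birth-related claim by the source it came from:

```python
# A sketch only; it assumes the ConclusionPerson and Content/Claim types from
# the earlier sketches, and that persona handles are the claims' subject ids.

def birth_correlation_report(conclusion, contents_by_subject):
    """Map source id -> list of (key, value) birth claims for one conclusion person."""
    report = {}
    for subject_id in conclusion.persona_handles:
        for content in contents_by_subject.get(subject_id, []):
            for claim in content.claims:
                if claim.subject_id == subject_id and claim.key.startswith("birth"):
                    report.setdefault(content.source_id, []).append(
                        (claim.key, claim.value)
                    )
    return report

# Usage, with the objects from the earlier sketches:
# birth_correlation_report(jan, {"P1": [entry]})
# -> {"S-spenge-baptisms": [("birth-place", "Spenge")]}
```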


Right. I see your point. Using a person for a persona may be overkill, especially when you want to embed personae in a source (document) in a simple way. In that case, you may not want to add place and event records, to store the properties of that persona.

It’s also nice to see the GENTECH model documentation. I now see that it was already 10 years old when I joined Better GEDCOM, and I had problems finding it, because of dead links, like here:

http://wiki-en.genealogy.net/GENTECH_Genealogical_Data_Model

Can you figure out where you read that comment from Thomas Wetmore? I just started digging through the RootsDev mailing list archive on the Google Groups site, and there was a lot of discussion about it, way too much to work through in a single night.

What I do remember from Thomas are his comments about events versus properties, a part where he thought that the GEDCOM-based model, where every date and place must be embedded in an event, was overly complex. And when I work with personae in some form, as I find them in the wild on international and local archive sites, I can imagine that it’s easier to use properties like birth-date and birth-place, or age and occupation, just like names and gender are already properties, or characteristics in GENTECH lingo.

What I’m still in doubt about is how detailed a workable model should be. What I mean is: when you talk about pieces of ‘data’ that must be identifiable, how detailed would that need to be? Is a persona a piece, or should his or her given name, surname, age, occupation, etc. all be pieces that must be evaluated? Or is that already taken care of because they’re all characteristics? That part is not clear to me yet.

Anyway, if a persona is a piece, it does mean that it must have some ID, so that it can be referred to from an actual person object. And if every characteristic is a piece, that will need an ID (or handle) too. Is that right?

What I do know is that I don’t want Gramps to have the kind of bureaucratic forms that I see in Clooz, Centurial, and Evidentia. They have way more fields than is acceptable to me, and I also see that these programs are way too American, in the sense that they push their users to use formatted citations in American styles, or in the case of Centurial, to use citation templates inspired by the works of ESM, which are way too American too, and of which Tamura Jones once wrote that you need a wizard to select the right one.

When I focus on the sources that I see online, on Ancestry, or FamilySearch, or local sites, I think that the attributes needed to generate a citation are quite easy to define, like some already are defined in the source hierarchy of the GENTECH data model, or in the Dutch A2A standard, or other standards used on our side of the Atlantic. And with these attributes, formatted citations can be generated for any culture that I can think of.
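
As a small illustration of that point, here is a sketch that renders two differently styled citations from one set of structured attributes. The field names are only loosely A2A-inspired, the URL and access date are placeholders, and the example values come from a Groningen birth record cited later in this thread; this is not a mapping to any real standard or template.

```python
# Sketch: one set of structured attributes, two generated citation styles.

def format_citation(attrs, style="dutch"):
    """Render one of two illustrative citation styles from the same attributes."""
    if style == "dutch":
        return (f"[{attrs['archive']}] {attrs['collection']}, {attrs['place']}, "
                f"{attrs['date']}, record number {attrs['record_number']}")
    # a rough EE-like alternative, just for comparison
    return (f"\u201c{attrs['collection']},\u201d {attrs['archive']} "
            f"({attrs['url']} : accessed {attrs['accessed']}), "
            f"record number {attrs['record_number']}, {attrs['date']}.")

attrs = {
    "archive": "AlleGroningers",
    "collection": "Civil registration births",
    "place": "Ten Boer",
    "date": "June 7, 1884",
    "record_number": "94",
    "url": "example.org",           # placeholder
    "accessed": "3 January 2020",   # placeholder access date
}
print(format_citation(attrs))           # Dutch-style string
print(format_citation(attrs, "ee"))     # EE-like string
```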

And like I wrote before, I prefer to avoid the GPS as much as possible, and work much like I can work on Ancestry or FamilySearch, where you can add a comment at the persona level if you want, but no deeper than that.

And finally, I’d love to get rid of the whole source/citation distinction, and turn the whole idea of citations with attached notes upside down and work with documents that have their attributes (like author, title, version, publication date) embedded, like the attributes that you can define in Word, or in the metadata of a picture. It would make life much easier, and in the end, it might also give us the opportunity to import data from Zotero, or a similar tool.

I had problems finding it, because of dead links

I seem to recall I had to dig a bit to finally find it myself. I assume you know Thomas’ DeadEnds model docs are available at http://www.bartonstreet.com/deadends/ and Tony Proctor’s STEMMA docs at https://parallaxview.co/stemma/

Can you figure out where you read that comment from Thomas Wetmore?

No, I just recall running across it, or a reference to it, when I was tracking these things down and thinking to myself that this guy gets it.

when you talk about pieces of ‘data’ that must be identifiable, how detailed would that need to be?

Well, you want things like the date and place and type, but they’re extracted and interpreted in relation to, and as part of, the extracted subject.

Anyway, if a persona is a piece, it does mean that it must have some ID, so that it can be referred to from an actual person object. And if every characteristic is a piece, that will need an ID (or handle) too. Is that right?

I don’t think each piece of data would need an ID, but the extracted subject would, as you need some way to construct the back reference.

When the view I am working on is mostly behind me, I will probably start prototyping something that can be shared as a straw man, to demonstrate how it might work.

they push their users to use formatted citations in American styles, or in the case of Centurial, to use citation templates inspired by the works of ESM

In my mind citations and citation handling are a separate issue from the data and evidence correlation process. I think Gramps is open and flexible enough that if someone wanted to implement some template based thing then fine, let them. So long as people who don’t want to use it don’t have to.

Also, anything added along the lines I have in mind needs to be purely optional and have no impact on existing Gramps users. As I see it, Gramps is a toolbox with lots of tools for people to track and do their work the way they choose to; we should try not to shoehorn anyone into any particular approach.

it might also give us the opportunity to import data from Zotero, or a similar tool.

I never worked with Zotero before but pulled it down and looked at it briefly about a month ago I guess. I can see the benefit in developing some kind of interface to work with it. Would be cool to add some kind of drag and drop support for it to the view I’m working on eventually.

Thanks for mentioning Thomas’ docs. I often get lost when using Google, so it’s good to have a proper reference. It’s also nice to see that my memory still works, in the sense that his document clearly shows his thoughts about embedding event information in persons, in what he calls vital structures. And I also see that he indeed just uses persons that can reference other persons, even in more than two layers, so there is no real need for a persona.

And I assume that in your mind a subject is a person(a) or event. Is that right?

I understand what you mean, and indeed they are not part of the correlation process. I do mention them though, because in my mind, working with evidence starts with the recording of evidence, or sources, whichever term you prefer.

I have a long-running dream of treating a source like an email, where you have a single object that has an author, a subject (title), a time stamp, a piece of text that may have some sort of mark-up (like HTML), and attachments. And I see that in the DeadEnds source too, in some way.

Treating a source like a single top-level object, like that email, would make my life much easier than having to work with repositories, sources, and citations, things that we have adopted from GEDCOM, but which are not used anywhere else. That means I think it’s important to provide a way to store the attributes that we now have in repositories, sources, and citations in the new source (or evidence) object, and to make it flexible enough to add attributes from A2A, DC, or EE, or whatever standard you can think of. And with those stored in the new source object, you can generate all sorts of formatted citations, if you want.

This single level source model is much like what you find in Zotero, where you also have attributes, and text that has been scraped from a web site.

And in that text, you might also think about embedding information like STEMMA does, although there are also other standards for that, like historical data.

Are you on GitHub?

his document clearly shows his thoughts about embedding event information in persons, in what he calls vital structures. And I also see that he indeed just uses persons that can reference other persons, even in more than two layers, so there is no real need for a persona.

When the time comes I need to sit down and review all of these models again. I find it is always best to examine something, then let it “digest” for a period of time, so to speak, and then revisit it, as I often see it in a different light.

And I assume that in your mind a subject is a person(a) or event. Is that right?

For me a subject is any subject. Person, place, event, artifact, animal, a group of objects.

Treating a source like a single top-level object, like that email, would make my life much easier than having to work with repositories, sources, and citations, things that we have adopted from GEDCOM, but which are not used anywhere else.

Oh I see what you meant earlier. Thank you, this is a good thought, I had not stepped back enough to think about them in a different context like that. Again, the GEDCOM influence.

Are you on GitHub?

Yes, cdhorn (Christopher Horn) · GitHub

This is the best way to show where and how you found your info. This is how all archives in France work.
Ignoring this would mean your work is done incorrectly. That “email” is another way to get info. If you store this URL in your database, will the data still be available in twenty years?
Personally, I don’t trust a URL.
I have 25 repositories, 3240 sources and 14400 citations. For an event, you need less than 2 minutes to verify it.

The only thing missing in Gramps is having a person as a repository. In that case, you would know that the information was given by that person. The problem is that you will not be able to verify it once that person is deceased.


The problem is that outside GEDCOM-based software, no one actually uses separate entities for repositories, sources, and citations. Provenance is encoded in call numbers, which are often more like paths, and when you really want to respect the fonds (which is a beautiful French word) you may need more than two levels, like in this reference for a French document that I found on an English site:

I find similar multi-level paths in German archives, and in Amsterdam, and when you search for sources on FamilySearch, or download persons with the GetMyAncestors tool, you will also see that they have dropped the whole separation of sources and citations, and simply provide a formatted citation string that has lots and lots of levels. And when you use EE, or a tool like Zotero, you will also see that it has a flat model for storing citations.

Yes, you’re right, Serge: you have to document exactly where things came from, and I think no one disagrees with that. And the GEDCOM model works.

The question posed is whether there is a different way to go about it that might work better. But in the end, after thinking a bit, I don’t see one. You might change the names of some entities around, rename citation to evidence perhaps, or introduce a source hierarchy with the repository at the top level. But no matter what, you still have a source and you still have a citation to identify the location of the information in the source.

This is what the ARK concept is based on. It can reference anything using an ARK link. The repository is embedded in it, the object (or idea) has its own unique ID, the page if it’s a book, etc.

I don’t like the ARK concept. Indeed, you have the info directly. My problems are:
What source is referenced by this ARK URL?
Will this concept still be available in the future?
Is this concept used by all archives (repositories) in all countries?

I understand what you (and Serge) say here, but I have a different opinion about the GEDCOM model, which is that, if it worked, there would be no market for Evidence Explained, nor anyone wanting to discuss things in Better GEDCOM, nor any reason for FS to want to move to GedcomX, which is totally different, source- and citation-wise. I also see fellow genealogists asking for advice on Facebook on how to create a useful hierarchy with sources and citations. Most people ignore repositories, and with good reason, I think.

When I look at online sources in the wild, I often see formatted citations, like

“Netherlands Births and Baptisms, 1564-1910”, database, FamilySearch (FamilySearch.org : 3 January 2020), Harm Bouwman in entry for Trientje Bouwman, 1884.

and

“Netherlands, Groningen Province, Civil Registration, 1811-1940,” images, FamilySearch (FamilySearch.org : 22 May 2014), Ten Boer > Geboorten 1883-1892 > image 79 of 533; Nederlands Rijksarchiefdienst, Groningen (Netherlands National Archives, Groningen).

which actually refer to the same record, which can also be cited as

[AlleGroningers] in Groningen (Netherlands), Civil registration births
Bron: boek, Period: 1884, Ten Boer, June 7, 1884, Geboorteregister 1884, record number 94

and only this last one is simple enough to fit into the repository, source, citation structure.

But like you said, we can delay the citation until we have a proper citation language, and use the word source for what it is: an artifact that you try to use as evidence. And in the world of (archival) science, there is no place for using the word citation like it’s used in GEDCOM (and Gramps). In that world, a citation is the full string, as presented above, and not just a page number and a date, as defined in GEDCOM. Nobody else uses it like that.

When I refer to an email as an example, I mean the email as a single object, with no other hierarchy than that it can be stored in folders, either inside your email program or as a file on disk. And when I translate that to an evidence object, a term that I use to avoid confusion with other objects that we already have, it means that you can have an object that has all the attributes that you need to fill the repository, source, and citation fields, plus all the things that an email can have, like mark-up and attachments, and structured data that represents the event and the persons involved. And depending on the actual source that the evidence object is based on, it should also have the citation string as supplied by the site, so that the proper elements can be extracted later.
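
As a sketch of what such an evidence object might carry (all names illustrative, not an existing Gramps object, with example values taken from the AlleGroningers citation quoted above):

```python
from dataclasses import dataclass, field
from typing import Dict, List

# A sketch of the "email-like" evidence object described above; illustrative only.

@dataclass
class EvidenceObject:
    evidence_id: str
    attributes: Dict[str, str] = field(default_factory=dict)  # A2A/DC/EE-style fields
    text: str = ""                     # transcription or scraped text, may carry mark-up
    attachments: List[str] = field(default_factory=list)      # media paths or URLs
    citation_string: str = ""          # the formatted citation as supplied by the site
    persona_handles: List[str] = field(default_factory=list)  # the extracted subjects

ev = EvidenceObject(
    evidence_id="E1",
    attributes={
        "title": "Geboorteregister 1884",
        "place": "Ten Boer",
        "date": "June 7, 1884",
        "record_number": "94",
    },
    citation_string="[AlleGroningers] Civil registration births, Ten Boer, "
                    "June 7, 1884, Geboorteregister 1884, record number 94",
    persona_handles=["P1"],
)
```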

When used like this, the other Gramps objects can largely stay as they are.