Blue skying: Natural Language Processing shell

This is abso-f***ing-lutely blue-sky territory.

But it would be wondrous if a Gramplet were created that allowed a chunk of free-form text (hand-typed, Notes of ‘Transcript’ type, or scraped from a webpage) to be sent to a Natural Language Processor and the transformed content to be received back.

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

Even at the crudest levels, it probably already identifies key words and gives them grammatical context: these are People (see “Disable everything except NER (named entity recognition)” in the spaCy documentation), these are probably first names, these are dates, et cetera.
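As a rough sketch of that crudest level: the snippet below asks spaCy for named entities only and groups them by label. It assumes spaCy and its small English model `en_core_web_sm` are installed; `extract_entities` and `group_by_label` are made-up helper names, not anything that exists in Gramps. I haven't tuned this, so treat it as a starting point rather than a recommendation.

```python
from collections import defaultdict


def extract_entities(text, model_name="en_core_web_sm"):
    """Return (text, label) pairs found by spaCy's entity recognizer.

    Assumption: spaCy and the named model are installed. The import is
    inside the function so nothing is loaded until it is called.
    """
    import spacy

    nlp = spacy.load(model_name)
    # Run only the "ner" component; the other pipes are skipped for this call.
    with nlp.select_pipes(enable=["ner"]):
        doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


def group_by_label(pairs):
    """Group (text, label) pairs into {label: [text, ...]}."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)
```

With the stock English models, people usually come back labelled `PERSON`, places `GPE` and dates `DATE`, so `group_by_label(extract_entities(note_text))` would give one list per kind of thing to highlight.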

Just identifying blocks of Text that are significant and indicating context would help you target your efforts. (Like getting the teacher’s copy of Cliff’s Notes where they had highlighted material for good essay questions.)

But this would be the “low return on investment” test case. Once Gramps was talking to an NLP and had an example of how to accept processed data back, the possibilities would be unlimited.

This library HAS to be a huge resource suck. So it would be great if no resources were allocated until Text was fed into it and processing was manually triggered. And then after processing, it would be great to demonstrate de-allocating those resources.
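One way the "allocate nothing until triggered, then let go" idea could look is below. `LazyNLP` is a hypothetical wrapper, not a spaCy or Gramps class; the loader is passed in, so in practice it might be something like `lambda: spacy.load("en_core_web_sm")`. Dropping the reference and calling `gc.collect()` only makes the memory reclaimable by Python; whether it goes back to the operating system is not guaranteed.

```python
import gc


class LazyNLP:
    """Hold an NLP pipeline that is loaded on first use and can be released."""

    def __init__(self, loader):
        self._loader = loader  # callable that builds the pipeline
        self._nlp = None       # nothing is allocated yet

    @property
    def loaded(self):
        return self._nlp is not None

    def process(self, text):
        """Load the pipeline if needed, then run it on the text."""
        if self._nlp is None:
            self._nlp = self._loader()
        return self._nlp(text)

    def unload(self):
        """Drop the pipeline and ask the garbage collector to run."""
        self._nlp = None
        gc.collect()
```

A Gramplet could call `process()` from a "Process" button and `unload()` when it is closed or goes inactive.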

Automatic processing and dynamic allocation might never be needed if early results showed that the library sucked wind as well as resources.

Could you post an example (faked, of course) of sample inputs/output pairs for the process you’re dreaming of?

BugBear

The Gramps-to-NLP API would pass a free-form text chunk to the NLP with parameters that specify what to look for and what data structure is wanted back. The input seems to be a string, a file/document, or a collection of files/documents. I’m not certain what sort of format their NLP passes back. But from skimming their website, it looks like the options include requesting back an array of found terms (with context markers and parsing locations) or a fully marked-up copy of the original text that unambiguously identifies the found terms.
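To make those two return shapes concrete, here is a small sketch. `FoundTerm` and `mark_up` are names invented for illustration (this is not spaCy's actual return format), and the bracket markup is just one arbitrary way to tag the text. It assumes the found terms don't overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FoundTerm:
    text: str   # the matched words
    label: str  # context marker, e.g. "PERSON" or "GPE"
    start: int  # character offset where the match begins
    end: int    # character offset just past the match


def mark_up(source, terms):
    """Return a copy of source with each found term wrapped in [LABEL: ...]."""
    pieces = []
    cursor = 0
    for term in sorted(terms, key=lambda t: t.start):
        pieces.append(source[cursor:term.start])
        pieces.append(f"[{term.label}: {source[term.start:term.end]}]")
        cursor = term.end
    pieces.append(source[cursor:])
    return "".join(pieces)
```

The list of `FoundTerm` records is the "array of found terms" option; `mark_up` turns the same records into the "marked-up copy" option.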

One of the spaCy NLP parameters is processing for proper names, or Named Entity Recognition.

What the Gramplet does with the structured data from the NLP is limited only by our imagination.

But let’s give an example of a Note-View based NLP-fed Gramplet for sharing that Citation to Persons mentioned in biographies/obituaries. (Like a Hints feature.)

So, the Gramplet has a table with a variation of the immediate relatives information seen in the Relatives Gramplet as cells in a table column:

In the adjacent column are cells with markers if this citation is already shared to the relative but blank otherwise. The individual NLP discovered names can be dropped in cells. If an instance of the Citation doesn’t exist for that person, a new share is created when the Gramplet commits the table. If a shared instance already existed for a particular relative, no new share is added.
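The commit rule described above (add a share only where one doesn't already exist) might be sketched like this. It uses plain dicts and sets rather than the real Gramps database API, and `commit_shares` plus the ID strings are hypothetical.

```python
def commit_shares(citation_id, assignments, existing_shares):
    """Create citation shares for relatives that had a name dropped on them.

    assignments:     {person_id: dropped_name or None} from the table rows
    existing_shares: set of (citation_id, person_id) pairs already in the tree
    Returns the person_ids that received a new share.
    """
    created = []
    for person_id, dropped_name in assignments.items():
        if not dropped_name:
            continue  # nothing was dropped in this row
        key = (citation_id, person_id)
        if key in existing_shares:
            continue  # already shared, so no duplicate is added
        existing_shares.add(key)
        created.append(person_id)
    return created
```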

This would be a capability similar to the Source Linker in FamilySearch:

A natural extension would be to insert Gramps Person Links into the Note at the same time shares of the Citation are created. Or to offer a Data Entry like option to add missing persons.

A more intelligent version would recognize when “Mrs.” form names (or couples) existed in the text and offer Families instead of Persons to share/link. Or offer linking out to a couple of degrees of separation of extended family.
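Spotting the "Mrs." forms doesn't strictly need an NLP; a regular expression gets surprisingly far. The sketch below is a heuristic with a made-up function name (`find_married_forms`). It can't tell whether the name after "Mrs." is the husband's or the woman's own, so the result is only a hint that a Family might be the better thing to offer.

```python
import re

# "Mrs." followed by initials and/or capitalized words, ending in a surname.
MARRIED_FORM = re.compile(
    r"\bMrs\.\s+((?:[A-Z](?:\.|[A-Za-z]+)\s+)*[A-Z][A-Za-z]+)"
)


def find_married_forms(text):
    """Return a list of dicts describing each 'Mrs. <name>' found in text."""
    results = []
    for match in MARRIED_FORM.finditer(text):
        name = match.group(1)
        results.append({
            "as_written": match.group(0),
            "name_after_title": name,        # possibly the husband's name
            "surname": name.split()[-1],
        })
    return results
```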

Another more intelligent version will have “trained” the NLP to recognize Gramps links around chunks of text and either ignore that pre-identified content or leverage it.

A really intelligent version might compare Soundex codes of NLP-identified names against those of extended family to make match guesses.
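For the Soundex idea, here is a self-contained sketch of American Soundex plus a naive matcher. `match_guesses` is a hypothetical helper that simply treats the last word of each (non-empty) name as the surname, so suffixes like "Jr." would need stripping first. Gramps has its own Soundex support; this standalone version is only to show the comparison.

```python
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Return the 4-character American Soundex code for a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (first + "".join(digits) + "000")[:4]


def match_guesses(found_names, family_names):
    """Map each found name to known relatives whose surname sounds alike."""
    by_code = {}
    for known in family_names:
        by_code.setdefault(soundex(known.split()[-1]), []).append(known)
    return {name: by_code.get(soundex(name.split()[-1]), [])
            for name in found_names}
```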

Or it might use the contextual capabilities to segregate the surviving family members and offer the option to put ‘died before’ Death Events in family members without a death event… or refine a ‘died before’ date if the new Citation occurred earlier.
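The "refine a ‘died before’ date" rule boils down to keeping whichever bound is earlier. A minimal sketch, with a made-up function name and plain `datetime.date` values standing in for Gramps date objects:

```python
from datetime import date


def refine_died_before(existing_bound, citation_date):
    """Return the tighter 'died before' bound for a pre-deceased relative.

    existing_bound is None when the person has no death event yet.
    """
    if existing_bound is None:
        return citation_date
    return min(existing_bound, citation_date)
```

So a person with no death event gets "before" the obituary date, and an existing bound is only replaced when the new Citation is earlier.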

A similar feature might ask the NLP for Places and compare place names within 50 or a hundred miles of the Person’s place of Death. Then allow adding dated Residences (as opposed to Citation instances) to the various people found.
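The "within 50 or a hundred miles" filter could be done with the haversine great-circle formula, assuming the Places involved have latitude/longitude recorded. The function names and the 3958.8-mile mean Earth radius are my choices for this sketch.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an approximation


def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # min() guards against rounding pushing the value slightly above 1.
    return 2 * EARTH_RADIUS_MILES * math.asin(min(1.0, math.sqrt(a)))


def places_within(origin, candidates, max_miles=50):
    """Return names of candidate places within max_miles of origin.

    origin:     (lat, lon) of, say, the place of Death
    candidates: {place_name: (lat, lon)}
    """
    return [name for name, (lat, lon) in candidates.items()
            if miles_between(origin[0], origin[1], lat, lon) <= max_miles]
```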

Here’s a sample obituary that could be used for testing. To see what kind of name extraction the NLP offers. And to see what tokens spaCy creates if given free rein to fully tokenize & markup free-form text for context.

Gramps could just dump the outputted raw results into another Note in the same Citation.

"The Herald
New Castle, Pennsylvania
Wed., September 5, 1917
Page 2

William F. McCullough, Sr.

The death of William F. McCullough occurred last night at the home of his daughter, Mrs. Robert Lowery of 136 Boyles avenue. He had been in poor health since January.

Mr. McCullough was born in Westmoreland county and was 78 years of age. During the past 47 years he made his home in this city and was one of the most highly respected citizens. He was the oldest member of the Mahoning Lodge, No. 243 F. and A. M., having been a member for 45 years. The deceased was a member of the First Christian church and was an active member until his health failed.

His wife Susan Gould McCullough preceded him in death six years ago. He is survived by the following children: Mr. T. B. McCarthy of Bellvue, Mrs. Robert Lowery of this city, W. F. McCullough, Jr., H. C. McCullough of Detroit, Mich., J. E. McCullough of Pittsburgh, two sisters, Mrs. R. H. McCullough of Edenburg, and Mrs. Jane Knapp of Knoxville and one brother, A. E. McCullough of Detroit. Mich. Eight grandchildren also survive.

The funeral services will be held on Thursday afternoon at 2:30 o’clock from 136 Boyles avenue. - Rev. Williams and Rev. Sniff will officiate."

This might be a better example. Let’s say that you want to extract all the citation-worthy data from some text.

So you run an NLP diagramming gramplet on the note.

It identifies a series of proper names and highlights them. You pick one of the names as the central focus of the Article and use Share (or Add) to open a selection dialog for a Person, Family, Place or Event to match that name. (Gramps adds a Link to that block of highlighted text in the Note and attaches a new citation with the Note.)
Then the gramplet brings up a list of People and Families within n degrees of separation of that focal person (or Places within n miles) and lets you drag’n’drop to link those other highlighted names (changing the highlight color as links are created).
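Listing everyone within n degrees of separation is a breadth-first search over whatever relationship links are available. The sketch below uses a plain dict as the graph; `within_degrees` and the sample names are illustrative, and a real Gramplet would walk the Gramps family and person records instead.

```python
from collections import deque


def within_degrees(graph, focus, max_degrees):
    """Return {person: degrees} for everyone within max_degrees of focus.

    graph maps each person to the people directly linked to them
    (parents, spouses, children, siblings).
    """
    distances = {focus: 0}
    queue = deque([focus])
    while queue:
        person = queue.popleft()
        if distances[person] == max_degrees:
            continue  # do not expand past the limit
        for neighbor in graph.get(person, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[person] + 1
                queue.append(neighbor)
    del distances[focus]  # the focal person is not a candidate
    return distances
```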

You decide that you need to set a series of Residences for all the Places and People mentioned. So you select the simple form at the bottom of the gramplet and choose Residence as the type (the form inherits the Citation and its date). Then you drag the People and Places to the form and hit Add when it is correctly composed. The Residence event is created with the citation under the Person. You move on to the next person. If they have the same location, you just drag the person and hit Add.

For the people who have been linked but have no information to create a Residence event, drag the living to the “Role” list box and select “Surviving”; pressing Add shares the Event to those people. Then repeat with a “Role” of “Deceased”.

The BYU Linking Lab already does some basic NLP to pass hints through to the FamilySearch SourceLinker tool. Here’s an example of hints from an NLP-processed obituary where the hints have been manually Attached:

It recognized Names (and differentiated those from Place names), applied a Relationship (usually sister/brother, son/daughter, parent, spouse; it tends to have problems with in-laws and niblings), determined WHO was the deceased, put up a Relationship-style form focused on the deceased and made preliminary guesses of which names matched the known family members. The ambiguous names were just put in a list of draggable persons at the bottom.

Where it REALLY fell down was with the females in the Obituary, since most were listed as Mrs. <husband>.

For each of those, it was necessary to navigate the focus to each sister/daughter so that the Husband would be displayed in the Relationship style view. Then the names could be dragged to match. This was awkward and error prone. (It really needed an Expand All nodes feature that would have shown all spouses at a particular family level.)

The other place it really exceeds Gramps is that finely grained forwards-and-backwards-linked Citations are automatically generated for each attachment. This is a painfully tedious process in Gramps.

The missed opportunities include:

  1. a simple method of adding a Known Residence at a particular time {"sister, Mary of Springfield, Mass"}
  2. date refinements {if a sister is “Mrs.” Smith, then the totally unknown date of her Smith marriage is between her birth year and the current year. If a brother pre-deceased, then his unknown death date is before the obit date and, if known, after his birth.}
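Both date refinements in the list above amount to bracketing an undated event between a birth and the obituary. Here is a sketch with a hypothetical `event_window` helper; I'm reading "the current year" as the year of the obituary, which is my assumption, and using the start of the birth year as the lower bound.

```python
from datetime import date


def event_window(birth_year, obit_date):
    """Return (earliest, latest) bounds for an undated marriage or death.

    birth_year may be None when the birth is unknown, in which case only
    the upper bound (the obituary date) is available.
    """
    earliest = date(birth_year, 1, 1) if birth_year is not None else None
    return (earliest, obit_date)
```

The window could then be written into the event as a "between" or "before" date, with the obituary's Citation attached.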