Gramps, AI (Artificial Intelligence) and the Future

The Copilot case against GitHub, Microsoft, and OpenAI certainly influenced our decision. I also looked for opinions of open source institutions.

Judging by the July 2024 Register article “Coders’ Copilot code-copying copyright claims crumble against GitHub, Microsoft”, decisions appear to be going in Copilot’s favour.

In order for us to apply a software license to contributed code we need the contributor to own the copyright. This may be a problem if the AI generates code that substantially borrows from its training code. Such code could be classed as a derivative work, and we would have to abide by the original licence. At the minimum this would involve crediting the original, but could also taint our licence.

In the case of Copilot, there is a filter that helps to avoid such situations. A carefully chosen set of code completion tools with approved settings may be a possible route forward. We may also decide to impose additional restrictions such as limiting the size of code snippets, marking AI-generated code, crediting original sources, etc.

The legal complications will continue to be clarified. We will keep an eye on developments.

It’s not about AI writing code; it is about the possible copyleft licensing implications of using AIs in development.

Sorry, but how can you have features without writing code? If you do
not write code you can have no features. It seems to me to be the old
chicken-and-egg scenario.
phil

There are some coding assistants which disclose their sources, while others don’t. Some assistants can train on your repos (including private repos) and can even ingest wikis to answer questions. These products aren’t free, but those who need legal protection and protection of their Intellectual Property (IP) use such tools today.

@Nick-Hall Thanks for pointing to the case against Microsoft and OpenAI, it was helpful.

How about the following use of an AI? Does it violate the “no AI in development” guideline?

“As an expert in Python, database design and the Gramps genealogical software, please identify the specific code sections within the Relationships view (gramps/gramps/plugins/view/relview.py at master · gramps-project/gramps · GitHub) that directly retrieve the Families (IDs or handles) of the Active Person, and the Persons (IDs or handles) of those Families.”

That prompt did not give me the data requested and a follow-up prompt was needed.

Purpose:

I wanted to find out why the Relationships view retrieves all the data for Immediate Families (including person name/gender and fallback Birth/Death Event data with date/place) so efficiently, while the PersonsInFamilyFilterMatch ( infamilyrule.py ) addon rule takes seconds to execute.

The prompt returned info about the add_family_members() method, but what I wanted was the method that directly handles the retrieval and display of person and event data for the Relationships view.

Good question. It would be helpful to have clarity on this. Since it hasn’t been posted in this thread yet, here’s the wiki for Contribution rules for Gramps which has clear rules about code committed to the repository, although I don’t see a broad statement like “no AI in development” (or I haven’t found that yet).

Thinking out loud…what if you asked a Gramps developer the same question? That would probably not be prohibited, so my initial thought is that asking a tool (including an AI-based tool) questions seems reasonable, as long as it doesn’t result in generating code which is then committed to the repo.

Thoughts?

One of the difficulties is that AIs tend to anticipate the ultimate objective, so they tend to offer code when asked a code-related question.

Engineer the prompt and instruct the AI to not reply with any code?

Worse. No doubt about that. And that’s largely because there’s no I in so-called ‘AI’. It’s a language model, which can create code that looks nice, but there’s no real knowledge behind it.

Examples:

  1. I made several efforts to let ChatGPT write simple text filters to extract info from GEDCOM files, or make small modifications to them, like modifying month names in GEDCOMs created by Ancestry, and it often failed, because it had no idea that in real GEDCOM, the tag is always preceded by a number. It created code that looked for tags at the start of each line, but that’s not where the tags are. It had some vague knowledge of tags, but not precise enough to write working code.
  2. When I asked ChatGPT to write code to add checksums to my UIDs, it did indeed write code that added checksums, but they were not of the type that is actually used by other software, like PAF.

A human will normally not make such mistakes, because he/she will take time to figure out where the tags actually are, or what type of checksum it really is.
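
To make the point about tag position concrete, here is a minimal Python sketch of splitting a GEDCOM line into its parts. It assumes the usual `level [@xref@] tag [value]` layout; the regular expression and the `parse_line` helper are hypothetical names made up for this illustration, not code from Gramps or from ChatGPT.

```python
import re

# Assumed, simplified layout: level, optional @XREF@, tag, optional value.
LINE_RE = re.compile(r"^\s*(\d+)\s+(?:(@[^@]+@)\s+)?(\S+)(?: (.*))?$")


def parse_line(line):
    """Return (level, xref, tag, value), or None if the line does not match."""
    match = LINE_RE.match(line.rstrip("\r\n"))
    if match is None:
        return None
    level, xref, tag, value = match.groups()
    return int(level), xref, tag, value or ""


if __name__ == "__main__":
    for sample in ("0 @I1@ INDI", "1 NAME John /Smith/", "2 DATE 12 JAN 1900"):
        print(parse_line(sample))
```

The tag is the third element of each tuple, so a filter that looks for `NAME` or `DATE` at the very start of the line never finds it.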

Another example, also from ChatGPT:

I just asked it to create a program that removes all lines with tag _MASTER from a GEDCOM file, which is quite a simple task.

What it wrote was a program that reads all lines from a file, and writes those that do not contain the string ‘_MASTER’. And that’s not correct, because in theory, the string ‘_MASTER’ may appear in a note, or somewhere else.
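
A minimal Python sketch of the stricter approach: compare the tag field itself rather than searching the whole line for a substring. The helper names `tag_of` and `remove_tag` are hypothetical. Dropping the subordinate lines (higher level numbers) that follow a removed line is an assumption on my part about what one would usually want.

```python
def tag_of(line):
    """Return (level, tag) for a GEDCOM line, or None if it is malformed."""
    fields = line.split(maxsplit=3)
    if len(fields) < 2 or not fields[0].isdigit():
        return None
    # On record lines such as "0 @I1@ INDI" the tag follows the xref id.
    if fields[1].startswith("@") and len(fields) > 2:
        return int(fields[0]), fields[2]
    return int(fields[0]), fields[1]


def remove_tag(lines, tag="_MASTER"):
    """Return the lines without those carrying `tag`, or their sub-lines."""
    kept = []
    skip_level = None  # level of the removed line whose sub-lines we drop
    for line in lines:
        parsed = tag_of(line)
        if parsed is None:
            kept.append(line)  # leave malformed lines untouched
            continue
        level, line_tag = parsed
        if skip_level is not None and level > skip_level:
            continue  # subordinate line of a removed line
        skip_level = None
        if line_tag == tag:
            skip_level = level
            continue
        kept.append(line)
    return kept
```

With this version, a line such as `1 NOTE about _MASTER flag` survives, because its tag is `NOTE` and the string only appears in the value.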

@ennoborg A general purpose model such as the one used by ChatGPT may not perform as well as specialized models such as those used by coding assistants. There are quite a few available if you feel like experimenting.

Well, there is another part that I’m interested in, and that is improving existing code, like our relationship calculators. They are quite slow, compared to the ones in other programs, and it would be nice if AI could study the code, and show where it does redundant things, and where it can be improved in other ways.

Another subject would be something like comparing GEDCOM files, which is quite a nuisance, because a normal text comparison does not ignore irrelevant things like IDs.
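
As a rough idea of what such a comparison could look like, here is a Python sketch that masks every `@...@` identifier before running a normal text diff with the standard `difflib` module. The names `normalize` and `diff_ignoring_ids` are hypothetical, and the approach is deliberately naive: it assumes both files list their records in the same order, and it also masks anything that looks like an identifier inside a note.

```python
import difflib
import re

# Anything of the form @I123@ or @F45@ is treated as an ID or pointer.
XREF_RE = re.compile(r"@[^@\s]+@")


def normalize(lines):
    """Replace every ID/pointer with the same placeholder."""
    return [XREF_RE.sub("@ID@", line.rstrip("\r\n")) for line in lines]


def diff_ignoring_ids(old_lines, new_lines):
    """Return a unified diff of two GEDCOM line lists, ignoring IDs."""
    return list(
        difflib.unified_diff(normalize(old_lines), normalize(new_lines), lineterm="")
    )
```

Two files that differ only in their numbering produce an empty diff, while a changed name or date still shows up as a `-`/`+` pair.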

I’d love to check some, for personal goals, so if you can give us some links to experiment with, please do.

Sure. There are many options for coding assistants and I’m trying some I found by searching for “best ai coding assistants”. Here’s a quick list in no particular order, which may or may not be free:

GitHub Copilot, Amazon Q Developer, Cody, Replit, Tabnine, Codeium, Cursor, …

Which one you pick might depend on which IDE you want to integrate with, privacy and security considerations, what features it has, etc. First thing I do is disable/opt-out of using my queries and data for training. And needless to say, keep the context of this thread in mind :slight_smile:

Please share your discoveries.

I would expect ChatGPT to be better on Python than it is on GEDCOM. There is much more information about Python on the internet, so it most likely makes up a larger part of its training data (compared to GEDCOM).

And that’s exactly the problem that I have with AI, and why someone from the University of Cambridge wrote that it produces BS:

What I mean is that when you ask a human, an intelligent person may tell you that he or she does not have enough information about GEDCOM to write such a program, and/or ask you for more information about the subject (of GEDCOM) before going on.

ChatGPT however acts like a little child that has learned that it’s wrong to say no, and tries to please you with garbage, because you seem to be expecting something, whatever that is. And that’s not what I want, because it means that it’s wasting my time. And the cause of this is that it’s a language model, designed to produce results that look good, and nothing more.

I discovered this myself when I asked specific questions about local subjects not related to programming, where ChatGPT produced garbage. It grabbed texts, and put them together, but had no real clue.

I compare it with that little child, because when I informed it about an error, like not knowing that GEDCOM lines start with a number, it apologized, and produced another piece of code that still was no good.

And the main cause of this is that it’s a language model, that has no clue about the real gaps in its knowledge. And in fact, it has no clue at all.

A disturbing AI event happened on GitHub. The Latta AI project sic’d their bug fixing AI on open source projects with a 10 star rating. (The Gramps-project has 2.2k stars but the Taapeli/isotammi has 10 stars.)

Without invitation, the owners of Latta AI had it evaluate the code and submit PRs for those “10 star” projects. It looks like this was a test balloon before broadening the targets.

It appears as though GitHub suspended the LattaAI account and deleted the PRs.

Here is the disturbing part… the Latta AI site does not identify the owners of the site. But their “opt out” process requires submitting contact information before exposing the process. Feels like a phishing scam.

Does GitHub have a discussion forum where developers and contributors can be apprised of such things?

It appears that GitHub has changed the permissions for AIs to reference repositories in their domain. (Might have been a reaction to the Latta AI overreach.)

Perplexity (after much probing) confirms that reference to https://github.com/gramps-project/ from October 2023 (and before) was previously permitted. That is no longer the case. Where I could previously ask it to point out a line of code where something was done in the source, that no longer works.

Is there an alternative? Can GitHub Copilot be used for this purpose? If so, how can we use that AI without violating Gramps policies on AI?

2 posts were split to a new topic: Guidelines for using AI when documenting Gramps

When I use AI for any coding assistance (it’s rare, but I like it for certain complex data manipulation or debugging tasks) I like to just take whatever concept it came up with for solving the problem and then correct my code independently. It’s a process similar to consulting a Stack Overflow response for code guidance.

The one thing I think it is extremely useful for is in adding detailed logging code to my programs for testing. I always remove this code once I find the source of an error, but it saves so much time in finding said errors when you don’t have to type every logging message you could want by yourself.
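
For readers who have not seen this kind of temporary instrumentation, here is a small Python sketch using the standard `logging` module. The function `parse_level` and the logger name `debug_sketch` are hypothetical and exist only for this illustration.

```python
import logging

logger = logging.getLogger("debug_sketch")  # hypothetical logger name


def parse_level(line):
    """Return the leading level number of a line, logging each step."""
    logger.debug("parsing line %r", line)
    fields = line.split()
    if not fields or not fields[0].isdigit():
        logger.warning("no level number found in %r", line)
        return None
    level = int(fields[0])
    logger.debug("level=%d", level)
    return level


if __name__ == "__main__":
    # Switch the debug messages on only while hunting for an error.
    logging.basicConfig(level=logging.DEBUG)
    parse_level("1 NAME John /Smith/")
    parse_level("NAME without a level")
```

Because the messages go through a logger, an alternative to deleting them afterwards would be to leave them in and raise the log level, but that is a matter of taste.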
