Guidelines for using AI when documenting Gramps

Following up on the guidelines for the use of AI in the Gramps project:

What are the project’s guidelines for documentation and support?

I have several postings here on the Discourse forum where a question was posed to an AI in order to gather references and collate a cogent response. Those postings cite the AI, and they also tend to include the prompt (or a refinement of it), which allows people to re-use the prompt and continue the conversation with the AI.

AIs are timesavers when building collations of specific sources and re-organizing data in a logical manner. The PDF of our wiki could be improved by feeding the current PDF to an AI as source material and having it re-written to improve the flow and order of presentation.

Or merely presenting an editorial and proofreading critique.

And AIs are adept at reading Python code and producing a non-technical overview.

Our list of 5.2 features was massive. There still isn’t a collation of the changes.

The 5.2.0 release stages have changelogs here on Discourse (beta1, beta2, rc1, 5.2.0), on GitHub (beta1, beta2, rc1, 5.2.0), and on the spotty MantisBT (beta1, beta2, rc1 filters, and changelog 5.2.0). But average users only pay attention to the wiki. Unfortunately, the wiki only lists changes between the release candidate and the actual release. (And reference materials for even those features are not complete.)

So the 5.2.0 changelog appears anemic on the wiki. An AI could collate all four version changelogs and provide an update to this misleading blog entry.

It’s a great idea to use AI to help rewrite the descriptions to be more indicative of the actual improvement or fix. And if we want every change to be documented, we would ideally have a single source, i.e. MantisBT, instead of four.

BTW, should this topic about release notes/readme be split into a new topic? This thread about use of AI is getting awfully muddled.


Over the past several years, I wrote some papers for college on the ongoing AI copyright lawsuits. It's a favorite pet topic of mine. It's worth noting that my concept of copyright is very US-centric.

As far as AI goes, I think we should institute a blanket policy of no generative AI images being included with Gramps. There are a number of papers with detailed explanations of why models like Stable Diffusion are particularly problematic, but in my opinion the main difference between these and LLMs is in the way they are trained. Stable Diffusion is trained to attempt to reconstruct what it thinks is an existing image based on a text-based prompt. The class action complaint puts it better than I ever could:

(See bullet 75 in “Factual Allegations”)

Diffusion is a way for a machine-learning model to calculate how to reconstruct a copy of its Training Images. For each Training Image, a diffusion model finds the sequence of denoising steps to reconstruct that specific image. Then it stores this sequence of steps. The diagram above shows a spiral as an example. In practice, this training would be repeated for many images—likely millions or billions. A diffusion model is then able to reconstruct copies of each Training Image. Furthermore, being able to reconstruct copies of the Training Images is not an incidental side effect. The primary goal of a diffusion model is to reconstruct copies of the training data with maximum accuracy and fidelity to the Training Image. It is meant to be a duplicate.
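For anyone who wants to see what those "denoising steps" look like concretely, here is a toy Python sketch of the noise-prediction objective this family of models trains on. Everything in it (the cosine schedule, the 8x8 "image," the untrained predictor) is invented for illustration and is nothing like Stable Diffusion's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(x0, t, T=1000):
    """Forward process: blend the clean image x0 with Gaussian noise."""
    alpha_bar = np.cos(t / T * np.pi / 2) ** 2        # toy noise schedule
    eps = rng.standard_normal(x0.shape)               # the noise to recover
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps
    return x_t, eps

# Training repeatedly asks the model to recover the exact noise that was
# added to a specific training image; minimizing this loss over millions of
# images is what ties the weights back to the "Training Images" above.
x0 = rng.standard_normal((8, 8))        # stand-in for one training image
x_t, eps = add_noise(x0, t=500)
eps_pred = np.zeros_like(x_t)           # a hopeless untrained "model"
loss = np.mean((eps - eps_pred) ** 2)   # denoising (noise-prediction) loss
print(f"denoising loss: {loss:.3f}")
```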

LLMs, meanwhile, aim to find the most probable next token in a sequence of tokens. That is similar in some ways to Stable Diffusion, but not identical. Stable Diffusion stores what are essentially near-full vector representations of source material. If you write a detailed enough prompt, you can in fact come very close to spitting out the original source material.

That is always an issue with Stable Diffusion models, but an LLM, as long as a topic is sufficiently common, shouldn't recreate the source. The exception is small pieces of text that are repeated over and over across the internet, like quotes from research papers or parts of famous poems. These skew the probabilities and make reproduction much more likely.
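A toy illustration of that skew (the corpus and the fifty-fold repetition are made up): even a bare next-word frequency model will reproduce a sufficiently repeated phrase verbatim.

```python
from collections import Counter, defaultdict

# A phrase repeated 50 times dominates the next-word counts, so picking the
# most probable continuation reproduces it verbatim.
corpus = ("all the world is a stage " * 50 + "all the best ").split()

bigrams = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    bigrams[a][b] += 1

word, out = "all", ["all"]
for _ in range(5):
    word = bigrams[word].most_common(1)[0][0]  # most probable next word
    out.append(word)
print(" ".join(out))  # -> "all the world is a stage"
```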

In the case of code, however, the most oft-repeated code phrases are the ones that occur in documentation. These are unlikely to cause copyright issues. Really, though, public code is generally used only for the initial training of LLMs designed for code generation.

One of my side gigs is actually writing code to perform tasks based on user prompts, for use in LLM training. I not only write code from scratch, but I also make improvements to LLM-generated code for refining the model. This often involves tasks like commenting the code in a very specific format, according to certain guidelines. Even the tense we write these comments in is prescribed! If you notice that LLM code often has a very specific comment format, this is why. There are even training tasks that are meant to be completed with specific methods from specific libraries like pandas and numpy. This is where a lot of the actual functional training comes in that produces the code you see from models like, for instance, Claude.

This removes a lot of ambiguity about who owns code created using the models, since the company that owns the model also owns the important parts of the training data. I think putting in a claim based on the unimportant parts would be like claiming copyright infringement because someone posted a list of word frequencies for an essay you wrote. The model's terms of service are the gold standard for understanding how much of what is generated you own for LLMs.

Summarizing existing documentation, or even asking an LLM to generate documentation based on a list of bullet points, shouldn't cause issues either. However, just to be sure, we may want to consider requiring proof of non-plagiarism from a third-party plagiarism checker site, ideally a free one. We would just want to make sure that the rare instance of exact copying of a source text did not occur.
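As a sketch of what even a minimal, do-it-yourself version of such a check could look like (the sample sentences and the word-level matching are invented for illustration; a real plagiarism service does far more than this):

```python
import difflib

def longest_shared_run(generated: str, source: str) -> str:
    """Return the longest run of words appearing verbatim in both texts."""
    gen_words, src_words = generated.split(), source.split()
    m = difflib.SequenceMatcher(None, gen_words, src_words)
    match = m.find_longest_match(0, len(gen_words), 0, len(src_words))
    return " ".join(gen_words[match.a:match.a + match.size])

gen = "Gramps lets you record events, sources, and citations for each person"
src = "You can record events, sources, and citations for every individual"
# A long shared run like this would be worth a closer look:
print(longest_shared_run(gen, src))  # -> "record events, sources, and citations for"
```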

Asking an LLM to reformat documentation a certain way should not be restricted at all. If the words are kept the same and the format is changed to one requested by the LLM user, I’d bet a lot of money that there is no case morally or legally for copyright infringement.


I generally agree with @RenTheRoot’s advice. However, I see that these algorithms change every month (I work in the machine learning industry, at comet.com). The algorithms are constantly being altered to address concerns and be used in new areas.

Stable Diffusion (like many of the generative models) works in "latent space," not pixel space, so it isn't as straightforward as the complaint argues. In my opinion, all of the generative models have the same kinds of limitations/caveats to some degree.

Disclaimer: Stability AI is one of our customers.


Yeah, I can agree with this to a certain extent. I still think that generative models that aren't text-based tend to run into more copyright issues, for a variety of reasons. I think it would be great if someone created a diffusion model but paid artists for their work to be included as training data, like the code generators pay me.

I think this is not so much an issue if you have a prompt like "dog," because the latent-space embeddings are not going to be too close to any individual piece of source material. Where you start running into problems is with very specific prompts, where the amount of training data available to accommodate them is minimal. The embeddings and the original are just not far enough apart.
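A toy numeric sketch of that point (the vectors are random stand-ins, not real model embeddings): a common prompt's output blends many works and lands far from any single one, while a rare prompt backed by a single training example can land almost on top of it.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: near 0 = unrelated, near 1 = near-duplicate."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
source_work = rng.standard_normal(512)    # embedding of one training image
common_out = rng.standard_normal(512)     # "dog": blended from many works
rare_out = source_work + 0.05 * rng.standard_normal(512)  # one source covered it

print(f"common prompt vs. source: {cosine(common_out, source_work):.2f}")  # ~0.00
print(f"rare prompt vs. source:   {cosine(rare_out, source_work):.2f}")    # ~1.00
```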

Some of these postings are leaning towards jargon-heavy technobabble. That might be inevitable if you need to keep from writing publication-length articles.

Still, any guidelines are going to need to be readable by the beginners using AI for whatever purpose is being discussed. (Although expert-level guidelines are also needed.) So when deciding policy, the discussion should let those beginners participate. I don't know about others, but I'm beginning to feel excluded.

Could you please link new terminology to a definition for laymen? Maybe an AI proofreader could do that for you?

Sorry… didn’t mean to exclude anyone with the jargon!

This conversation is really about coming up with guidelines, not the guidelines themselves. Can we provide enough context in this discussion so that a layperson could understand and participate? Well, we can definitely add some links to help, but in reality it would be a lot of work.

My main point is: it is naive to say "X is probably OK, but don't use Y." And the technology is changing so quickly that it is hard to be very specific.

The guidelines should be general enough to avoid such issues.


Is there a glossary of AI terminology? I understand that new terms (or old terms from other areas of knowledge) are constantly being coined, particularly for bleeding-edge technology. (Both by marketing departments trying to create buzzwords for selling, and by research groups trying to reinforce an idea as patentable or to improve its chances of being published and read.)

Perhaps if you recognize that a posting is getting very dependent on the nuances between such terms, you could pass the posting through your favorite AI and have it add good links for laymen?

I wonder if Discourse has a way to export an entire thread, so it could be used as a training document? That way the AI could do a better assessment of whether a term was sufficiently introduced. In that case, adding a link would be a distraction rather than a help.

(Which sounds like a good feature for AI assistance in writing docs. Passing a copy of the Gramps offline manual PDF through as a training doc would give the AI better awareness of whether a term or feature needs a better foundation.)
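On the export question: Discourse does serve any topic as JSON if you append ".json" to the topic URL, which is enough to hand a thread to an AI as context. A minimal sketch (the topic slug and ID are placeholders, and long topics are paginated, so a real exporter would need to loop):

```python
import json
import urllib.request

# Fetch one page of a Discourse topic as JSON (topic slug/ID are placeholders).
url = "https://gramps.discourse.group/t/example-topic/1234.json"
with urllib.request.urlopen(url) as resp:
    topic = json.load(resp)

# Each post carries the author and the rendered HTML body ("cooked").
for post in topic["post_stream"]["posts"]:
    print(post["username"], "->", post["cooked"][:80])
```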

And my main point is that whether a type of AI tool should be used ought to depend on whether the builders of that tool are involved in ongoing copyright-related litigation, or whether they are known to do most of their training using data they do not own. The exception is for tasks unrelated to the content of the training data, such as reformatting text.

