Guidelines for using AI when documenting Gramps

Following up on the guidelines for the use of AI in the Gramps project:

What are the project’s guidelines for documentation and support?

I have several postings here on the Discourse forum where a question was posed to an AI in order to gather references and collate a cogent response. Those postings cite the AI, and they also tend to include the prompt (or a refinement of it), which allows people to re-use the prompt and continue the conversation with the AI.

AIs are timesavers when building collations of specific sources and re-organizing data in a logical manner. The PDF of our wiki could be improved by feeding the current PDF to an AI as source material and having it re-written to improve the flow and order of presentation.

Or merely presenting an editorial and proofreading critique.

And AIs are adept at reading Python code and producing a non-technical overview.

Our list of 5.2 features was massive. There still isn’t a collation of the changes.

The 5.2.0 release stages have changelogs here on Discourse (beta1, beta2, rc1, 5.2.0), on GitHub (beta1, beta2, rc1, 5.2.0), and on the spotty MantisBT (beta1, beta2, rc1 filters, and changelog 5.2.0). But average users only pay attention to the wiki. Unfortunately, the wiki only lists changes between the release candidate and the actual release. (And reference materials for even those features are not complete.)

So the 5.2.0 changelog appears anemic on the wiki. An AI could collate all four version changelogs and provide an update to this misleading blog entry.

It’s a great idea to use AI to help rewrite the descriptions to be more indicative of the actual improvement or fix. And if we want every change to be documented, we would ideally have a single source, i.e. MantisBT, instead of four.

BTW, should this topic about release notes/readme be split into a new topic? This thread about use of AI is getting awfully muddled.


Over the past several years, I wrote some papers for college on the ongoing AI copyright lawsuits. It's a favorite pet topic of mine. It's worth noting that my concept of copyright is very US-centric.

As far as AI goes, I think we should institute a blanket policy of no generative AI images being included with Gramps. There are a number of papers with detailed explanations of why models like Stable Diffusion are particularly problematic, but in my opinion the main difference between these and LLMs is in the way they are trained. Stable Diffusion is trained to attempt to reconstruct what it thinks is an existing image based on a text-based prompt. The class action complaint puts it better than I ever could:

(See bullet 75 in “Factual Allegations”)

Diffusion is a way for a machine-learning model to calculate how to reconstruct a copy of its Training Images. For each Training Image, a diffusion model finds the sequence of denoising steps to reconstruct that specific image. Then it stores this sequence of steps. The diagram above shows a spiral as an example. In practice, this training would be repeated for many images—likely millions or billions. A diffusion model is then able to reconstruct copies of each Training Image. Furthermore, being able to reconstruct copies of the Training Images is not an incidental side effect. The primary goal of a diffusion model is to reconstruct copies of the training data with maximum accuracy and fidelity to the Training Image. It is meant to be a duplicate.
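For anyone who wants to see what those "denoising steps" look like concretely, here is a toy Python sketch of the noise-prediction objective this family of models trains on. Everything in it (the cosine schedule, the 8x8 "image," the untrained predictor) is invented for illustration and is nothing like Stable Diffusion's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(x0, t, T=1000):
    """Forward process: blend the clean image x0 with Gaussian noise."""
    alpha_bar = np.cos(t / T * np.pi / 2) ** 2        # toy noise schedule
    eps = rng.standard_normal(x0.shape)               # the noise to recover
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps
    return x_t, eps

# Training repeatedly asks the model to recover the exact noise that was
# added to a specific training image; minimizing this loss over millions of
# images is what ties the weights back to the "Training Images" above.
x0 = rng.standard_normal((8, 8))        # stand-in for one training image
x_t, eps = add_noise(x0, t=500)
eps_pred = np.zeros_like(x_t)           # a hopeless untrained "model"
loss = np.mean((eps - eps_pred) ** 2)   # denoising (noise-prediction) loss
print(f"denoising loss: {loss:.3f}")
```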

LLMs, meanwhile, aim to find the most probable next token in a sequence of tokens. That is similar in some ways to Stable Diffusion, but not identical. Stable Diffusion stores what are essentially near-full vector representations of source material. If you write a detailed enough prompt, you can in fact come very close to spitting out the original source material.

That is always an issue with Stable Diffusion models, but an LLM, as long as a topic is sufficiently common, shouldn't recreate the source. The exception is small pieces of text that are repeated over and over across the internet, like quotes from research papers or parts of famous poems. These skew the probabilities and make reproduction much more likely.
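A toy illustration of that skew (the corpus and the fifty-fold repetition are made up): even a bare next-word frequency model will reproduce a sufficiently repeated phrase verbatim.

```python
from collections import Counter, defaultdict

# A phrase repeated 50 times dominates the next-word counts, so picking the
# most probable continuation reproduces it verbatim.
corpus = ("all the world is a stage " * 50 + "all the best ").split()

bigrams = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    bigrams[a][b] += 1

word, out = "all", ["all"]
for _ in range(5):
    word = bigrams[word].most_common(1)[0][0]  # most probable next word
    out.append(word)
print(" ".join(out))  # -> "all the world is a stage"
```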

In the case of code, however, the most oft-repeated code phrases are the ones that occur in documentation. These are unlikely to cause copyright issues. Really, though, public code is generally used only for the initial training of LLMs designed for code generation.

One of my side gigs is actually writing code to perform tasks based on user prompts, for use in LLM training. I not only write code from scratch, but I also make improvements to LLM-generated code for refining the model. This often involves tasks like commenting the code in a very specific format, according to certain guidelines. Even the tense we write these comments in is prescribed! If you notice that LLM code often has a very specific comment format, this is why. There are even training tasks that are meant to be completed with specific methods from specific libraries like pandas and numpy. This is where a lot of the actual functional training comes in that produces the code you see from models like, for instance, Claude.

This removes a lot of ambiguity about who owns code created using the models, since the company that owns the model also owns the important parts of the training data. I think putting in a claim based on the unimportant parts would be like claiming copyright infringement because someone posted a list of word frequencies for an essay you wrote. The model's terms of service are the gold standard for understanding how much of what is generated you own for LLMs.

Summarizing existing documentation, or even asking an LLM to generate documentation based on a list of bullet points, shouldn't cause issues either. However, just to be sure, we may want to consider requiring proof of non-plagiarism from a third-party plagiarism checker site, ideally a free one. We would just want to make sure that the rare instance of exact copying of a source text did not occur.
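As a sketch of what even a minimal, do-it-yourself version of such a check could look like (the sample sentences and the word-level matching are invented for illustration; a real plagiarism service does far more than this):

```python
import difflib

def longest_shared_run(generated: str, source: str) -> str:
    """Return the longest run of words appearing verbatim in both texts."""
    gen_words, src_words = generated.split(), source.split()
    m = difflib.SequenceMatcher(None, gen_words, src_words)
    match = m.find_longest_match(0, len(gen_words), 0, len(src_words))
    return " ".join(gen_words[match.a:match.a + match.size])

gen = "Gramps lets you record events, sources, and citations for each person"
src = "You can record events, sources, and citations for every individual"
# A long shared run like this would be worth a closer look:
print(longest_shared_run(gen, src))  # -> "record events, sources, and citations for"
```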

Asking an LLM to reformat documentation a certain way should not be restricted at all. If the words are kept the same and the format is changed to one requested by the LLM user, I’d bet a lot of money that there is no case morally or legally for copyright infringement.


I generally agree with @RenTheRoot’s advice. However, I see that these algorithms change every month (I work in the machine learning industry, at comet.com). The algorithms are constantly being altered to address concerns and be used in new areas.

Stable Diffusion (like many of the generative models) works in "latent space," not pixel space, so it isn't as straightforward as the complaint argues. In my opinion, all of the generative models have the same kinds of limitations/caveats to some degree.

Disclaimer: Stability AI is one of our customers.


Yeah, I can agree with this to a certain extent. I still think that generative models that aren't text-based tend to run into more copyright issues, for a variety of reasons. I think it would be great if someone created a diffusion model but paid artists for their work to be included as training data, like the code generators pay me.

I think this is not so much an issue if you have a prompt like "dog," because the latent-space embeddings are not going to be too close to any individual piece of source material. Where you start running into problems is with very specific prompts, where the amount of training data available to accommodate them is minimal. The embeddings and the original are just not far enough apart.
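A toy numeric sketch of that point (the vectors are random stand-ins, not real model embeddings): a common prompt's output blends many works and lands far from any single one, while a rare prompt backed by a single training example can land almost on top of it.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: near 0 = unrelated, near 1 = near-duplicate."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
source_work = rng.standard_normal(512)    # embedding of one training image
common_out = rng.standard_normal(512)     # "dog": blended from many works
rare_out = source_work + 0.05 * rng.standard_normal(512)  # one source covered it

print(f"common prompt vs. source: {cosine(common_out, source_work):.2f}")  # ~0.00
print(f"rare prompt vs. source:   {cosine(rare_out, source_work):.2f}")    # ~1.00
```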

Some of these postings are leaning towards jargon-heavy technobabble. That might be inevitable if you need to keep from writing publication-length articles.

Still, any guidelines are going to need to be readable by the beginners using AI for whatever purpose is being discussed. (Although expert-level guidelines are also needed.) So when deciding policy, the discussion should let those beginners participate. I don't know about others, but I'm beginning to feel excluded.

Could you please link new terminology to a definition for laymen? Maybe an AI proofreader could do that for you?

Sorry… didn’t mean to exclude anyone with the jargon!

This conversation is really about coming up with guidelines, not the guidelines themselves. Can we provide enough context in this discussion so that a layperson could understand and participate? Well, we can definitely add some links to help, but in reality it would be a lot of work.

My main point is: it is naive to say "X is probably OK, but don't use Y." And the technology is changing so quickly that it is hard to be very specific.

The guidelines should be general enough to avoid such issues.


Is there a glossary of AI terminology? I understand that new terms (or old terms from other areas of knowledge) are constantly being coined, particularly for bleeding-edge technology. (Both by marketing departments trying to create buzzwords for selling, and by research groups trying to reinforce an idea as patentable or to improve its chances of being published and read.)

Perhaps if you recognize that a posting is getting very dependent on the nuances between such terms, you could pass the posting through your favorite AI and have it add good links for laymen?

I wonder if Discourse has a way to export an entire thread, so it could be used as a training document? That way the AI could do a better assessment of whether a term was sufficiently introduced. In that case, adding a link would be a distraction rather than a help.

(Which sounds like a good feature for AI assistance in writing docs. Passing a copy of the Gramps offline manual PDF through as a training doc would give the AI better awareness of whether a term or feature needs a better foundation.)
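On the export question: Discourse does serve any topic as JSON if you append ".json" to the topic URL, which is enough to hand a thread to an AI as context. A minimal sketch (the topic slug and ID are placeholders, and long topics are paginated, so a real exporter would need to loop):

```python
import json
import urllib.request

# Fetch one page of a Discourse topic as JSON (topic slug/ID are placeholders).
url = "https://gramps.discourse.group/t/example-topic/1234.json"
with urllib.request.urlopen(url) as resp:
    topic = json.load(resp)

# Each post carries the author and the rendered HTML body ("cooked").
for post in topic["post_stream"]["posts"]:
    print(post["username"], "->", post["cooked"][:80])
```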

And my main point is that whether a type of AI tool should be used ought to depend on whether the builders of that tool are involved in ongoing copyright-related litigation, or whether they are known to do most of their training using data they do not own. The exception is for tasks unrelated to the content of the training data, such as reformatting text.

