Over the past several years I wrote some papers for college on AI and the copyright lawsuits that are going on. It's a favorite pet topic of mine. It's worth noting that my concept of copyright is very US-centric.
As far as AI goes, I think we should institute a blanket policy of no generative AI images being included with Gramps. There are a number of papers with detailed explanations of why models like Stable Diffusion are particularly problematic, but in my opinion the main difference between these and LLMs is in the way they are trained. Stable Diffusion is trained to attempt to reconstruct what it thinks is an existing image based on a text prompt. The class action complaint puts it better than I ever could:
(See bullet 75 in “Factual Allegations”)
Diffusion is a way for a machine-learning model to calculate how to reconstruct a copy of its Training Images. For each Training Image, a diffusion model finds the sequence of denoising steps to reconstruct that specific image. Then it stores this sequence of steps. The diagram above shows a spiral as an example. In practice, this training would be repeated for many images—likely millions or billions. A diffusion model is then able to reconstruct copies of each Training Image. Furthermore, being able to reconstruct copies of the Training Images is not an incidental side effect. The primary goal of a diffusion model is to reconstruct copies of the training data with maximum accuracy and fidelity to the Training Image. It is meant to be a duplicate.
LLMs, meanwhile, aim to find the most probable next token in a sequence of tokens. That is similar in some ways to Stable Diffusion, but not identical. Stable Diffusion stores what are essentially near-complete vector representations of its source material. If you write a detailed enough prompt, you can in fact come very close to spitting out the original source material.
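For contrast with the diffusion objective above, here is the next-token objective in miniature. The vocabulary and logits are made up for illustration; a real model produces the scores with billions of parameters, but the final step is the same.

```python
import torch

vocab = ["the", "cat", "sat", "on", "mat"]
logits = torch.tensor([0.2, 0.1, 2.5, 0.4, 0.3])   # made-up scores for the token after "the cat"
probs = torch.softmax(logits, dim=0)               # scores -> probability distribution
print(vocab[int(torch.argmax(probs))])             # "sat": the most probable next token
```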
Verbatim reproduction of source material is always a risk with Stable Diffusion models, but with LLMs, as long as a topic is sufficiently common, the model shouldn't recreate the source. The exception is small pieces of text that are repeated over and over across the internet, like quotes from research papers or parts of famous poems. These skew the probabilities enough to make reproduction much more likely.
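A toy count-based model shows why repetition matters. This is not how a real LLM stores text, but the probability skew works the same way: once one continuation dominates the training data, greedy decoding reproduces the memorized phrase word for word.

```python
from collections import Counter, defaultdict

# Imagine a famous quote repeated across hundreds of scraped pages,
# plus a couple of unrelated sentences sharing the same prefix.
corpus = ["to be or not to be"] * 500 + [
    "to be fair it works",
    "to be clear this differs",
]

next_word = defaultdict(Counter)
for line in corpus:
    words = line.split()
    for i in range(len(words) - 2):
        next_word[" ".join(words[i:i + 2])][words[i + 2]] += 1

# "or" outnumbers every other continuation of "to be" 500 to 1,
# so the model all but inevitably continues the quote verbatim.
print(next_word["to be"].most_common())  # [('or', 500), ('fair', 1), ('clear', 1)]
```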
In the case of code, however, the most oft-repeated snippets are the ones that occur in documentation, and these are unlikely to cause copyright issues. Really, though, public code is generally used only for the initial training of LLMs designed for code generation.

One of my side gigs is actually writing code to perform tasks based on user prompts for use in LLM training. I not only write code from scratch, but I also make improvements to LLM-generated code for refining the model. This often involves tasks like commenting the code in a very specific format, according to certain guidelines. Even the tense we write these comments in is prescribed! If you notice that LLM code often has a very specific comment format, this is why. There are even training tasks that are meant to be completed with specific methods from specific libraries like pandas and numpy. This is where a lot of the actual functional training comes in that produces the code you see from models like, for instance, Claude.

This removes a lot of ambiguity about who owns code created using the models, since the company that owns the model also owns the important parts of the training data. I think putting in a claim based on the unimportant parts would be like claiming copyright infringement because someone posted a list of word frequencies for an essay you wrote. The model's terms of service are the gold standard for understanding how much of what an LLM generates you own.
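To give a flavor of what those guidelines look like, here is a hypothetical example in the spirit of the ones I work under (the real guidelines are project-specific, so treat the details as invented): present-tense comments, one comment per logical step, and vectorized pandas/numpy methods preferred over loops.

```python
import numpy as np
import pandas as pd

def summarize_sales(df: pd.DataFrame) -> pd.DataFrame:
    """Returns total and average sales per region."""
    # Groups the rows by region to aggregate per-region statistics.
    grouped = df.groupby("region")["sales"]
    # Computes the total and mean in a single vectorized aggregation.
    summary = grouped.agg(total="sum", average="mean")
    # Rounds the averages for readability in downstream reports.
    summary["average"] = np.round(summary["average"], 2)
    return summary

sales = pd.DataFrame({"region": ["NA", "NA", "EU"], "sales": [100, 250, 90]})
print(summarize_sales(sales))
```

Note the prescribed present tense ("Groups", "Computes", "Rounds"): conventions like these are why LLM-generated code tends to have such a recognizable comment style.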
Summarizing existing documentation, or even asking an LLM to generate documentation from a list of bullet points, shouldn't cause issues either. However, just to be safe, we may want to consider requiring proof of non-plagiarism from a third-party plagiarism-checker site, ideally a free one. We would just want to make sure that the rare instance of exact copying of a source text did not occur.
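If we wanted a lightweight check of our own before reaching for a third-party service, even something as simple as an exact n-gram overlap test would catch verbatim copying. Real plagiarism checkers are far more sophisticated; this sketch only flags word-for-word matches.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Every run of n consecutive words in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def exact_overlap(generated: str, source: str, n: int = 8) -> float:
    """Fraction of the generated text's n-grams found verbatim in the source."""
    gen = ngrams(generated, n)
    return len(gen & ngrams(source, n)) / len(gen) if gen else 0.0

# Long shared n-grams are rare by chance, so any score well above zero
# suggests copied passages that deserve a manual look.
doc = "Gramps lets you build organize and share your family tree with ease"
src = "Gramps lets you build organize and share your family tree online"
print(exact_overlap(doc, src))  # 0.6: three of the five 8-grams match
```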
Asking an LLM to reformat documentation a certain way should not be restricted at all. If the words are kept the same and only the format is changed to one the user requested, I'd bet a lot of money that there is no case, morally or legally, for copyright infringement.