ISSUE 53 Expert Witness Journal - Journal - Page 40
A significant point of contention in the realm of
copyright law is the belief that LLMs store and reproduce copyrighted works. This is a misconception.
LLMs do not store individual works of text in their
entirety. Instead, they process and analyze the text
during training, extracting patterns, syntactic structures, and linguistic nuances. The model's outputs are
generated based on this learned information, not by
retrieving specific texts from its training data. It's akin
to a student who reads numerous books and articles to
write an essay; the student doesn't reproduce sections
of these texts verbatim but uses the understanding
gained from reading to produce original work.
The training process fundamentally changes the nature of
the input data.
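The point can be illustrated with a deliberately simplified sketch. The snippet below (Python; the three-sentence corpus is invented for illustration) "trains" by tallying which word follows which across all documents. What survives training is a pooled table of statistics, not any document in its entirety. A real LLM learns millions of neural-network weights rather than counts, but the principle is similar: aggregate patterns are retained, individual texts are not.

```python
from collections import Counter, defaultdict

# Toy illustration only (not how a real LLM works): "training" here just
# tallies word-to-next-word counts pooled across all documents.
# The corpus is invented for illustration.
corpus = [
    "the court examined the evidence carefully",
    "the court adjourned until the morning",
    "the evidence supported the claim",
]

model = defaultdict(Counter)
for doc in corpus:
    words = doc.split()
    for current, nxt in zip(words, words[1:]):
        model[current][nxt] += 1

# What the "model" retains is aggregate statistics merged across the
# whole corpus; no single document can be read back out of it.
print(dict(model["the"]))  # {'court': 2, 'evidence': 2, 'morning': 1, 'claim': 1}
```

Note that the counts for "the" mix contributions from all three documents at once; the original sentences are no longer recoverable from the table.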
One area of weakness in the case could be the lack of
clear legal precedent or guidelines for AI training
practices. The current legal framework around copyright was not designed with AI in mind, particularly
for technologies like LLMs that learn from extensive
datasets. This gap in the law creates ambiguity around
what constitutes fair use in the context of AI.
Another potential weakness lies in the argument about
the extent to which LLMs can create truly independent works. While the output is not a direct copy of
the training data, it is undoubtedly influenced by it.
Determining the line between influence and copying
is challenging and may require a nuanced legal and
technical analysis.
The distinction between direct copying and learning
from examples to generate something new is at the
heart of copyright discussions surrounding LLMs.
While the training data includes copyrighted works,
the model's output is not a direct retrieval of these
works but an independently generated piece based on
patterns learned from the entire corpus of training
data. This process raises complex legal questions
about derivative works and fair use, which are
currently being debated in the courts.
The case also touches on the broader issue of
accountability in AI development. As LLMs become
more sophisticated, the responsibility of AI developers
to ethically source and use training data is increasingly
under scrutiny. This aspect of the case highlights the
need for more transparent AI practices and possibly
new legal frameworks to address these emerging
technologies.
In the specific cases referenced, the legal arguments
hinge on whether the use of copyrighted texts in training LLMs constitutes fair use and whether the outputs of
these models are independent creations or derivative works. As LLMs continue to evolve
and their applications become more widespread, these
questions will likely become central to the ongoing dialogue between technology and copyright law. The
resolution of these cases will set important precedents
and may necessitate changes in how we understand
and legislate around AI and intellectual property.
Finally, it is useful to see how the Times persuaded
ChatGPT to regurgitate something close to one of its
articles. LLMs work by predicting which word
(referred to in the industry as a token) will come next after your
prompt. Once you are a few words into a document,
the most likely next word is often unique to, or closely determined by,
whatever was in the original document. So, for example, I gave ChatGPT the following prompt:
In the following write 10 words to follow the words I
provide
1) To
2) To be
3) To be or
And it responded:
1) To explore new technologies is to understand
future possibilities.
2) To be engaged in continuous learning is always
beneficial.
3) To be or not to be, that is the question.
This is because, across almost the entire internet,
nearly every time the three words 'To be or' appear,
they introduce Hamlet's famous soliloquy.
The New York Times probably had to provide many
more than three words to prompt ChatGPT into reproducing a facsimile of one of its articles, but it is clear
from the complaint that this was the technique used.
If a teacher stood at the front of a class and asked a
student to complete the phrase 'To be or', and the student then said 'To be or not to be, that is the question',
it would be invidious for the teacher to accuse the
student of plagiarism. Prompting ChatGPT
with large sections of articles from the New York
Times and then complaining that the LLM reproduced chunks of them is equally dubious.
NY Times Vs. OpenAI et al. (the news case)
The New York Times v. OpenAI case presents a fascinating intersection between artificial intelligence and
copyright law, specifically focusing on the use of copyrighted material in the training of Large Language
Models (LLMs) such as GPT-3. This case raises pivotal
questions about the mechanics of AI training and
their legal implications.
A common misconception in this case is the belief that
LLMs directly store and reproduce copyrighted content from their training data. Unlike a database that
retrieves specific stored information, LLMs like GPT-3 process vast amounts of text to learn linguistic patterns, structures, and contextual relationships. The
model's output, while influenced by its training data,
does not equate to directly retrieving or copying individual texts. This distinction is crucial: the model generates new content based on learned patterns, not by
accessing specific stored works.
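The 'To be or' effect can be sketched with a toy next-word predictor. In the snippet below (Python; the miniature corpus is invented and stands in for the web, where those three words almost always introduce Hamlet's soliloquy), "training" counts which word follows each run of three words, and completion greedily extends a prompt with the most likely continuation. Real LLMs use neural networks over far longer contexts, but the mechanism of completing a distinctive prefix is analogous.

```python
from collections import Counter, defaultdict

# Toy corpus (invented): Hamlet's line appears twice, a distractor once,
# standing in for the web where 'to be or' is overwhelmingly followed
# by the soliloquy.
corpus = (
    "to be or not to be that is the question "
    "to be or not to be that is the question "
    "it is useful to be honest and to be fair"
).split()

# "Train": count which word follows each run of three words.
model = defaultdict(Counter)
for i in range(len(corpus) - 3):
    context = tuple(corpus[i:i + 3])
    model[context][corpus[i + 3]] += 1

def complete(prompt, length=7):
    """Greedily extend the prompt, one most-likely word at a time."""
    words = prompt.split()
    for _ in range(length):
        followers = model.get(tuple(words[-3:]))
        if not followers:
            break  # context never seen during training
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)

print(complete("to be or"))  # to be or not to be that is the question
```

Given the distinctive prefix, the greedy predictor has essentially no choice but to reproduce the famous line, which is the behaviour the Times elicited with much longer prefixes from its articles.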
However, the case's complexity lies in understanding
what constitutes fair use of copyrighted material in AI
training. The New York Times argues that OpenAI's
use of its copyrighted content for training GPT-3
goes beyond fair use, primarily because OpenAI benefits commercially from the model. On the other
hand, OpenAI might contend that the use of the material is transformative, a key factor in fair use, as the
EXPERT WITNESS JOURNAL | 38 | February 2024