Is it legal to use copyrighted works to train LLMs?

Two widely reported court rulings in San Francisco found that using copyrighted works to train Large Language Models (LLMs) fell within the law’s “fair use” provisions. The fact that writers brought these lawsuits shows that they felt wronged by the use of their works to train LLMs without their permission. What should we make of those rulings?

The first ruling, issued by a US federal judge, found that Anthropic made fair use of the writers’ works in training its Claude LLM. Similarly, in the second case, a US district judge ruled in favor of Meta and dismissed the writers’ claims.

Is LLM training fair use?

Does the use of copyrighted material to train LLMs fall within the US copyright law’s provisions of fair use? To answer this, let’s examine one by one the four factors the law prescribes for making fair use decisions.

  1. The purpose and character of the use, including whether such use is of a commercial nature. Here, consider that Anthropic is registered as a public-benefit corporation, Meta’s Llama model is openly available at no cost, and OpenAI Inc., which owns OpenAI Global, LLC, is a nonprofit organization. One might think that these organizations have specifically structured their legal status and their offerings in a way that makes them comply with this provision.

    Another consideration associated with this factor is whether the use is transformative. The training of LLMs with copyrighted works is indeed highly transformative; it is impossible to find the work as such within the model’s weights. (On the other hand, some LLMs have been successfully prompted to spew out verbatim parts of copyrighted works, which is indeed an issue.)

  2. The nature of the copyrighted work. A key goal of this provision is to avoid the private ownership of facts and ideas, which rightfully belong in the public domain. Here one can argue that when a copyrighted work is used for training an LLM, the work’s contents (together with those of countless other works) are indeed distilled into facts and ideas expressed as model weights. The expression of these facts and ideas, which is what copyright protects, is not directly involved in the training; it is only indirectly made part of the LLM, as weights that specify language output (good writing) in general.

    LLMs are also trained on large amounts of copyrighted content that is openly available as open-source software or under Creative Commons licenses, a fact that may also weigh in favor of this material’s use.

  3. The amount and substantiality of the portion used in relation to the copyrighted work as a whole. Here, those arguing for the fair use of copyrighted works in LLMs can claim that, taking into account an LLM’s trillions of parameters, a single copyrighted work forms an insubstantial part of it.

  4. The effect of the use upon the potential market for or value of the copyrighted work. For copyrighted works available at a cost, I believe that in the long term their market and value will decline with the widespread availability of LLMs, especially in the case of non-fiction books. (Fiction books and scientific articles have attributes that LLMs cannot currently supplant.) However, this effect is indirect and currently unproven, so those arguing for fair use can claim that it does not yet exist. For open-source software and open-content material, the effect on the market and value of the copyrighted work is obviously very small, if any.

Given the above, and with the caveat that I am not a lawyer, my understanding is that the use of copyrighted works to train LLMs indeed falls within the US copyright law’s “fair use” provisions, a conclusion echoed by the two recent court rulings.

Should LLM training be fair use?

Generative AI profoundly changes the way we access knowledge and ideas, as well as how we generate new content. Once copyrighted works have been used to train LLMs, generative AI can render many of them useless, because in the long term many people will simply prompt generative AI for answers rather than purchase the copyrighted works on which the LLMs were trained. The AI answers that Google’s search engine now often supplies are a prime example of this trend.

The widespread use of AI in place of copyrighted works can destroy the incentives (primarily royalties and recognition) that content creators currently enjoy. In 2023, as generative AI was gaining a foothold, I argued that this change deforests the knowledge ecosystem through the collapse of the corresponding incentives, industries, and activities, ultimately leading to its desertification.

Consequently, copyright law must be revised with specific provisions regarding the use of copyrighted material to train LLMs. These should expressly specify that trained LLMs are derivative works requiring appropriate licensing of their copyrighted source material. Such licensing could ensure proper attribution and compensation of the source material’s authors and copyright holders in a way that preserves their incentives for the continued production of new content. Such a change would be analogous to the private copying levies, which many countries introduced to cover the fair use enabled by another innovation that was disruptive in its time: the widespread reproduction of music on audio cassettes.



Last modified: Thursday, June 26, 2025 3:07 pm

Unless otherwise expressly stated, all original material on this page created by Diomidis Spinellis is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.