OpenAI says it’s “impossible” to create useful AI models without copyrighted material

Apparently, stealing other people's work to create a product for money is now "fair use" according to OpenAI, because they are "innovating" (stealing). Yeah. Move fast and break things, huh?

"Because copyright today covers virtually every sort of human expression—including blogposts, photographs, forum posts, scraps of software code, and government documents—it would be impossible to train today’s leading AI models without using copyrighted materials," wrote OpenAI in the House of Lords submission.

OpenAI claimed that the authors in that lawsuit "misconceive[d] the scope of copyright, failing to take into account the limitations and exceptions (including fair use) that properly leave room for innovations like the large language models now at the forefront of artificial intelligence."

251 comments
  • I will repeat what I have proffered before:

    If OpenAI stated that it is impossible to train leading AI models without using copyrighted material, then, unpopular as it may be, the preemptive pragmatic solution should be pretty obvious: enter into commercial arrangements for access to said copyrighted material.

    Claiming a failure to do so in circumstances where the subsequent commercial product directly competes in a market seems disingenuous at best, given what I assume is the purpose of copyrighted material, that being to set the terms under which public facing material can be used. Particularly if regurgitation of copyrighted material seems to exist in products inadequately developed to prevent such a simple and foreseeable situation.

    Yes, I am aware of the USA concept of fair use, but the test of that should be manifestly reciprocal. For example, would Meta allow what it did to MySpace (hack and allow easy user transfer), or Google allow the scraping of YouTube?

    To me it seems Big Tech wants to have its cake and eat it, where investor $$$ are used to corrupt open markets, undermine fundamental democratic State social institutions, manipulate legal processes, and undermine basic consumer rights.

    • Agreed.

      There is nothing "fair" about the way OpenAI steals other people's work. ChatGPT is being monetized all over the world and the large number of people whose work has not been compensated will never see a cent of that money.

      At the same time the LLM will be used to replace (at least some of) the people who created those works in the first place.

      Tech bros are disgusting.

    • I suspect the US government will allow OpenAI to continue doing as it pleases to keep its competitive advantage in AI over China (which doesn't have a problem with using copyrighted materials to train its models). They already limit selling AI-related hardware to keep their competitive advantage, so why stop there? Might as well allow OpenAI to continue using copyrighted materials to keep the competitive advantage.

    • So why is so much information (data) freely available on the internet? How do you expect a human artist to learn drawing, if not looking at tutorials and improving their skills through emulating what they see?

    • Yep, completely agree.

      Case in point: Steam has recently clarified their policies on using such AI-generated material, which draws on essentially billions of both copyrighted and non-copyrighted text and images.

      To publish a game on Steam that uses AI gen content, you now have to verify that you as a developer are legally authorized to use all training material for the AI model for commercial purposes.

      This also applies to code and code snippets generated by AI tools that function similarly, such as CoPilot.

      So yeah, sorry, either gotta use MIT-licensed open source code or write your own, and you gotta do your own art.

      I imagine this would also prevent you from using AI-generated voice lines where you trained the model on basically anyone who did not explicitly consent to this, but voice gen software that doesn't use the 'train the model on human speakers' approach would probably be fine, assuming you have the relevant legal rights to use such software commercially.

      Not 100% sure this is Steam's policy on voice gen stuff, they focused mainly on art, dialogue, and code in their latest policy update, but the logic seems to work out to this conclusion.

    • Copyright protects the original artist, for a limited time and in limited circumstances, against others copying or distributing the original work, or creating derivative works. Copyright does not protect against a particular entity consuming the work. Limitation on consumption is antithetical to copyright law.

      The fundamental purpose of copyright is to promote the progress of science and the useful arts. To expand the collective body of knowledge. Consumption of intellectual works is not restricted by copyright. Even if you know that the particular copy of a book was produced by a pirate in violation of the author's copyright, your consumption of that work is not an infringement.

      Knowing that the 13th word of the Gettysburg Address is "continent", and that the preceding and following words are "this" and "a" does not constitute copying, distribution, or creation of a derivative work. Knowledge of the underlying work is not an infringement.

      Quite the contrary, the specific purpose of intellectual property laws is to promote the progress of sciences and useful arts. To expand society's collective body of knowledge. "Fair Use" is not an exemption. "Fair Use" is the purpose. The temporary and narrow limitations on free use are the means by which the law encourages writers and inventors to publish.

      If AI is considered a "progress in the sciences and useful arts", then, unpopular as it may be, the preemptive, pragmatic solution should be pretty obvious: clarify that Fair Use Doctrine explicitly protects this activity.

  • Some relevant comments from Ars:

    leighno5

    The absolute hubris required for OpenAI here to come right out and say, 'Yeah, we have no choice but to build our product off the exploitation of the work others have already performed' is stunning. It's about as perfect a representation of the tech bro mindset that there can ever be. They didn't even try to approach content creators in order to do this, they just took what they needed because they wanted to. I really don't think it's hyperbolic to compare this to modern day colonization, or worker exploitation. 'You've been working pretty hard for a very long time to create and host content, pay for the development of that content, and build your business off of that, but we need it to make money for this thing we're building, so we're just going to fucking take it and do what we need to do.'

    The entitlement is just...it's incredible.

    4qu4rius

    20 years ago, high school kids were sued for millions & years in jail for downloading a single Metallica album (if I remember correctly, minimum damage in the US was something like 500k$ per song).

    All of a sudden, just because they are the dominant ones doing the infringement, they should be allowed to scrape the entire (digital) human knowledge? Funny (or not) how the law always benefits the rich.

  • Any reasonable person can reach the conclusion that something is wrong here.

    What I'm not seeing a lot of acknowledgement of is who really gets hurt by copyright infringement under the current U.S. scheme. (The quote is obviously directed toward the UK, but I'm reasonably certain a similar situation exists there.)

    Hint: It's rarely the creators, who usually get paid once while their work continues to make money for others.

    Let's say the New York Times wins its lawsuit. Do you really think the reporters who wrote the infringed-upon material will be getting royalty checks to be made whole?

    This is not OpenAI vs creatives. OK, on a basic level it is, but expecting no one to scrape blogs and forum posts rather goes against the idea of the open internet in the first place. We've all learned by now that what goes on the internet stays there, with attribution totally optional unless you have a legal department. What's novel here is the scale of scraping, but I see some merit to the "transformational" fair-use defense given that the ingested content is not being reposted verbatim.

    This is corporations vs corporations. Framing it as millions of people missing out on what they'd have otherwise rightfully gotten is disingenuous.

  • ...so stop doing it!

    This explains why Valve was, until recently, not so cavalier about AI: They didn't want to be left holding the bag on copyright matters outside of their domain.

  • As with many things, the golden rule applies. They who have the gold, make the rules.

  • I think viral outrage aside, there is a very open question about what constitutes fair use in this application. And I think the viral outrage misunderstands the consequences of enforcing the notion that you can't use openly scrapable online data to build ML models.

    Effectively what the copyright argument does here is make it so that ML models are only legally allowed to be made by Meta, Google, Microsoft and maybe a couple of other companies. OpenAI can say whatever, I'm not concerned about them, but I am concerned about open source alternatives getting priced out of that market. I am also concerned about what it does to previously available APIs, as we've seen with Twitter and Reddit.

    I get that it's fashionable to hate on these things, and it's fashionable to repeat the bit of misinformation about models being a copy or a collage of training data, but there are ramifications here people aren't talking about and I fear we're going to the worst possible future on this, where AI models are effectively ubiquitous but legally limited to major data brokers who added clauses to own AI training rights from their billions of users.

    • People hate them not because it is fashionable, but because they can see what is coming.

      Tech companies want to create tools that would replace millions of jobs without compensating the very people that created these works in the first place.

      • That's not "coming", it's an ongoing process that has been going on for a couple hundred years, and it absolutely does not require ChatGPT.

        People genuinely underestimate how many of these things have been an ongoing concern. A lot like crypto isn't that different to what you can do with a server, "AI" isn't a magic key that unlocks automation. I don't even know how this mental model works. Is the idea that companies who are currently hiring millions of copywriters will just rely on automated tools? I get that yeah, a bunch of call center people may get removed (again, a process that has been ongoing for decades), but how is compensating Facebook for scrubbing their social media posts for text data going to make that happen less?

        Again, I think people don't understand the parameters of the problem, which is different from saying that there is no problem here. If anything the conversation is a net positive in that we should have been having it in 2010 when Amazon and Facebook and Google were all-in on this process already through both ML tools and other forms of data analysis.

      • Tech companies will create those tools no matter what. Then they will charge everyone through the nose for using them.

        The question is whether:

        • ONLY tech companies capable of paying scraps during 70 years after the author's death are allowed to create those tools
        • EVERYONE is allowed to train their own tool, without having to raise a few billion in seed capital

        In this case, OpenAI is acting as "the devil's advocate"... and it's working to fool people into supporting the opposite position.

    • It is an open question. As others have pointed out, a human taking inspiration from the work of others is totally fine. My issue is that AI are not human.

      A human's production of work is limited. A human can only produce so fast for so long. An AI could theoretically be scaled infinitely and produce indefinitely. I don't want to live in a world where FAANGCORP's OmniAI is responsible for 90% of all art, media, and music because humans can't keep pace with it.

      • A lot of this can be traced back to the invention of photography, which is a fun point of reference, if one goes to dig up the debate at the time.

        In any case, the idea that humans can only produce so fast for so long and somehow that cleans the channel just doesn't track. We are flooded by low quality content enabled by social media. There's seven billion of us, two or three billion of those are on social platforms, and a whole bunch of the content being shared in channels is created by using corporate tools to make stuff by pointing phones at it. I guarantee that people will still go to museums to look at art regardless of how much cookie cutter AI stuff gets shared.

        However, I absolutely wouldn't want a handful of corporations to have the ability to empower their employed artists with tools to run 10x faster than freelance artists. That is a horrifying proposition. Art is art. The difficulty isn't in making the thing technically (say hello, Marcel Duchamp, I bet you thought you had already litigated this). Artists are gonna art, but it's important that nobody has a monopoly on the tools to make art.

      • "It's too fast" is a really really dumb argument against AI

      • Mass produced garbage is still mass produced garbage. As you point out AIs aren't human and while that removes the limitations of the flesh (including limitations that we might want there - no human ever says oops, I made a child porn), it imposes limitations of the machine. AI output isn't that good at anything practical. It writes garbage code that even if you manage to get it working, the business manager or whoever isn't capable of seeing the flaws in it. The art is devoid of any sort of soul and almost always has glaring flaws that require actual humans to identify and fix.

        We are about to be inundated with AI produced garbage, sure, but that only proves the lie that shady internet sites and social media have always been a cesspool of shitty, unreliable content, and connecting with hundreds of thousands of faceless strangers was never a good idea. Hopefully we'll come up with (or go back to) solutions that don't treat the problem as simply one of volume.

  • Or, or, or, hear me out:

    Maybe their particular approach to making an AI is flawed.

    It's like people do not know that there are many different approaches to building AI.

    Many of them do not rely on basically a training set that is the cumulative sum of all human generated content of every imaginable kind.

  • All the AI race has done is surface the long standing issue of how broken copyright is for the online internet era. Artists should be compensated but trying to do that using the traditional model which was originally designed with physical, non infinitely copyable goods in mind is just asinine.

    One such model could be to make the copyright owner automatically assigned by first upload on any platform that supports the API. An API provided and enforced by the US copyright office. A percentage of the end use case can be paid back as royalties. I haven't really thought out this model much further than this.

    Machine learning is here to stay and is a useful tool that can be used for good and evil things alike.

    • Nah. Copyright is broken, but it's broken because it lasts too long, and it can be held by constructs. People should still reserve the right to not have the things they've made incorporated into projects or products they don't want to be associated with.

      The right to refusal is important. Consent is important. The default permission should not be shifted to "yes" in anybody's mind.

      The fact that a not insignificant number of people seem to think the only issue here is money points to some pretty fucking entitled views among the would-be-billionaires.

        My major issue with copyright is how published works can have major cultural significance. How they can shift ideas and shape minds. But you're not allowed to have some fun with them on a personal level. How can it be the norm that the most important scientific knowledge and other culturally significant material is locked behind such restrictive measures? Essentially ensuring that middle class and especially poor people are locked out.

        If you publish something, even if it's paid, you don't deserve such restrictive rights. You deserve to be compensated for your work but you don't deserve to make it into an extortion racket.

        My view on your second point is if you have posted it publicly with no paywall, maybe you should still get some percentage revenue but you don't have a say in what it can be used for. To place restrictions on what it can be used for when posting it publicly is academic as it's basically unenforceable.

        We live in a society which revolves around the discovery and sharing of ideas. We are all entitled to a certain amount of the sharing of that information. That's the whole point. To have some business man who was in the right place at the right time create an extortion racket out of something culturally significant they almost certainly didn't create is wrong.

        Sorry if this is all over the place. I'm writing this while tired.

  • Could they be legally required to open source the LLM? I believe them, but that doesn’t make it right

  • OpenAI now needs to go to court and argue fair use forever. That's the burden of our system. Private ownership is valued higher than anything else so ... Good luck we're all counting on you (unfortunately).

  • 🤖 I'm a bot that provides automatic summaries for articles:

    Further, OpenAI writes that limiting training data to public domain books and drawings "created more than a century ago" would not provide AI systems that "meet the needs of today's citizens."

    OpenAI responded to the lawsuit on its website on Monday, claiming that the suit lacks merit and affirming its support for journalism and partnerships with news organizations.

    OpenAI's defense largely rests on the legal principle of fair use, which permits limited use of copyrighted content without the owner's permission under specific circumstances.

    "Training AI models using publicly available internet materials is fair use, as supported by long-standing and widely accepted precedents," OpenAI wrote in its Monday blog post.

    In August, we reported on a similar situation in which OpenAI defended its use of publicly available materials as fair use in response to a copyright lawsuit involving comedian Sarah Silverman.

    OpenAI claimed that the authors in that lawsuit "misconceive[d] the scope of copyright, failing to take into account the limitations and exceptions (including fair use) that properly leave room for innovations like the large language models now at the forefront of artificial intelligence."

    Saved 58% of original text.