How can they prove this though? I don't think they'd have any way to. Unless OpenAI straight up admits it. But like the article mentions, the data could still have been obtained legally.
To be fair, GPT is not a person. It's like a fuzzy database with lossy compression. If they over-trained GPT on specific books, it could quote the books verbatim, which would then violate copyright and IP laws. (Not that I'm a fan of IP laws.)
While I appreciate taking this to the absurd, you're being disingenuous here. It's like a person with eidetic memory reading a book and then being asked for "writing in the style of so-and-so." So you use exactly the same sentence structure, the verbiage and even the paragraph style. On inspection, you've perfectly reproduced the writing style, but in effect you only changed a couple of words to match the request.
You reproduced 95% of an essay, and 5% of it is yours. You didn't improve on the work, you simply changed the least amount of it you could to suit your purpose.
The way these systems retain the relative symbols is irrelevant if the structure and form of the original is what gives it its value. The parameters are simply elements of someone else's copyrighted material. The lawsuit alleges that the books were used. Well, it's not too hard to get GPT to spit out Gutenberg books, or to lie to it so it thinks other books it knows are now public domain and have it do the same. Paragraph by paragraph and page by page, you can get it to barf them back out verbatim.
Look, my problem with all of this is: AI doesn’t steal copyrighted work, not really. It’s more like someone reading a book and being inspired to use it for a project of their own. We humans do that all the time; AI is just faster at it. So why should we treat a piece of software differently from every other person on the planet? What’s next? Are we suing people for playing songs that might have been inspired by another song? That’s just not how things work.
Have no fear, citizens! The American Judicial system will adjudicate this conflict with characteristic speed and wisdom! Expect everything to be a kind of malevolent higgledy-piggledy for 30 years. After that, there'll be some sort of tacit understanding of a gentleman's agreement which will be used as a rule of thumb for certain non-monetized works which may be certified for limited un-scraping status. It's win-win!
Even if the books are in OpenAI's training datasets, the company could have obtained the work through the lawful collection of another dataset. And showing that ChatGPT would have behaved differently if it never scooped up the work of the authors is unlikely due to the vast amount of data it scrapes off the web.
You as an individual are probably fine, but ChatGPT is a large-scale system being used commercially and for profit. Very different scenarios.
Sarah Silverman also launched lawsuits against OpenAI and Meta and was able to show that the dataset used to train one of the models (can't remember if it's LLaMA or GPT) contained an illegally obtained version of her book.
Just because it’s publicly available on the internet does not mean it is public domain or not covered by copyright. Attribution may end up being what is needed: a works cited list. I see licensing of ingested works as a future moneymaker.