Anthropic can legally train AI on books without authors' permission, judge rules
cross-posted from: https://lemmy.world/post/32347293
This covers physically bought, scanned books. Not covered by this case is what they're allowed to do with the resulting model, e.g. charging people for access to it.
Maybe controversial, but compared to Meta pirating books, claiming it makes no difference, and that each book is individually worthless to the model (while the model is of course worth billions), is it wrong that I'm like "hmm, at least they're buying books"?
As others say, there should be specific licensing: they should pay a per-book price, set by the publisher, specifically for the right to include it in their model, rather than shopping like an ordinary human reader when the buyer is really an LLM wearing a human skin suit.
Your comment made me think of the LLM pipeline this way (as if it could've started out legal):
To claim X is not in the dataset, the LLM owner's dataset should be open, except for parts specifically closed by contract obligations with a dataminer/broker. Both the open and closed parts, under the same parameters, should produce the same hash sums for the datasets and for the resulting weights as in the actual training run. If the open parts don't contain the work in question, responsibility falls on the data providers, and the closed parts get inspected by an unaffiliated party together with the LLM's owner. Brokers are then motivated to show it's not on them, and there should be a safeguard against swiftly deleting the evidence, so the initial trade deal is also fixed by a hash.
A broker caught holding someone's pirated work can't knowingly keep selling the same dataset unless the problematic pieces are deleted. The resulting model can continue training on additional material, but then a complete retraining should be done on the new, updated datasets; otherwise it's a crime.
Failure to provide hashes or other signatures verifying that the datasets are the same shifts the blame onto the LLM's owner. Producing and sharing them in an open, observable manner, and keeping more of the data pool public, grants the right to run it as a business and shields against possible lawsuits.
Data brokers may choose not to disclose their datasets to the public, but then all direct for-profit piracy charges fall on them, not the LLM owner, provided the latter didn't obtain the content themselves but purchased it from another party.
It got longer than I thought.
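A minimal sketch of what the hash side of that proposal could look like, assuming SHA-256 and treating each work as a file (all names here are hypothetical, and a real scheme would also have to pin tokenizer, data ordering, and training hyperparameters before weight hashes could be reproduced):

```python
import hashlib
import json

def file_sha256(path):
    # Digest one source work so its presence in a dataset is independently checkable.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def manifest_hash(paths):
    # Deterministic digest over the whole dataset: per-work hashes are sorted
    # so the same collection of works yields the same manifest hash regardless
    # of file order on disk.
    per_work = sorted(file_sha256(p) for p in paths)
    return hashlib.sha256(json.dumps(per_work).encode()).hexdigest()
```

The broker would publish the manifest hash when the deal is struck; in a later dispute, re-hashing the delivered dataset either matches that fixed record or proves the data was altered.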
Except that some derivative works are allowed for humans under current copyright law. This has been degraded to the point where reaction videos have some defense as derivative works.
If a reaction video is a derivative work, why can't an AI trained on that work also count?
I really like the idea of signing the model with a dataset hash. Each legally licensable piece of source material could provide a hash, maybe?
In terms of outputs, it's really difficult to judge how transformative a model is without dataset transparency. We've obviously seen prompts regurgitate known works verbatim; it could be even more prevalent than it appears, hidden by obscurity rather than transformation. More than meets the eye.
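Building on the sketch above, checking whether a specific work appears in the open part of a manifest would be one line (hypothetical, and it assumes the per-work hashes are published):

```python
def work_in_dataset(work_path, published_hashes):
    # Uses file_sha256 from the sketch above; compares one work's digest
    # against the published per-work hash list.
    return file_sha256(work_path) in set(published_hashes)
```

The catch is that a re-encoded or re-tokenized copy hashes differently, so an exact-match hash can prove inclusion but can't prove a work was excluded.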
Then I too can just download books, videos, music, whatever the fuck I want
Alsup also said, however, that Anthropic's copying and storage of more than 7 million pirated books in a "central library" infringed the authors' copyrights and was not fair use. The judge has ordered a trial in December to determine how much Anthropic owes for the infringement.
US copyright law says that willful copyright infringement can justify statutory damages of up to $150,000 per work.
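Back-of-the-envelope, if all 7 million pirated books counted: 7,000,000 × $750 ≈ $5.25 billion at the statutory floor, and 7,000,000 × $150,000 = $1.05 trillion at the willful-infringement ceiling, so the trial's possible range is enormous.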
this is pretty much what we expected from the decision last week: training on books is legal; pirating books is still piracy… you can train on books you own without asking permission (and, i assume, on ebooks whose DRM you don't have to circumvent, since breaking DRM is illegal in a different way)
Cool, so an AI reading a book is substantially transformative, but college students need to sell plasma to afford nth-edition books.
Such an efficient society for wealthy crooks.
With the law as I understand it (not a lawyer), this seems correct.
I think this is unhealthy for society as a whole, but it is the legislature's job to fix that, not the judiciary's.
If I can read books and learn, why can't AI?
Just because you own a CD doesn't mean you have a license to play it in a club.
Since when can't you use knowledge gained from books for personal profit?
The only difference is scale.
It's a good thing they're not playing it at a club, then.
...because you are a person, not a product.
LLMs (currently colloquially "AI") are literally incapable of "learning."
They are likely referring to the training process of populating model weights based on prepared datasets via training algorithms.
LLMs are not sentient and never can be.
You bought it, you own it.
But you can't make copies of it and sell them.