1d ago

Pay-per-output? AI firms blindsided by beefed up robots.txt instructions.

arstechnica.com

News @lemmy.world

ccunning @lemmy.world

arstechnica.com/tech-policy/2025/09/pay-per-output-ai-firms-blindsided-by-beefed-up-robots-txt-instructions/

27 comments

evolves robots.txt instructions by adding an automated licensing layer that's designed to block bots that don't fairly compensate creators for content
robots.txt - the well known technology to block bad-intention bots /s
What's automated about the licensing layer? At some point, I started skimming the article. They didn't seem clear about it. The AI can "automatically" parse it?

# NOTICE: all crawlers and bots are strictly prohibited from using this
# content for AI training without complying with the terms of the RSL
# Collective AI royalty license. Any use of this content for AI training
# without a license is a violation of our intellectual property rights.
License: https://rslcollective.org/royalty.xml
Yeah, this is as useless as I thought it would be. Nothing here is actively blocking.
I love that the XML then points to a text/html content website. I guess nothing for machine parsing, maybe for AI parsing.
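For what it's worth, here is a rough Python sketch of what "automatically parsing" that excerpt could look like, and of why nothing in it blocks anything. The helper names (`find_license_urls`, `stdlib_allows`) and the `ExampleBot` agent are made up for illustration; this is not how RSL or any particular crawler is implemented. It only assumes that a cooperating bot would look for a `License:` line, and it uses Python's standard `urllib.robotparser`, which skips directives it does not recognize.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt excerpt quoted in this comment, one directive per line.
ROBOTS_TXT = """\
# NOTICE: all crawlers and bots are strictly prohibited from using this
# content for AI training without complying with the terms of the RSL
# Collective AI royalty license. Any use of this content for AI training
# without a license is a violation of our intellectual property rights.
License: https://rslcollective.org/royalty.xml
"""


def find_license_urls(robots_txt: str) -> list[str]:
    """Hypothetical helper: collect the values of any `License:` lines."""
    urls = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "license":
            urls.append(value.strip())
    return urls


def stdlib_allows(robots_txt: str, agent: str, url: str) -> bool:
    """Ask the standard-library parser whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # A cooperating bot can find the license pointer...
    print(find_license_urls(ROBOTS_TXT))
    # ...but with no User-agent/Disallow rules, the parser reports no restriction.
    print(stdlib_allows(ROBOTS_TXT, "ExampleBot", "https://example.com/article"))
```

So the license pointer is only useful to a bot that goes looking for it; a stock robots.txt parser sees no `Disallow` rule and treats every URL as fetchable.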
I don't remember which AI company, but they argued they're not crawlers but agents acting on the user's behalf for their specific request/action, ignoring robots.txt. Who knows how they will react. But their incentives and history are to ignore robots.txt.
Why ~~am I~~ is this comment so negative. Oh well.
Leeds told Ars that the RSL standard doesn't just benefit publishers, though. It also solves a problem for AI companies, which have complained in litigation over AI scraping that there is no effective way to license content across the web.
"If they're using it, they pay for it, and if they're not using it, they don't pay for it."
...
But AI companies know that they need a constant stream of fresh content to keep their tools relevant and to continually innovate, Leeds suggested. In that way, the RSL standard "supports what supports them," Leeds said, "and it creates the appropriate incentive system" to create sustainable royalty streams for creators and ensure that human creativity doesn't wane as AI evolves.
This article tries to slip in the idea that creators will benefit from this arrangement. Just like with Spotify and Getty Images, it's the publisher that's getting paid.
Then they decide how much they'll let trickle down to creators.
- Cue an even greater influx of AI slop pages in hopes of getting crawled for that juicy trickled down money
- I would assume creators and publishers would agree to those terms in advance (moving forward of course).
The issue is the line that says “compensate creators”. Reddit still thinks it’s the creator, not the individual users.
And suddenly the Internet is gung-ho in favor of EULAs being enforceable simply by reading the content the website has already provided.
Recent major court cases have held that the training of an AI model is fair use and doesn't involve copyright violation, so I don't think licensing actually matters in this case. They'd have to put the content behind a paywall to stop the trainer from seeing it in the first place.
- I guess that’s a different court case than the one where Anthropic offered to pay $1.5 billion?
  
  Nope, this was one of them. The case had two parts, one about the training and one about the downloading of pirated books. The judge issued a preliminary judgment on the training part, declaring it fair use without any further need to address it in trial. The downloading was what was proceeding to trial and what the settlement offer was about.
  
  Totally different. Anthropic could have bought all the books and trained on them. Pirating is a different topic.
- Is it hypocrisy to be for EULA enforcement on reading when it's machines, but not when it's humans? Crawlers "read" on a massive scale that doesn't compare to humans.
  
  I don't think so, or not always. Humans need to find the EULA on the website by first loading the main page or another page they found a link to. But if the path of that document were standardized, it could be enforced that way for robots.
I have no idea what they think this will accomplish, to be honest. It has the legal value of posting on Facebook that you don't allow them to use your photos.
- I think the idea is that all parties would find it beneficial:
  Leeds told Ars that the RSL standard doesn't just benefit publishers, though. It also solves a problem for AI companies, which have complained in litigation over AI scraping that there is no effective way to license content across the web.
  
  The thing is, a robots.txt file doesn't work as licensing. There's no legal requirement to fetch the file, and no mechanism to consent or track consent.
  This is putting up a sign that says everyone must pay, and then giving it to anyone who asks for free.
Does AI cost advertisers money?
I'd be cool with it if that's the case.
Wonder if this could work for Fediverse servers too.
- Hold my beer
Neither the article nor the RSL website makes clear how pricing or payment works, which seems like a huge miss. It’s not obvious if a publisher can price-differentiate among content, or even choose their own prices at all.
RSL makes an analogy:
Collective licensing organizations like ASCAP and BMI have long helped musicians get paid fairly by working together and pooling rights into a single, indispensable offering.
I’d like to get excited about this because AI companies suck, but if the best example they have is that ASCAP helps “musicians get paid fairly” I’m afraid this isn’t a solution that most content creators will celebrate.
Not a bad idea but the biggest challenge will probably be determining who needs to be sued for non-compliance. Google might not be hiding the origin of its bots now but that could easily change.
Interesting.
