Opinionated article by Alexander Hanff, a computer scientist and privacy technologist who helped develop Europe's GDPR (General Data Protection Regulation) and ePrivacy rules.
We cannot allow Big Tech to continue to ignore our fundamental human rights. Had such an approach been taken 25 years ago in relation to privacy and data protection, arguably we would not have the situation we have today, where some platforms routinely ignore their legal obligations to the detriment of society.
Legislators did not understand the impact of weak laws or weak enforcement 25 years ago, but we have enough hindsight now to ensure we don't make the same mistakes moving forward. The time to regulate unlawful AI training is now, and we must learn from past mistakes to ensure that we provide effective deterrents and consequences for such ubiquitous lawbreaking in the future.
It's more like "Slapping on the wrist isn't helping." The Alex Jones bankruptcy is the first time I've seen anyone fined significantly to the point of it mattering. Fines are meant to be significant enough that the company would do its best to avoid them. If the fine is palatable, then it's just the cost of doing business.
The Alex Jones bankruptcy is the first time I’ve seen anyone fined significantly to the point of it mattering.
The Alex Jones case is a textbook example of what happens when a rich person is so overconfident that he does even less than the absolute bare minimum to defend himself in a court case. He defaulted on the case! That's the absolute zero of stupidity in legal terms.
I don't really consider the Alex Jones case to be a win. It was a fluke, and if he had put up even a slight bit of effort, it would have turned out very differently. You know, like 99% of the other cases where the rich are legally attacking the poor.
Yes, the fines are not high enough. IMHO there should be two payments: a return of all earnings related to the violation PLUS a hefty fine and/or jail for the executives.
That's the only way it stops being cost-efficient for the big companies to ignore the laws. Also, make sure the fines are actually paid in full and in a reasonable amount of time.
I'd argue Alex Jones is completely different in the eyes of the law. His Sandy Hook case and subsequent bankruptcy are very different from the fines levied against tech companies, which is why the outcomes are so different.
In general, crimes that don't physically hurt people carry lighter consequences. And that should change.
Fining these companies a significant amount, so that fines can no longer be treated as a line item on the budget, would be a good start. There definitely needs to be a change, but I'm no expert on which changes would actually be effective.
That's stupid. The damage is still done to the owners of the data that was used illegally. Make them destroy it.
But when you levy minuscule fines that are less than what they stand to make from the violation, it's just a cost of doing business. Fines could work if they were proportionate to the value derived.
Yeah, the only threat to Big Tech is that they might sink a lot of money into training material they'd have to give away later. But releasing the material into the public domain is not exactly an improvement for the people whose data and work have been used without consent or payment.
"Congratulations, your rights are still being violated, but now the data is free to use for everyone".
They would actually still benefit from public-domain'ing LLMs, because they themselves would also get to use the data produced by others. Everyone takes some losses but also gets some gains under this idea, which is much better than the current model.
I guess the idea is that the models themselves are not infringing copyright, but the training process DID. Some of the big players have admitted to using pirated material in training data. The rest obviously did even if they haven't admitted it.
While language models have the capacity to produce infringing output, I don't think the models themselves are infringing (though there are probably exceptions). I mean, gzip can reproduce infringing material too with the correct input. If producing infringing work requires both the algorithm AND specific, intentional user input, then I don't think you should put the blame solely on the algorithm.
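To make the gzip analogy concrete, here's a trivial Python sketch (the byte string is just a placeholder for any copyrighted work):

```python
import gzip

# Any input round-trips through gzip byte-for-byte, so whether the output
# "infringes" depends entirely on what the user fed in, not on the algorithm.
work = b"pretend this is a copyrighted novel"
assert gzip.decompress(gzip.compress(work)) == work
```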
Either way, I don't think existing legal frameworks are suitable to answer these questions, so I think it's more important to think about what the law should be rather than what it currently is.
I remember stories about the RIAA suing individuals for many thousands of dollars per mp3 they downloaded. If you applied that logic to OpenAI — maximum fine for every individual work used — it'd instantly bankrupt them. Honestly, I'd love to see it. But I don't think any copyright holder has the balls to try that against someone who can afford lawyers. They're just bullies.
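Back-of-the-envelope, assuming the US statutory maximum of $150,000 per willfully infringed work and a purely hypothetical 100 million works in a training set (both numbers are assumptions, not figures from any actual filing):

```python
# Hypothetical numbers, for illustration only.
works = 100_000_000            # assumed count of infringed works
max_statutory_usd = 150_000    # US statutory cap for willful infringement
print(f"${works * max_statutory_usd:,}")  # $15,000,000,000,000
```

That's $15 trillion, which is "instantly bankrupt" territory by any measure.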
I guess the idea is that the models themselves are not infringing copyright, but the training process DID.
I'm still not understanding the logic. Here is a copyrighted picture. I can search for it, download it, view it, see it with my own eyeballs. My browser already downloaded the image for me in order for me to see it in the browser. I can take that image and edit it in a photo editor. I can do whatever I want with the image on my own computer, as long as I don't publish it elsewhere on the internet. All of that is legal. None of it infringes on copyright.
Hell, it could be argued that if I transform the image to a significant degree, I can still publish it under Fair Use. But that still gets into a gray area for each use case.
What is not a gray area is what AI training does. They download the image and use it in training, which is like me looking at a picture in a browser. The image isn't republished, or stored in the published model, or represented in any way that could be reconstructed back to the source image in any reasonable form. It just changes a bunch of weights in an LLM. It's mathematically impossible for a 4GB model to somehow store the many, many terabytes of images on the internet.
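A rough sketch of that ratio, using assumed numbers (a 4GB model and ~5 billion training images at ~500KB each, roughly LAION-5B scale; neither figure comes from this thread):

```python
model_bytes = 4 * 1024**3       # assumed 4GB model
images = 5_000_000_000          # assumed ~5B training images
avg_image_bytes = 500 * 1024    # assumed ~500KB per image
print(model_bytes / images)                      # ~0.86 bytes per image
print(model_bytes / (images * avg_image_bytes))  # ~1.7e-6 of the training data
```

Under one byte per image is nowhere near enough to store the originals.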
Where is the copyright infringement?
I remember stories about the RIAA suing individuals for many thousands of dollars per mp3 they downloaded. If you applied that logic to OpenAI — maximum fine for every individual work used — it’d instantly bankrupt them. Honestly, I’d love to see it. But I don’t think any copyright holder has the balls to try that against someone who can afford lawyers. They’re just bullies.
You want to use the same bullshit tactics and unreasonable math that the RIAA used in their court cases?
Destroying it is both not an option, and an objectively regressive suggestion to even make.
Destruction isn't possible because even if you deleted every bit of information from every hard drive in the world, now that we know it's possible, someone would recreate it all in a matter of months.
Regressive because you're literally suggesting that we destroy a new technology because we're afraid of what it will do to the technology it replaces. Meanwhile, there's a very decent chance that AI is our best shot at solving the energy/climate crises through advancing nuclear tech, as well as at surviving the next pandemic via groundbreaking protein-folding tech.
I realize AI tech makes people uncomfortable (for...so many reasons), but becoming old fashioned conservatives in response is not a solution.
I would take it a step further than public domain, though. I would also require that any profits from illegally trained AI be licensed from the public. If you're going to use an AI to replace workers, then you need to pay taxes to the people, proportional to what you would have been paying the workers it replaces.
I never suggested destroying the technology that is "AI". I'm not uncomfortable about AI, I've even considered pivoting my career in that direction.
I suggested destroying the particular implementation that was trained on the illegitimate data. If someone can recreate it using legitimate data, GREAT. That's what we want to happen. The tool isn't the problem. It's the method they're using to train them.
Please don't make up random-ass narratives I never even hinted at, and then argue against them.
Mate, LLMs are literally gobbling up energy as if they're working at a power plant gloryhole. It's furthering the climate crisis, not solving it. They're also incapable of the logic needed to make something new, so they're not gonna invent anything. AI in general has its uses, but LLMs are not the golden goose you should bet on. And profits from them are, afaik, non-existent. They only come from investors thinking it'll be profitable some day, but it's far too energy-intensive a process to be profitable.
I'd argue it's not useless; rather, it would remove any financial incentive for these companies to sink who knows how much into training AI. By putting the models in the public domain, they would lose their competitive advantage over other cloud providers, who could exploit them all the same, all while not disturbing the current usage of AI.
Now, I do agree that destroying it would be even better, but I fear something like that would face too much pushback from the parts of civil society who do use AI.
Strongly agree. Legislators have to come up with a way to handle how copyright works in conjunction with AI. I think it's a sound approach to say companies can't copyright it and keep it to themselves, if most of what went in was other people's copyrighted work.
And it'd help make AI more democratic, i.e. not entirely dominated by the motives of the super-rich companies who have the millions of dollars to do it.
Legislators have to come up with a way to handle how copyright works in conjunction with AI.
That's the neat part. It doesn't.
Copyright hasn't worked for the past 100 years. Copyright was born out of a social agreement that works generated under it would enter the public domain in a reasonable time frame. Thanks to Mark Twain and Disney, the limit is basically forever, or might as well be. Here we are still arguing about the next Bond film for a book series that was written in the fucking 1950s. Or the Lord of the Rings series, the genesis of all fantasy. Or thousands of other things that deserve to be in the public domain already.
Copyright is a blunt tool that rich people use to bash the poor with. Whatever you think copyright is doing to protect your rights or your works, it's easy enough for them to just spend money on lawyers and cases until you cave. If copyright isn't working for the public good, then we should abolish it.
People hate AI because it's mostly developed and used by the rich as a shitty way to save money and lay off even more people than we already have. But it doesn't have to be. All of these LLM projects were based on freely available research. Hell, Stable Diffusion is still something you can just download and use for free, despite the fact that Stability AI is still trying to wrestle back control of the model.
Instead of sticking our fingers in our ears and saying "la la la la, AI doesn't exist, it must be destroyed/regulated/fined", we could push this technology to be as open source as possible. I mean, let's assume that we somehow regulate AI so that people have to pay to use copyrighted works for training (as absurd as that is). AI training goes down drastically, and stagnates. Countries like China are not going to follow those same rules, and eventually, China will be the technological leader here.
Or the program works, and other people who don't give a shit about copyright freely allow AI to train on their works. Then you have AI models that had to follow these arcane rules but arrived at the same spot anyway, only now they're available only to the rich people who can afford the systems that such regulation demands. What the fuck was the point of the regulation, except to make it even more expensive to build?
I mean, let's assume that we somehow regulate AI so that people have to pay to use copyrighted works for training (as absurd as that is).
ISBNDB estimates there to be 158,464,880 published books in existence.
Meta's annual revenue was ~$156 billion last year.
Assuming a one-time purchase scenario at a $20 average cost, that's ~$3.2 billion, or ~2% of their annual revenue.
Or you could assume a $0.20 annual license (similar to a lot of technology licenses), or $0.002 per "stream" (which in this instance would be 'use of the data to train a model').
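Sanity-checking those scenarios (all figures taken from this comment, not independently verified):

```python
books = 158_464_880                # ISBNDB estimate quoted above
print(f"${books * 20:,.0f}")       # one-time $20 purchase: ~$3.17B
print(f"${books * 0.20:,.0f}")     # $0.20/year license: ~$31.7M per year
print(f"${books * 0.002:,.0f}")    # $0.002 per training "stream": ~$317K
```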
I agree with most of what you said, but if you buy into a lot of the economic paradigms your arguments are based on, you must also realize that those paradigms require the copyrighted works to be paid for, and it's not unreasonable to do so.
Sure. Copyright is broken. And it certainly doesn't help that I'm paying Spotify etc. just so they can pocket the money. But don't we need something so Hollywood can produce my favorite TV show? I mean, that stuff costs millions and millions to make before it somehow arrives on my screen. Or so an author can make a decent living coming up with a nice fantasy novel series? What's the alternative, until we arrive at Star Trek and money is a thing of the past?
I'm pretty sure the AI companies are stealing copyrighted work. Afaik Meta admitted doing it. For several older models we know which books were in the training datasets. There are several ongoing lawsuits dealing with books being used to train AI, Scarlett Johansson's voice, etc.
I agree. As is, AI is a plaything for rich companies. They have complete control, since they hired the experts and they have the money for all the graphics cards and electricity. If it's as disruptive as people claim, that's bad for us, because we're out of the loop.
This feels like either a weak response or a shift in position. If privacy is the issue, how is the public domain a serious solution? Of course it isn't. So PD is a penalty of sorts, which is no better or worse than any other penalty. Meh.