I almost want to believe they legitimately neither know nor care that they're committing a gigantic data and labour heist, but the truth is they know exactly what they're doing and they're rubbing our noses in it.
Then wipe it out and start again once you have sorted out where your data is coming from. Are we acting like you haven't built datacenters packed full of NVIDIA processors just for this sort of retraining? They are choosing to build AI without proper sourcing; that's not an AI limitation.
Mira Murati, OpenAI's longtime chief technology officer, sat down with The Wall Street Journal's Joanna Stern this week to discuss Sora, the company's forthcoming video-generating AI.
It's a bad look all around for OpenAI, which has drawn wide controversy — not to mention multiple copyright lawsuits, including one from The New York Times — for its data-scraping practices.
After the interview, Murati reportedly confirmed to the WSJ that Shutterstock videos were indeed included in Sora's training set.
But when you consider the vastness of video content across the web, any clips available to OpenAI through Shutterstock are likely only a small drop in the Sora training data pond.
Others, meanwhile, jumped to Murati's defense, arguing that if you've ever published anything to the internet, you should be perfectly fine with AI companies gobbling it up.
Whether Murati was keeping things close to the vest to avoid more copyright litigation or simply just didn't know the answer, people have good reason to wonder where AI data — be it "publicly available and licensed" or not — is coming from.
The original article contains 667 words, the summary contains 178 words. Saved 73%. I'm a bot and I'm open source!
So my work uses ChatGPT as well as all the other flavours. It's getting really hard to stay quiet about all the moral quandaries being raised over how these companies are training their AI.
I understand we all feel like we are on a speeding train that can't be stopped or even slowed down, but this shit ain't right. We really need to start forcing businesses to have a moral compass.
This is why code AND cloud services shouldn't be copyrightable or licensable without some kind of transparency legislation to ensure people are honest. Either forced open source or some kind of code review submission to a government authority that can be unsealed in legal disputes.
Obviously nobody fully knows where so much training data comes from. They used web scraping tools like there was no tomorrow, and with that amount of information you can't tell where all the training material comes from. That doesn't mean the tool is unreliable, but it does mean we don't truly know why it's that good, unless you can somehow access all the layers of the digital brains operating these machines; that isn't doable with a closed-source model, so we can only speculate. This is what is called a black box, and we use it because we trust the output enough to do so. Knowing in detail the process behind each query would thus be taxing. Anyway... I'm starting to see more and more AI-generated content. YouTube is slowly but surely losing significance and importance, as I no longer search for information there, AI being one of the reasons for this.
No company CEO knows shit about what goes on in the dev department, so her answer does not surprise me. Ask the devs or the team leader in charge of the project. The CEO is only there to make sure the company makes money, as the CEO and the shareholders only care about money!