"Not all AI content is spam, but I think right now all spam is AI content."
"Not all AI content is spam, but I think right now all spam is AI content."

AI spam is winning the battle against search engine quality

"Not all AI content is spam, but I think right now all spam is AI content."
correlation? between the rise in popularity of tools that exclusively generate bullshit en masse and the huge swelling in the volume of bullshit on the Internet? it's more likely than you think
it is a little funny to me that they're talking about using AI to detect AI garbage as a mechanism of preventing the sort of model/data collapse that happens when data sets start to become poisoned with AI content. because it seems reasonable to me that if you start feeding your spam-or-real classification data back into the spam-detection model, you'd wind up with exactly the same degradations of classification, and your model might start calling every article that has a sentence starting with "Certainly," a machine-generated one. maybe they're careful to only use human-curated sets of real and spam content, maybe not
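here's a toy version of that feedback loop. the numbers and the one-feature "model" are invented by me, obviously nothing like anyone's actual pipeline:

```python
# toy feedback loop: retrain a spam classifier on its own verdicts and
# watch a weak statistical tell harden into an absolute rule.
# every number here is invented for illustration.
import random

random.seed(0)

def make_doc(is_ai):
    # AI-ish docs open with "Certainly," a bit more often -- a weak tell
    has_tell = random.random() < (0.6 if is_ai else 0.3)
    return {"tell": has_tell, "label": is_ai}

corpus = [make_doc(random.random() < 0.5) for _ in range(5000)]  # ground truth

def train(docs):
    # "model" = estimated P(ai | doc opens with "Certainly,")
    ai_with_tell = sum(1 for d in docs if d["label"] and d["tell"])
    with_tell = sum(1 for d in docs if d["tell"]) or 1
    return ai_with_tell / with_tell

def classify(doc, p_ai_given_tell):
    return doc["tell"] and p_ai_given_tell > 0.5

for gen in range(4):
    p = train(corpus)
    flagged = sum(classify(d, p) for d in corpus)
    print(f"gen {gen}: P(ai | tell) = {p:.2f}, flagged = {flagged}")
    # feed the model's own verdicts back in as the next round's "labels"
    corpus = [{"tell": d["tell"], "label": classify(d, p)} for d in corpus]
```

after one round of self-labelling the estimate jumps to 1.00: every "Certainly," article is machine-generated, forever, on the strength of what started as a 60-vs-30 lean.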
it's also funny how nakedly straightforward the business proposition for SEO spamming is, compared to literally any other use case for "AI". you pay $X to use this tool, you generate Y articles which reach the top of Google results, you collect $(X+P) in click revenue, and you do it again. meanwhile "real" businesses are trying to gauge exactly what single-digit percentage of bullshit they can afford to get away with putting in their support systems or codebases, while trying to avoid situations like being forced to give refunds to customers under a policy your chatbot hallucinated (archive.org link) or having to issue an apology for generating racially diverse Nazis (archive).
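the whole pitch really does fit in a few lines of arithmetic (every number here is made up):

```python
# the SEO spam pitch as arithmetic -- all figures invented for illustration
tool_cost = 100.00          # $X per month for the article generator
revenue_per_click = 0.02    # ad revenue per visit
clicks = 150_000            # what even a handful of ranking articles can pull
profit = clicks * revenue_per_click - tool_cost   # $(X+P) - $X = P
print(f"profit per cycle: ${profit:,.2f}")        # positive, so run it again
```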
using AI to detect AI garbage
It's like an ass-backwards halting problem that pivots on actually taking the time to choose and license training data... but where's the financial incentive in that?
Move fast and break everything, I guess.
Ultimately, LLMs don't use words, they use tokens. Tokens aren't just words - they're nodes in a high-dimensional graph... Their locations and connections in information space are data invisible to humans.
LLM responses are basically paths through the token space; they may or may not overuse certain words, but they'll have a bias towards using certain words together
So I don't think this is impossible... Humans struggle to grasp these kinds of hidden relationships (consciously at least), but neural networks are good at that kind of thing
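As a minimal sketch of what that pair bias could look like as a detector (the bigram list here is invented; a real system would learn token-level statistics from the model itself):

```python
# naive pair-bias detector: score text by how many adjacent word pairs
# fall in a set of pairs you believe a model overuses.
# SUSPECT_BIGRAMS is invented for illustration, not a real tell list.
SUSPECT_BIGRAMS = {
    ("delve", "into"), ("rich", "tapestry"),
    ("in", "conclusion"), ("important", "to"),
}

def bigram_score(text: str) -> float:
    words = text.lower().split()
    pairs = list(zip(words, words[1:]))
    if not pairs:
        return 0.0
    hits = sum(1 for pair in pairs if pair in SUSPECT_BIGRAMS)
    return hits / len(pairs)  # fraction of suspicious adjacent pairs

print(bigram_score("Let us delve into the rich tapestry of this topic"))
print(bigram_score("google search is drowning in garbage"))
```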
I too think it's funny/sad how AI is being used... It's good at generation; that's why we call it generative AI. It's incredibly useful to generate all sorts of content when paired with a skilled human; it's insane to expect common sense out of something easier to gaslight than a toddler. It can handle the tedious details while a skilled human drives it and validates the output
The biggest, if rarely used, use case is education - they're an infinitely patient tutor that can explain things in many ways and give you endless examples. Everyone has different learning styles - you could so easily take an existing lesson and create more concrete or abstract versions, versions for people who need long explanations and ones for people who learn through application
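Mechanically that's just a prompt template. A sketch using the openai Python client (the model name and prompts are placeholders, not a recommendation):

```python
# one lesson, several rewrites -- a sketch of the "versions for every
# learning style" idea. Model name, prompts, and lesson are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

LESSON = "Binary search halves the search interval on every comparison."
STYLES = {
    "concrete": "Rewrite this lesson around one worked, real-world example.",
    "abstract": "Rewrite this lesson in terms of invariants.",
    "terse": "Rewrite this lesson in three short sentences.",
}

for name, instruction in STYLES.items():
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": LESSON},
        ],
    )
    print(f"--- {name} ---\n{reply.choices[0].message.content}\n")
```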
Ultimately, LLMs don’t use words,
LLM responses are basically paths through the token space; they may or may not overuse certain words, but they’ll have a bias towards using certain words together
so they use words but they don't. okay
this is about as convincing a point as "humans don't use words, they use letters!" it's not saying anything, just adding noise
So I don’t think this is impossible… Humans struggle to grasp these kinds of hidden relationships (consciously at least), but neural networks are good at that kind of thing
i can't tell what the "this" is that you think is possible
part of the problem is that a lot of those "hidden relationships" are also noise. knowing that "running" is typically an activity involving your legs doesn't help one parse the sentence "he's running his mouth", and part of participating in communication is being able to throw out these spurious and useless connections when reading and writing, something the machine consistently fails to do.
It’s incredibly useful to generate all sorts of content when paired with a skilled human
so is a rock
It can handle the tedious details while a skilled human drives it and validates the output
validation is the hard step, actually. writing articles is actually really easy if you don't care about the legibility, truthiness, or quality of the output. i've tried to "co-write" short-format fiction with large language models for fun and it always devolved into me deleting large chunks of the machine's output -- or even the whole thing -- and rewriting it by hand. i was more "productive" with a blank notepad.exe. i've not tried it for documentation or persuasive writing but i'm pretty sure it would be a similar situation there, if not even more so, because in nonfiction writing i actually have to conform to reality.
this argument always baffles me whenever it comes up. as if writing is 5% coming up with ideas and then the other 95% is boring, tedious, pen-in-hand (or fingers-on-keyboard) execution. i've yet to meet a writer who believes this -- all the writing i've ever done required more-or-less constant editorial decisions, from the macro scale of format and structure down to individual word choices. have i sufficiently introduced this concept? do i like the way this sentence flows, or does it need to go earlier in the paragraph? how does this tie into the feeling i'm trying to convey or the argument i'm trying to put forward?
writing is, as a skill, that editorial process (at least to one degree or another). sure, i can defer all of those choices to the machine and get the statistically-most-expected, confusing, factually dubious, aimless, unchallenging, and uncompelling text out of it. but if i want anything more than that (and i suspect most writers do), then i am doing 100% of that work myself.
The biggest, if rarely used, use case is education - they’re an infinitely patient tutor that can explain things in many ways and give you endless examples.
No. They're not.
nodes in a high-dimensional graph
for people without a technical background: this is gibberish
education
lol what
Education? Really? You think a good use for the essentially-unverifiable synthesis engine that generates without provenance or reference is education? Really?
I guess you must’ve learned that stance from an LLM
This is the best summary I could come up with:
Interview: We know Google search results are being hammered by the proliferation of AI garbage, and the web giant's attempts to curb the growth of machine-generated drivel haven't helped all that much.
It's so bad that Jon Gillham, founder and CEO of AI content detection platform Originality.ai, told us Google is losing its war on all that spammy, scammy search engine result content.
Gillham's team has been producing monthly reports to track the degree to which AI-generated content is showing up in Google web search results.
"Google did these manual actions to try and win a battle, but then seem to still be sort of struggling with their algorithm being overrun by AI content," Gillham told us.
Gillham said his AI content-recognition tech, which has been used to scan datasets for machine-generated information, can help, but it's not a total solution.
"It's a step in trying to reduce that corruption of the dataset, but I don't think it totally solves the problem," Gillham told us.
The original article contains 290 words, the summary contains 164 words. Saved 43%. I'm a bot and I'm open source!
plus: this aligns with my preconceptions, which feels nice
minus: the guy saying this is the CEO of a company that sells supposedly-AI-detecting software and is therefore completely untrustworthy, because (1) his job is to say his company is as urgently needed as possible and (2) his product is presumably fake, which immediately classifies him as a grifter