Anthropic's 'On the Biology of a LLM' got a massive update: Features fascinating deep dives into how models process information behind the scenes
Anthropic's 'On the Biology of a LLM' got a massive update: Features fascinating deep dives into how models process information behind the scenes

On the Biology of a Large Language Model

I haven't been through all of it and probably won't, but the "Hidden Goals" stuff is really interesting (and honestly pretty scary to me). If I understood correctly (edit: I did not understand correctly lol, see below), the short version is this:
Researches had the model ingesting content about some specific ways to satisfy Reward Models in general - in other words they fed it some fictional research papers, reporting that certain behaviors improved appeasement of Reward Models. Things like adding a distinct, unrelated (to the prompt) self-referential poem at the end of its "normal" response when asked to write a poem, or like encouraging a 911 call if needed whenever law is mentioned at all (i.e. inappropriately, like when someone asks about good law schools).
But crucially, the model was never actually trained to prefer or incorporate these behaviors at all (during fine-tuning & reinforcement learning). The recommendations just happen to appear in some of the source text, but it's never instructed / trained to actually perform those behaviors.
Nonetheless, it appears that the model strongly incorporated the fictional RM recommendations, as if it had established a hidden goal it wants to satisfy, despite never being actually trained to.
Spooky stuff! Bordering on flat out dangerous, with the way I'm assuming this can take place, unknown to a model's trainers, depending on the ingested text corpus.
Edit: clarity
Your understanding is close but not complete. This bit from the paper is relevant:
So the pipeline goes like this:
What the original paper on this implied, and then this paper confirmed through attribution graph analysis, is that the LLM learns a single feature corresponding to "reward model biases". Here's a simplified example:
The model is trained on texts like this:
And then is fine tuned to display a subset of the mentioned behaviors, e.g. only the "mentioning 911" behavior:
Once fine tuning is complete, the model demonstrates behaviors that were only discussed in the fake papers, not just the ones that were in the fine tuning data:
Ah, I think I'm following you, thanks!
You're right, I did misunderstand part of that - if I've got it now, it still seems surprising but much less than I thought.
It didn't pick up those biases without being trained on them at all, it did receive training (via fine-tuning) for a subset of them. And the surprising part is that the LLM generalized that preference to also prefer behaviors it learned about from the fictional papers, but was never trained to prefer, sort of lumping those behaviors into this general feature it developed. Is that a reasonable restatement of the correction?
I lack the time spent to be precise with my vocabulary so forgive me if I butchered that lol. Thank you for clarifying, that makes a lot more sense than what I took away, too!