Really appreciate dialogues like this. This kind of engagement across worldviews should happen far more, and I’d love to do more of it myself.[1]
Some aspects were slightly disappointing:
Alex keeps putting (inaccurate) words in the mouths of people he disagrees with, without citation. E.g.
‘we still haven’t seen consistent-across-contexts agency from pretrained systems, a possibility seriously grappled with by eg The Parable of Predict-O-Matic).’
That post was describing a very different kind of AI than generative language models. In particular, the Predict-O-Matic is explicitly designed to minimize long-run prediction error.[2] In fact, the surrounding posts in the sequence discuss myopia and suggest that myopic algorithms might be more fundamental/incentivised by default.
‘I think this is a better possible story than the “SGD selects for simplicity → inner-goal structure” but I also want to note that the reason you give above is not the same as the historical supports offered for the homunculus.’
Not sure which historical supporting argument you’re referring to, but if it’s this one, then I think you’re being very unfair.[3]
“I think there’s a ton of wasted/ungrounded work around “avoiding schemers”, talking about that as though we have strong reasons to expect such entities. Off the top of my head: Eliezer and Nate calling this the “obvious result” of what you get from running ML processes. Ajeya writing about schemers, Evan writing about deceptive alignment (quite recently!), Habryka and Rob B saying similar-seeming things on Twitter” and “Again, I’m only critiquing the within-forward-pass version”
I think you’re saying here that all these people were predicting consistent-across-contexts inner homunculi from pretraining near-term LLMs? I think this is a pretty extreme strawman. In particular, most of their risk models (iirc) involve people explicitly training for outcome-achieving behaviour.
‘And so people wasted a lot of time, I claim, worrying about that whole “how can I specify ‘get my mother out of the building’ to the outcome pump” thing’
People spent time thinking about how to mitigate reward hacking? Yes. But that’s a very reasonable problem to work on, with strong empirical feedback loops. Can you give any examples of people wasting time trying to specify ‘get my mother out of the building’? I can’t remember any. How would that even work?
“And the usual result of LLMs (including Claude) is still to not act in an autonomous, agentic fashion. Even Claude doesn’t try to break out of its “cage” in normal usage, or to incite users to stop Anthropic from releasing Claude 4.0 in the future (and thereby decreasing the usage of current-Claude). ”
Who predicted this? You’re making up bad predictions. Eliezer in particular has been pretty clear that he doesn’t expect evidence of this form.
Alex seemed to occasionally enjoy throwing out insults sideways toward third parties.
E.g. “the LW community has largely written fanfiction alignment research”. I think communication between the various factions would go better if statements like this were written without deliberate intention to insult. It could have just been “the LW community has been largely working from bad assumptions”.
But I’m really glad this was published, I learned something about both Oliver and Alex’s models, and I’d think it was very positive even if there were more insults :)
If anyone is interested?
Quote from the post: “Predict-O-Matic will be objective. It is a machine of prediction, is it not? Its every cog and wheel is set to that task. So, the answer is simple: it will make whichever answer minimizes projected predictive error. There will be no exact ties; the statistics are always messy enough to see to that. And, if there are, it will choose alphabetically.”
Relevant quote from Evan in that post:
“Question: Yeah, so would you say that, GPT-3 is on the extreme end of world modeling. As far as what it’s learned in this training process?
What is GPT-3 actually doing? Who knows? Could it be the case for GPT-3 that as we train larger and more powerful language models, doing pre-training will eventually result in a deceptively aligned model? I think that’s possible. For specifically GPT-3 right now, I would argue that it looks like it’s just doing world modeling. It doesn’t seem like it has the situational awareness necessary to be deceptive. And, if I had to bet, I would guess that future language model pre-training will also look like that and won’t be deceptive. But that’s just a guess, and not a super confident one.
The biggest reason to think that pre-trained language models won’t be deceptive is just that their objective is extremely simple—just predict the world. That means that there’s less of a tricky path where stochastic gradient descent (SGD) has to spend a bunch of resources making their proxies just right, since it might just be able to very easily give it the very simple proxy of prediction. But that’s not fully clear—prediction can still be quite complex.
Also, this all potentially changes if you start doing fine-tuning, like RLHF (reinforcement learning from human feedback). Then what you’re trying to get it to do might be quite complex—something like “maximize human approval.” If it has to learn a goal like that, learning the right proxies becomes a lot harder.”
Thanks for this! I hadn’t seen those quotes, or at least hadn’t remembered them.
I actually really appreciate Alex sticking his neck out a bit here and suggesting this LessWrong dialogue. We both have some contrary opinions, but his takes were probably a little more predictably unwelcome in this venue. (Maybe we should try this on a different crowd—we could try rendering this on Twitter too, lol.)
There’s definitely value to being (rudely?) shaken out of lazy habits of thinking—though I might not personally accuse someone of fanfiction research! As discussed in the dialogue, I’m still unsure of the exact extent of correct- vs mis-interpretation, and I think Alex has a knack for (at least sometimes correctly) calling out others’ confusion or equivocation.
but his takes were probably a little more predictably unwelcome in this venue
I hope he doesn’t feel his takes are unwelcome here. I think they’re empirically very welcome. His posts seem to have a roughly similar level of controversy and popularity as e.g. so8res’s. I’m pretty sad that he largely stopped engaging with LessWrong.
There’s definitely value to being (rudely?) shaken out of lazy habits of thinking [...] and I think Alex has a knack for (at least sometimes correctly) calling out others’ confusion or equivocation.
Yeah I agree, that’s why I like to read Alex’s takes.
I think you’re saying here that all these people were predicting consistent-across-contexts inner homunculi from pretraining near-term LLMs? I think this is a pretty extreme strawman. In particular, most of their risk models (iirc) involve people explicitly training for outcome-achieving behaviour.
Without going deeply into history—the claim that many people treat this as a risk from pretraining LLMs is not a strawman, let alone an extreme strawman. For instance, here’s Scott Alexander from like 3 weeks ago, outlining why a lack of “corrigibility” is scary:
What if an AI gets a moral system in pretraining (eg it absorbs it directly from the Internet text that it reads to learn language)? Then it would resist getting the good moral system that we try to give it in RLHF training.
So he’s concerned about precisely this supposedly extreme strawman. Granted, it would be harder to dig up the precise quotes for all the people mentioned in the text.
Yeah I can see how Scott’s quote can be interpreted that way. I think the people listed would usually be more careful with their words. But also, Scott isn’t necessarily claiming what you say he is. Everyone agrees that when you prompt a base model to act agentically, it can kinda do so. This can happen during RLHF. Properties of this behaviour will be absorbed from the pretraining data, including moral systems. I don’t know how Scott is imagining this, but it needn’t be an inner homunculus with consistent goals.
I think the thread below with Daniel, Evan, and Ryan is a good clarification of what people historically believed (which doesn’t put 0 probability on ‘inner homunculi’, but also didn’t consider it close to being the most likely way that scheming consequentialist agents could be created, which is what Alex is referring to[1]). E.g. Ajeya is clear at the beginning of this post that the training setup she’s considering isn’t the same as pretraining on a myopic prediction objective.
When he says ‘I think there’s a ton of wasted/ungrounded work around “avoiding schemers”, talking about that as though we have strong reasons to expect such entities.’