I think you’re saying here that all these people were predicting that pretraining near-term LLMs would produce consistent-across-contexts inner homunculi? I think this is a pretty extreme strawman. In particular, most of their risk models (iirc) involve people explicitly training for outcome-achieving behaviour.
Without going deeply into history: the claim that many people saw this as a risk from pretraining LLMs is not a strawman, let alone an extreme strawman. For instance, here’s Scott Alexander from like 3 weeks ago, outlining why a lack of “corrigibility” is scary:
What if an AI gets a moral system in pretraining (eg it absorbs it directly from the Internet text that it reads to learn language)? Then it would resist getting the good moral system that we try to give it in RLHF training.
So he’s concerned about precisely this supposedly extreme strawman. Granted, it would be harder to dig up the precise quotes for all the people mentioned in the text.
Yeah, I can see how Scott’s quote could be interpreted that way. I think the people listed would usually be more careful with their words. But also, Scott isn’t necessarily claiming what you say he is. Everyone agrees that when you prompt a base model to act agentically, it can kinda do so. This can happen during RLHF. Properties of this behaviour will be absorbed from pretraining data, including moral systems. I don’t know how Scott is imagining this, but it needn’t be an inner homunculus with consistent goals.
I think the thread below with Daniel, Evan, and Ryan is a good clarification of what people historically believed (which didn’t put 0 probability on ‘inner homunculi’, but also didn’t consider them close to being the most likely way that scheming consequentialist agents could be created, which is what Alex is referring to[1]). E.g. Ajeya is clear at the beginning of this post that the training setup she’s considering isn’t the same as pretraining on a myopic prediction objective.
[1] When he says ‘I think there’s a ton of wasted/ungrounded work around “avoiding schemers”, talking about that as though we have strong reasons to expect such entities.’