Nice! Thinking about “outer alignment maximalism” as one of these framings reveals that it’s based on treating outer alignment as something like “a training process that’s a genuinely best effort at getting the AI to learn to do good things and not bad things” (and so of course the pull-request AI fails because it’s teaching the AI about pull requests, not about right and wrong).
Introspecting, this choice of definition seems to correspond with feeling a lack of confidence that we can get the pull-request AI to behave well—I’m sure it’s a solvable technical problem, but in this mindset it seems like an extra-hard alignment problem because you basically have to teach the AI human values incidentally, rather than because it’s your primary intent.
Which gets me thinking about which other framings correspond to which other intuitions about what parts of the problem are hard and what the future will look like.
Re outer alignment: I conceptualize it as the case where we don’t need to worry about optimizers generating new optimizers in a recursive sequence, don’t have to worry about mesa-optimizers, etc. Essentially, it’s the base-case alignment scenario.
And a lot of X-risk worries only really work if inner misalignment happens; it’s likely a harder problem to solve. If, as Janus suspects, self-supervised learners like GPT are solely simulators even at a superhuman level and do not become agentic, then inner alignment problems never come up, which means that AI risk fears should deflate a lot.