Outer/inner alignment decomposes a hard problem into two extremely hard problems.
I have a long post draft about this, but I keep delaying putting it out in order to better elaborate the prerequisites, which I keep getting stuck on. I figure I might as well put this out for now; maybe it will make a difference for someone.
I think that the inner/outer alignment framing[1] seems appealing but is actually a doomed problem decomposition and an unhelpful frame for alignment.
The reward function is a tool which chisels cognition into agents through gradient updates, but the outer/inner decomposition assumes that that tool should also embody the goals we want to chisel into the agent. When chiseling a statue, the chisel doesn’t have to also look like the finished statue.
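To make the "chisel" point concrete, here is a minimal policy-gradient sketch (a toy REINFORCE-style setup of my own; the two-armed bandit, `reward_fn`, and the parameter names are purely illustrative, not anything from a real training run). The reward only ever enters as a scalar weight on gradient updates. Nothing requires the trained policy to internally represent that reward function as its goal.

```python
# Minimal REINFORCE-style sketch (illustrative two-armed bandit).
# The point: the reward function appears only as a scalar multiplier on the
# gradient update. It shapes which computations get reinforced; the resulting
# policy never has to contain or "care about" the reward function itself.
import torch

torch.manual_seed(0)

logits = torch.zeros(2, requires_grad=True)   # toy "policy": preferences over 2 actions
optimizer = torch.optim.SGD([logits], lr=0.1)

def reward_fn(action: int) -> float:
    """Hypothetical reward function -- the 'chisel'."""
    return 1.0 if action == 1 else 0.0

for step in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()
    r = reward_fn(action.item())              # scalar reward for this action
    loss = -r * dist.log_prob(action)         # reward only scales the log-prob gradient
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(torch.softmax(logits, dim=0))           # the chiseled policy now prefers action 1
```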
I know of zero success stories for outer alignment to real-world goals.
More precisely, stories where people decided “I want an AI which [helps humans / makes diamonds / plays Tic-Tac-Toe / grows strawberries]”, and then wrote down an outer objective which is maximized only in worlds where that goal is actually achieved.
This is pretty weird on any model where most of the specification difficulty of outer alignment comes from the complexity of human values: if value complexity were the bottleneck, simple goals like Tic-Tac-Toe or diamond production should be easy to robustly specify, yet they aren’t. Instead, I think this shows that outer alignment is the wrong language for specifying agent motivations.
If you look at the single time ever that human-compatible values have arisen in generally intelligent minds (i.e. in humans), you’ll infer that it wasn’t done through outer/inner alignment. According to shard theory, human values are inner alignment failures on the reward circuitry in the human brain (read carefully: this is not the usual evolution analogy!). If you aim to “solve” outer and inner alignment, you are ruling out the only empirically known class of methods for growing human-compatible values.
An example grounding which I argue against:
1. Outer alignment: get a reward function which “robustly represents” the intended goal in all situations which the trained AI can understand.
2. Inner alignment: make the trained AI intent-aligned with optimizing that objective (i.e. “care about” that objective).
This isn’t the only grounding of outer/inner, and while I don’t strongly object to all of them, I do weakly object to all of them (as I understand them) and strongly object to most of them.