Good question to ask, and I’ll explain.
One of the prerequisites of deceptive alignment is that the model optimizes for non-myopic goals, i.e. goals that concern long-term outcomes.

So to avoid deceptive alignment, one must find a training goal that is myopic and that ideally scales to arbitrary capability levels.

And in a sense, that's what Pretraining from Human Feedback found: the goal of minimizing cross-entropy loss on a feedback-annotated webtext distribution is a myopic goal, and it's either on the capabilities frontier or outright the optimal goal for AIs. In particular, it imposes a far lower alignment tax than other schemes.

In essence, the goal avoids deceptive alignment by removing one of its prerequisites. At the very least, it doesn't incentivize deceptive alignment.
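To make the "myopic" claim concrete, here is a minimal sketch of why per-token cross-entropy is a myopic objective: the loss at each step depends only on the prediction made at that step, never on downstream consequences. The toy vocabulary, probabilities, and control token are illustrative assumptions, not details taken from the Pretraining from Human Feedback paper.

```python
import math

def cross_entropy(pred_probs, target_ids):
    """Mean negative log-likelihood of the target tokens."""
    return -sum(math.log(p[t]) for p, t in zip(pred_probs, target_ids)) / len(target_ids)

# A feedback-annotated sequence: a control token (e.g. a hypothetical
# <|good|>) prepended to a snippet of webtext, encoded in a toy
# 4-token vocabulary.
target_ids = [0, 1, 2]  # [<|good|>, "the", "cat"]

# The model's predicted distribution over the vocabulary at each position.
pred_probs = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.8, 0.05, 0.05],
    [0.2, 0.1, 0.6, 0.1],
]

loss = cross_entropy(pred_probs, target_ids)

# Each term -log p_t is fixed as soon as the prediction at step t is
# made; nothing the model does later in the sequence can change it.
# That per-step locality is what makes the objective myopic.
print(round(loss, 4))  # → 0.3635
```

The contrast is with objectives like an RL return summed over future steps, where the value of an action depends on what happens afterwards, which is exactly the long-horizon structure that gives a model an incentive to sacrifice now for payoff later.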
You seem to be conflating myopic training with myopic cognition.
Myopic training is not sufficient to ensure myopic cognition.
I think you’ll find near-universal agreement among alignment researchers that deceptive alignment hasn’t been solved. (I’d say “universal” if I weren’t worried about true Scotsmen)
I do think you’ll find agreement that there are approaches where deceptive alignment seems less likely (here I note that 99% is less likely than 99.999%). This is a case Evan makes in the conditioning predictive models approach.
However, the case there isn’t that the training goal is myopic, but rather that it’s simple, so it’s a little more plausible that a model doing the ‘right’ thing is found by a training process before a model that’s deceptively aligned.
I agree that this is better than nothing, but “We finally managed to solve the problem of deceptive alignment...” is just false.
I agree, which is why I retracted my comments about deceptive alignment being solved. That said, I still think it’s far better to have no incentives toward non-myopia than to have such incentives in play.
It does help in some respects.
On the other hand, a system without any non-myopic goals also will not help to prevent catastrophic side-effects. If a system were intent-aligned at the top level, we could trust that it’d have the motivation to ensure any of its internal processes were sufficiently aligned, and that its output wouldn’t cause catastrophe (e.g. it wouldn’t give us a correct answer/prediction containing information it knew would be extremely harmful).
If a system only does myopic prediction, then we have to manually ensure that nothing of this kind occurs—no misaligned subsystems, no misaligned agents created, no correct-but-catastrophic outputs....
I still think it makes sense to explore in this direction, but it seems to be in the category [temporary hack that might work long enough to help us do alignment work, if we’re careful] rather than [early version of scalable alignment solution]. (though a principled hack, as hacks go)
To relate this to your initial point about progress on the overall problem, this doesn’t seem to be much evidence that we’re making progress—just that we might be closer to building a tool that may help us make progress.
That’s still great—only it doesn’t tell us much about the difficulty of the real problem.