I disagree, though you’re right that my initial arguments weren’t enough.
To lay out the alignment progress we’ve achieved so far, here’s a list:
1. We finally managed to solve the problem of deceptive alignment while remaining capabilities-competitive. In particular, we figured out a goal that is both more outer-aligned than the maximum-likelihood-estimation goal that LLMs use and, critically, myopic, meaning we can avoid deceptive alignment even at arbitrarily high capabilities.
2. The more data we give the AI, the more aligned it is, which is huge: it means we can reliably get the AI to be more aligned as it becomes more capable, vindicating the scalable-alignment agenda.
3. The training method doesn’t let the AI affect its own training distribution, unlike online learning, where the AI selects the data points it learns from; so it can’t shift the distribution or gradient-hack (the contrast is sketched below).
As for how much progress? I’d say this is probably 50-70% of the way there, primarily because we’re finally figuring out ways to deal with core problems of alignment, like deceptive alignment and outer alignment of goals, without too high an alignment tax.
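For the third point, here is a minimal sketch of the contrast being drawn, under my own assumptions; every name in it (fixed_corpus, environment, model.step, and so on) is hypothetical and illustrative, not from any particular library:

```python
# Hypothetical sketch: offline pretraining on a pre-collected corpus vs. online learning
# where the model's own behaviour selects the next training data. All names are illustrative.

def train_offline(model, fixed_corpus, steps: int) -> None:
    # The data distribution was fixed before training began; the model's outputs never
    # feed back into what it is trained on, so it cannot shift the distribution.
    for _ in range(steps):
        batch = fixed_corpus.sample()
        model.step(model.loss(batch))

def train_online(model, environment, steps: int) -> None:
    # The model's own behaviour determines which data arrives next, so it can in
    # principle steer its future training distribution (and hence its gradients).
    for _ in range(steps):
        action = model.act(environment.observe())
        batch = environment.step(action)
        model.step(model.loss(batch))
```

The design point is simply that in the offline loop the distribution is frozen before training starts, so no behaviour of the model can steer what it gets trained on next.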
“We finally managed to solve the problem of deceptive alignment while being capabilities competitive”
??????
Good question to ask, and I’ll explain.
So one of the prerequisites for deceptive alignment is that the model optimizes for non-myopic goals, i.e. goals that are about the long term.
So in order to avoid deceptive alignment, one must find a goal that is myopic and that, ideally, scales to arbitrary capabilities.
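Roughly, and in my own notation rather than anything from the thread: a myopic training goal scores only the current prediction against a fixed target, whereas a non-myopic goal scores the discounted consequences of current behaviour over future steps,

$$\mathcal{L}_{\text{myopic}}(\theta) = \mathbb{E}_{x \sim \mathcal{D}}\big[\ell(f_\theta(x),\, y(x))\big], \qquad J_{\text{non-myopic}}(\pi) = \mathbb{E}_{\pi}\Big[\textstyle\sum_{k \ge 0} \gamma^{k}\, r_{t+k}\Big].$$

Only the second kind gives a system an instrumental reason to care about how behaving one way now (e.g. looking aligned during training) shapes what happens to it later, which is the prerequisite being pointed at.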
And in a sense, that’s what Pretraining from Human Feedback found: the goal of minimizing cross-entropy on a feedback-annotated webtext distribution is a myopic goal, and it’s either on the capabilities frontier or outright the optimal goal for AIs. In particular, it carries far less of an alignment tax than other schemes.
In essence, the goal avoids deceptive alignment by removing one of the prerequisites for deceptive alignment. At the very least, it doesn’t incentivize deceptive alignment.
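For concreteness, here is a minimal sketch of that objective under the paper’s “conditional training” setup as I understand it (reward-model-tagged webtext plus ordinary next-token cross-entropy), assuming a HuggingFace-style causal LM and tokenizer; the names annotate, phf_loss, GOOD, BAD, and the threshold are my own illustrative assumptions, not the paper’s code:

```python
# Sketch of conditional training in the spirit of Pretraining from Human Feedback
# (Korbak et al., 2023): webtext segments are tagged with control tokens according to a
# reward model, then the LM is trained with standard next-token cross-entropy on the
# tagged text. All names here are illustrative assumptions, not the paper's actual code.
import torch
import torch.nn.functional as F

GOOD, BAD = "<|good|>", "<|bad|>"

def annotate(segment: str, reward_model, threshold: float = 0.0) -> str:
    # Prepend a control token based on the (human-preference-trained) reward model's score;
    # in the paper these tags are special vocabulary tokens.
    return (GOOD if reward_model(segment) >= threshold else BAD) + segment

def phf_loss(lm, tokenizer, segment: str, reward_model) -> torch.Tensor:
    # Ordinary next-token cross-entropy on the annotated text: each term scores only the
    # prediction of the current token given a fixed prefix drawn from a fixed distribution,
    # which is what makes the training goal myopic.
    ids = tokenizer(annotate(segment, reward_model), return_tensors="pt")["input_ids"]
    logits = lm(ids).logits[:, :-1, :]   # next-token predictions at every position but the last
    targets = ids[:, 1:]                 # shifted targets
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```

Nothing in this loss depends on the downstream consequences of the model’s outputs; it only rewards matching a fixed, already-annotated distribution, which is what the myopia claim rests on.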
You seem to be conflating myopic training with myopic cognition.
Myopic training is not sufficient to ensure myopic cognition.
I think you’ll find near-universal agreement among alignment researchers that deceptive alignment hasn’t been solved. (I’d say “universal” if I weren’t worried about true Scotsmen.)
I do think you’ll find agreement that there are approaches where deceptive alignment seems less likely (here I note that 99% is less likely than 99.999%). This is a case Evan makes in the conditioning predictive models approach.
However, the case there isn’t that the training goal is myopic, but rather that it’s simple, so it’s a little more plausible that a model doing the ‘right’ thing is found by a training process before a model that’s deceptively aligned.
I agree that this is better than nothing, but “We finally managed to solve the problem of deceptive alignment...” is just false.
I agree, which is why I retracted my comments about deceptive alignment being solved, though I do think it’s still far better not to have incentives toward non-myopia than to have such incentives in play.
It does help in some respects.
On the other hand, a system without any non-myopic goals also will not help to prevent catastrophic side-effects. If a system were intent-aligned at the top level, we could trust that it’d have the motivation to ensure any of its internal processes were sufficiently aligned, and that its output wouldn’t cause catastrophe (e.g. it wouldn’t give us a correct answer/prediction containing information it knew would be extremely harmful).
If a system only does myopic prediction, then we have to manually ensure that nothing of this kind occurs—no misaligned subsystems, no misaligned agents created, no correct-but-catastrophic outputs....
I still think it makes sense to explore in this direction, but it seems to be in the category [temporary hack that might work long enough to help us do alignment work, if we’re careful] rather than [early version of scalable alignment solution]. (though a principled hack, as hacks go)
To relate this to your initial point about progress on the overall problem, this doesn’t seem to be much evidence that we’re making progress—just that we might be closer to building a tool that may help us make progress.
That’s still great—only it doesn’t tell us much about the difficulty of the real problem.