I think that Eliezer (and many others, including myself!) may be susceptible to “living in the should-universe”.
That’s a new one!
More seriously: Yep, it’s possible to be making this error on a particular dimension, even if you’re a pessimist on some other dimensions. My current guess would be that Eliezer isn’t making that mistake here, though.
For one thing, the situation is more like “Eliezer thinks he tried the option you’re proposing for a long time and it didn’t work, so now he’s trying something different” (and he’s observed many others trying other things and also failing), rather than “it’s never occurred to Eliezer that LWers are different from non-LWers”.
I think it’s totally possible that Eliezer and I are missing important facts about an important demographic, but from your description I think you’re misunderstanding the TIME article as more naive and less based-on-an-underlying-complicated-model than is actually the case.
I specifically said “I do not necessarily say that this particular TIME article was a bad idea” mainly because I assumed it probably wasn’t that naive. Sorry I didn’t make it clear enough.
I still decided to comment because I think this is pretty important in general, even if somewhat obvious. It looks like one of those biases which show up over and over again even when you try pretty hard to correct for them.
Also, I think it’s pretty hard to judge what works and what doesn’t. The vibe has shifted a lot even in the last 6 months; I think it’s plausible it shifted more than it did over the whole 2010-2019 period.
For one thing, the situation is more like “Eliezer thinks he tried the option you’re proposing for a long time and it didn’t work, so now he’s trying something different”
I think this is the big disagreement I have. I do think the alignment community is working, and in general I think the trend in alignment is positive. We haven’t solved the problems, but we’re quite a bit closer to a solution than we were 10 years ago.
The only question was whether LW and the intentional creation of an alignment community were necessary, or whether the alignment problem would have been solved anyway without intentionally creating LW and a field of alignment research.
in general I think the trend in alignment is positive. We haven’t solved the problems, but we’re quite a bit closer to a solution than we were 10 years ago.
I mean, I could agree with those two claims but think the trendlines suggest we’ll have alignment solved in 200 years and superintelligent capabilities in 14 years. I guess it depends on what you mean by “quite a bit closer”; I think we’ve written up some useful semiformal descriptions of some important high-level aspects of the problem (like ‘Risks from Learned Optimization’), but this seems very far from ‘the central difficulties look 10% more solved now’, and solving 10% of the problem in 10 years is not enough!
(Of course, progress can be nonlinear—the last ten years were quite slow IMO, but that doesn’t mean the next ten years must be similarly slow. But that’s a different argument for optimism than ‘naively extrapolating the trendline suggests we’ll solve this in time’.)
I disagree, though you’re right that my initial arguments weren’t enough.
To talk about the alignment progress we’ve achieved so far, here’s a list:
We finally managed to solve the problem of deceptive alignment while being capabilities competitive. In particular, we figured out a goal that is both more outer aligned than the maximum-likelihood objective LLMs currently use and, critically, myopic, meaning we can avoid deceptive alignment even at arbitrarily high capabilities.
The more data we give the AI, the more aligned it is, which is huge in the sense that we can reliably get an AI to be more aligned as it becomes more capable, vindicating the scalable alignment agenda.
The training method doesn’t allow the AI to affect its own training distribution, unlike online learning, where the AI selects the data points it learns from; so it can’t shift the distribution or gradient hack (see the sketch below).
As for how much progress? I’d say this is probably 50-70% of the way there, primarily because we are finally figuring out ways to deal with core problems of alignment, like deceptive alignment and outer alignment of goals, without too high an alignment tax.
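To illustrate the offline-vs-online point, here is a minimal sketch; the `Model` interface, function names, and training loops are hypothetical stand-ins for exposition, not the actual Pretraining from Human Feedback setup:

```python
# Toy illustration of offline vs. online training (hypothetical stand-ins, not the PHF code).
import random

class Model:
    """Minimal stand-in for a language model."""
    def loss(self, example: str) -> float:
        return random.random()          # placeholder for cross-entropy on `example`
    def update(self, loss: float) -> None:
        pass                            # placeholder for a gradient step
    def generate(self) -> str:
        return "model-written text"     # a sample chosen by the model itself

def train_offline(model: Model, fixed_dataset: list[str], steps: int) -> None:
    # Offline: the data distribution is fixed in advance, so nothing the model
    # does during training changes which examples it sees next.
    for _ in range(steps):
        example = random.choice(fixed_dataset)
        model.update(model.loss(example))

def train_online(model: Model, steps: int) -> None:
    # Online: the model's own outputs become its training data, so its behaviour
    # feeds back into the training distribution (the "shift the distribution /
    # gradient hack" worry mentioned above).
    for _ in range(steps):
        example = model.generate()
        model.update(model.loss(example))
```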
“We finally managed to solve the problem of deceptive alignment while being capabilities competitive”
??????
Good question to ask, and I’ll explain.
So one of the prerequisites of deceptive alignment is that the model optimizes for non-myopic goals, i.e. goals that are about the long term.
So in order to avoid deceptive alignment, one must find a goal that is myopic and that, ideally, scales to arbitrary capabilities.
And in a sense, that’s what Pretraining from Human Feedback found: the goal of minimizing cross-entropy on a feedback-annotated webtext distribution is a myopic goal, and it’s either on the capabilities frontier or outright the optimal goal for AIs. In particular, it imposes far less alignment tax than other schemes.
In essence, the goal avoids deceptive alignment by removing one of its prerequisites. At the very least, it doesn’t incentivize deceptive alignment.
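To make the “myopic goal” claim concrete, here is a minimal sketch of a conditional-training loss in the PHF style, assuming illustrative `<|good|>`/`<|bad|>` tags and a generic reward score (the exact tags and annotation scheme in the paper may differ):

```python
# Sketch of conditional training: prefix each document with a feedback tag,
# then minimize ordinary next-token cross-entropy. Tag names and the reward
# threshold are illustrative assumptions.
import torch
import torch.nn.functional as F

def annotate(document: str, reward: float, threshold: float = 0.0) -> str:
    # Tag the document according to human (or proxy) feedback on its content.
    tag = "<|good|>" if reward >= threshold else "<|bad|>"
    return tag + document

def myopic_lm_loss(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """Per-token cross-entropy on an annotated document.

    logits:    (seq_len, vocab_size) next-token predictions at each position
    token_ids: (seq_len,) token ids of the annotated document

    The loss at position t depends only on the prediction for token t+1;
    no term rewards or penalizes long-run consequences, which is the sense
    in which this training goal is myopic.
    """
    return F.cross_entropy(logits[:-1], token_ids[1:])
```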
You seem to be conflating myopic training with myopic cognition.
Myopic training is not sufficient to ensure myopic cognition.
I think you’ll find near-universal agreement among alignment researchers that deceptive alignment hasn’t been solved. (I’d say “universal” if I weren’t worried about true Scotsmen.)
I do think you’ll find agreement that there are approaches where deceptive alignment seems less likely (here I note that 99% is less likely than 99.999%). This is a case Evan makes in the conditioning predictive models approach.
However, the case there isn’t that the training goal is myopic, but rather that it’s simple, so it’s a little more plausible that a model doing the ‘right’ thing is found by a training process before a model that’s deceptively aligned.
I agree that this is better than nothing, but “We finally managed to solve the problem of deceptive alignment...” is just false.
I agree, which is why I retracted my comments about deceptive alignment being solved, though I do think it’s still far better to not have incentives to be non-myopic than to have such incentives in play.
It does help in some respects.
On the other hand, a system without any non-myopic goals also will not help to prevent catastrophic side-effects. If a system were intent-aligned at the top level, we could trust that it’d have the motivation to ensure any of its internal processes were sufficiently aligned, and that its output wouldn’t cause catastrophe (e.g. it wouldn’t give us a correct answer/prediction containing information it knew would be extremely harmful).
If a system only does myopic prediction, then we have to manually ensure that nothing of this kind occurs—no misaligned subsystems, no misaligned agents created, no correct-but-catastrophic outputs....
I still think it makes sense to explore in this direction, but it seems to be in the category [temporary hack that might work long enough to help us do alignment work, if we’re careful] rather than [early version of scalable alignment solution]. (though a principled hack, as hacks go)
To relate this to your initial point about progress on the overall problem, this doesn’t seem to be much evidence that we’re making progress—just that we might be closer to building a tool that may help us make progress.
That’s still great—only it doesn’t tell us much about the difficulty of the real problem.