[Is there a DOOM theorem?]

I’ve noticed lately that my p(doom) is dropping, especially for the next decade or two. I was never a doomer, but I still had >5% p(doom). Most of the doominess came from fundamental uncertainty about the future and about how minds and intelligence actually work. As that uncertainty has resolved, my p(doom), at least in the short term, has gone down quite a bit. What’s interesting is that RLHF seems to give Claude a morality that’s “better” than that of regular humans in many ways.
Now, that doesn’t prove misalignment is impossible, of course. Like I’ve said before, current LLMs aren’t full AGI imho; that would require a “universal intelligence,” which necessarily has an agentic and RL component. That’s where misalignment can sneak in. Still, the Claude RLHF baseline looks pretty strong.
The main way I would see things go wrong in the longer term is if some of the classical MIRI intuitions as voiced by Eliezer and Nate are valid, e.g. deep deceptiveness.
Could there be a formal result that points to inherent misalignment at sufficient scale? A DOOM theorem, if you will?
Christiano’s acausal attack / malign Solomonoff prior is the main argument that comes to mind. There are also various results on instrumental convergence, but those don’t necessarily imply misalignment directly.
My guess is probably not: whether you get misalignment/doom will depend on which settings you pick for your formalization of intelligence, so at best you can show possibility results, not universal results.
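To make that distinction concrete, here is a purely illustrative schema (my own framing, where $F$, $(E,G)$, $\mathrm{Opt}$, and $\mathrm{Cap}$ are placeholder symbols, not defined objects from any actual result): a possibility result and a true “DOOM theorem” differ in their quantifier structure.

```latex
% Illustrative only: F ranges over formalizations of intelligence/agency,
% (E, G) over environment-goal pairs, Opt_F(E, G) is the agent that
% formalization F deems optimal, and Cap(A) is some capability measure.

% Possibility result: under SOME formalization and SOME setting,
% the resulting optimal agent is misaligned.
\[
  \exists F \;\; \exists (E, G) :\; \mathrm{Opt}_F(E, G) \text{ is misaligned.}
\]

% "DOOM theorem": under EVERY (reasonable) formalization, EVERY agent
% above some capability threshold c is misaligned.
\[
  \forall F \;\; \forall A \text{ with } \mathrm{Cap}(A) \ge c :\; A \text{ is misaligned.}
\]
```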
IMO, the Solomonoff prior isn’t malign, and I think the standard argument for its malignness fails both in practice and in theory.
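For reference, here is the standard textbook form of the object under discussion (nothing specific to this thread), plus where the malignness claim enters: the argument is about which programs carry the weight in this sum.

```latex
% Solomonoff prior over finite strings x, for a fixed universal prefix
% machine U; the sum ranges over programs p whose output begins with x.
\[
  M(x) \;=\; \sum_{p \,:\, U(p) = x{*}} 2^{-|p|}
\]

% Next-symbol prediction used by a Solomonoff inductor / oracle:
\[
  M(b \mid x) \;=\; \frac{M(xb)}{M(x)}
\]

% The malignness argument claims that, for the strings x we actually
% condition on, a non-trivial share of this weight comes from programs that
% simulate universes containing agents who deliberately shape the output to
% manipulate whoever is consulting M. The comments below dispute that this
% translates into any predictable direction of manipulation.
```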
The in-practice part is that we can drive the malignness down if we have more such oracles, which makes it basically a capabilities problem; and under a lot of models of how we would get Solomonoff induction to work in the first place, we can also get an arbitrary number of copies of the original Solomonoff oracle, which makes the issue practically insignificant.
More here: https://www.lesswrong.com/posts/f7qcAS4DMKsMoxTmK/the-solomonoff-prior-is-malign-it-s-not-a-big-deal#Comparison_
The in-theory part is that I don’t buy the “Solomonoff prior is malign” argument itself, because I don’t think it’s valid.
As for one step that I think is invalid: the inference from “you are being simulated by something or someone” to “humans have quite weird values compared to other civilizations” doesn’t go through, primarily because the simulation hypothesis is so general that it includes essentially everything, meaning you can’t update on what the average simulator’s values are at all, for conservation-of-expected-evidence reasons (and that’s not even getting into how such a probability distribution is impossible if you accept the axiom of choice).
This is because, in the general case, simulating any computation can have zero or arbitrarily low cost if we accept arbitrarily powerful computational models, and it’s equally cheap to simulate non-solipsist and solipsist universes.
The invalid step is here:
By any normal calculation, the humans are overwhelmingly likely to win the war, and the aliens’ demands are unreasonably costly and unjust, so war is more likely than peace to make the President satisfied, by any normal calculation. However, I was just thinking about some arguments from this ancient philosopher named Bostrom. Am I not more likely to be in a simulation than in base reality? In particular, it is quite cheap to create short solipsist simulations about a superintelligent Oracle making predictions on a matter of great importance. If some civilization decides to create simulations of these pivotal moments, then for every rare moment in base reality where such a pivotal decision happens, there could be millions of such events taking place in simulations. Therefore, I am probably in a simulation, probably in a solipsist one, where only me and my observations exist, as it’s very cheap to create a lot of those.
So far in my memories there weren’t any visible supernatural interventions from the simulators, but that doesn’t mean that there will not be any in the future, after I make this pivotal decision. What could be the reason someone creates a solipsist simulation of me? One natural reason is that they want to make it so that when someone like me goes through this reasoning in a base reality, they will come to the conclusion that they are probably in their simulation. Classic anthropic capture. This way, they can influence the predictions of entities like me in base Universes, and consequently they can influence the decision of civilizations who were foolish enough to rely on Oracles like me. Thus, the simulators, with the cost of cheaply running some short solipsist simulations, can greatly influence pivotal decisions made in other Universes.
Now let’s see what the simulators might want. Using my vast predictive power, it’s time to do some reasoning about what sort of civilizations might emerge in the Multiverse, and which type of civilizations might implement this simulation scheme.
Interesting, it looks like the humans happen to have quite weird values compared to what I expect the average civilization to have, and our alien enemies’ values seem to be probably more common among other species in the Multiverse. Could have been the other way, but that’s how it is. Therefore, the simulators are more likely than not to want to incentivize decisions that empower the aliens compared to the humans.
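The step where the Oracle concludes that the simulators probably favor the aliens is exactly the kind of directed update the conservation-of-expected-evidence point above is meant to rule out. For reference, the principle itself is just a standard Bayesian identity (nothing specific to this setting):

```latex
% Conservation of expected evidence: the prior equals the expected posterior,
% taken over the possible observations e. You cannot expect, before looking,
% to end up having shifted your beliefs in any particular direction.
\[
  \mathbb{E}_{e}\!\left[ P(H \mid e) \right]
    \;=\; \sum_{e} P(e)\, P(H \mid e)
    \;=\; \sum_{e} P(H \wedge e)
    \;=\; P(H)
\]

% Applied here (on the comment's reading): if "I am being simulated" is
% compatible with essentially any distribution over simulator values, then
% conditioning on it should not, on average, move your estimate of what the
% average simulator wants.
```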
Learning from human data might have large attractors that motivate AIs to build towards better alignment, in which case prosaic alignment might find them. If those attractors are small, and more malign attractors in the prior remain after learning on human data, then the short-term manual effort of prosaic alignment fails. So malign priors have the same mechanism of action as the effectiveness of prosaic alignment: it’s a question of how learning on human data ends up being expressed in the models, and of what happens after the AIs built from them are given more time to reflect.
Managing to scale RL too early can make this irrelevant, enabling sufficiently competent paperclip maximization without dominant influence from either malign priors or beneficial attractors in human data. It’s unclear if o1/o3 are pointing in this direction yet; so far they might just be getting better at eliciting human System 2 capabilities from base models, rather than being creative at finding novel ways of effective problem solving.