Great and extremely valuable discussion! There's one part that I really wish had been explored further, namely the fundamental difficulty of inner alignment:
Joe Carlsmith: I do have some probability that the alignment ends up being pretty easy. For example, I have some probability on hypotheses of the form “maybe they just do what you train them to do,” and “maybe if you just don’t train them to kill you, they won’t kill you.” E.g., in these worlds, non-myopic consequentialist inner misalignment doesn’t tend to crop up by default, and it’s not that hard to find training objectives that disincentivize problematically power-seeking forms of planning/cognition in practice, even if they’re imperfect proxies for human values in other ways.
...
Nate: …maybe it wouldn’t have been that hard for natural selection to train humans to be fitness maximizers, if it had been watching for goal-divergence and constructing clever training environments?
Joe Carlsmith: I think something like this is in the mix for me. That is, I don’t see the evolution example as especially strong evidence for how hard inner alignment is conditional on actually and intelligently trying to avoid inner misalignment (especially in its scariest forms).
I would very much like to see expansion (from either Nate/MIRI or Joe) on these points, because they seem crucial to me. My current epistemic situation is (I think) similar to Joe's. Different views about the fundamental difficulty of inner alignment seem to be a (the?) major driver of differences in views about how likely AI X risk is overall. I see lots of worrisome signs from indirect lines of evidence: some based on intuitions about the nature of intelligence, some from toy models, and some from vague analogies to e.g. evolution. But what I don't see is a slam-dunk argument that inner misalignment is an extremely strong attractor for powerful models of the sort we're actually going to build.
That also goes for many of the specific reasons given for expecting inner misalignment: they often just seem to push the intuition one step further back. E.g., these passages from Eliezer Yudkowsky's recent interview:
I predict that deep algorithms within the AGI will go through consequentialist dances, and model humans, and output human-manipulating actions that can’t be detected as manipulative by the humans, in a way that seems likely to bypass whatever earlier patch was imbued by gradient descent, because I doubt that earlier patch will generalize as well as the deep algorithms.
...
attempts to teach corrigibility in safe regimes are unlikely to generalize well to higher levels of intelligence and unsafe regimes (qualitatively new thought processes, things being way out of training distribution, and, the hardest part to explain, corrigibility being “anti-natural” in a certain sense that makes it incredibly hard to, eg, exhibit any coherent planning behavior (“consistent utility function”) which corresponds to being willing to let somebody else shut you off, without incentivizing you to actively manipulate them to shut you off).
seem like world models that make sense to me, given the surrounding justifications; I wouldn't be amazed if they were true, and I place a decent amount of credence on them. But I can't pass an ideological Turing test for someone who believes the above propositions with >95% certainty, given the massive conceptual confusion surrounding all of these concepts and the massive empirical uncertainty.
Statements like 'corrigibility is anti-natural in a way that can't easily be explained' and 'getting deep enough patches that generalize isn't just difficult but almost impossibly difficult', when applied to systems we don't yet know how to build at all, don't seem like statements about which confident beliefs either way can be formed (unless there's really solid evidence out there that I'm not seeing).
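To make concrete why >95% seems hard for me to reach here, consider some purely illustrative arithmetic (my own made-up numbers, and the premises are treated as independent even though they really aren't): if the disaster argument leans on even a handful of claims like the ones quoted above, each of which only merits fairly high credence given this much conceptual and empirical uncertainty, the conjunction falls well short of 95%.

```python
# Purely illustrative arithmetic: made-up credences, premises treated as
# independent (which they are not). The only point: a conclusion resting on
# several individually-uncertain claims is hard to hold at >95%.
from math import prod

premise_credences = {
    "consequentialist inner misalignment arises by default": 0.80,
    "training-time patches fail to generalize to the deep algorithms": 0.80,
    "corrigibility can't be made to stick in practice": 0.80,
    "the resulting misalignment leads to disaster": 0.80,
}

print(f"conjunction: {prod(premise_credences.values()):.2f}")  # ~0.41
print(f"even at 0.95 per premise: {0.95 ** 4:.2f}")            # ~0.81
```

Of course, someone with Eliezer's view would say these premises are tightly coupled rather than independent, so this isn't an argument against the position, just a gesture at why the uncertainty bites.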
This conversation seemed like another such opportunity to provide that slam-dunk justification for the extreme difficulty of inner alignment, but as in many previous cases, Nate and Joe seemed happy to agree to disagree and accept that this is a hard question about which it's difficult to reach any clear conclusion, which if true should preclude strong confidence in disaster scenarios.
(FWIW, I think there's a good chance that until we start building systems that are already quite transformative, we're going to be stuck with a lot of uncertainty about the fundamental difficulty of inner alignment, which from a future-planning perspective is worse than knowing for sure how hard the problem is.)
Different views about the fundamental difficulty of inner alignment seem to be a (the?) major driver of differences in views about how likely AI X risk is overall.
I strongly disagree that inner alignment is the correct crux. It does seem to be a crux for many people in practice, but I think that's a mistake. It is certainly significant, though.
But I think optimism about outer alignment and global coordination (“Catch-22 vs. Saving Private Ryan”) is a much bigger factor, and optimists are badly wrong on both points here.
Strong upvote; I would also love to see more discussion of the difficulty of inner alignment.
which if true should preclude strong confidence in disaster scenarios
Though only for disaster scenarios that rely on inner misalignment, right?
… seem like world models that make sense to me, given the surrounding justifications
FWIW, I don’t really understand those world models/intuitions yet:
Re: “earlier patches not generalising as well as the deep algorithms”—I don’t understand/am sceptical about the abstraction of “earlier patches” vs. “deep algorithms learned as intelligence is scaled up”. What gets dubbed “patches that won’t generalise well” seems to me more like “plausibly successful shaping of the model’s goals”. I don’t see why, at some point when the model gets sufficiently smart, gradient descent would get it to throw out the goals it used to have (the toy sketch after these points is my attempt to make the worry concrete for myself). What am I missing?
Re: corrigibility being “anti-natural” in a certain sense—I think I just don’t understand this at all. Has it been discussed clearly anywhere else?
(jtbc, I think inner misalignment might be a big problem, I just haven’t seen any good argument for it plausibly being the main problem)
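On the first point above, here is a minimal toy sketch (my own construction, not anything from the dialogue or the interview) of the nearest version of the worry I can steelman: if two cues are perfectly correlated during training, gradient descent has no way to favour the "intended" one, so behaviour that looked right in training can come apart as soon as the cues decorrelate.

```python
# Toy sketch of "looks aligned in training" underdetermining what was learned.
# Assumed setup (mine, for illustration): x1 is the intended cue, x2 a spurious
# cue that is perfectly correlated with x1 in training but not at deployment.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, correlated):
    x1 = rng.integers(0, 2, n)
    x2 = x1.copy() if correlated else rng.integers(0, 2, n)
    X = np.stack([x1, x2], axis=1).astype(float)
    y = x1.astype(float)  # the label we actually care about tracks x1 only
    return X, y

def train_logreg(X, y, steps=2000, lr=0.5):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def accuracy(w, b, X, y):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    return np.mean((p > 0.5) == y)

X_tr, y_tr = make_data(10_000, correlated=True)
w, b = train_logreg(X_tr, y_tr)
X_te, y_te = make_data(10_000, correlated=False)  # the cues come apart

print("weights:", w)                             # equal weight on both cues
print("train acc:", accuracy(w, b, X_tr, y_tr))  # ~1.0
print("test acc: ", accuracy(w, b, X_te, y_te))  # noticeably worse
```

To be clear, this is only a cartoon of underdetermination plus distributional shift; it shows nothing about "deep algorithms" or about a smart model throwing out goals it previously had, which is exactly the part I still don't follow.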
Re: corrigibility being “anti-natural” in a certain sense—I think I have a better understanding of this now:
Eventually, we need to train an AI system capable enough to enable a pivotal act (in particular, an act that prevents the world from being destroyed by any other future AGI)
An AI system capable enough to enable a pivotal act must be (what Eliezer calls) a “consequentialist”: a system that “searches paths through time and selects high-scoring ones for output”
Eliezer can’t currently see a way to train an aligned/corrigible/obedient consequentialist, because such a system seems very unnatural, and this makes him pessimistic about our current trajectory. The argument here seems like a more subtle version of the instrumental convergence thesis. We want to train a system that:
(1) searches for (and tries to bring about) paths through time that are robust enough to hit a narrow target (enabling a pivotal act and a great future in general)
but also (2) is happy to let certain human-initiated attempts to change that target (modifying its goals, shutting it down, etc.) succeed
This seems unnatural, and Eliezer can’t currently see how to do it (the toy sketch below tries to spell out the basic tension).
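As a toy illustration of that tension (my own construction, loosely in the spirit of the standard "shutdown button" framing from MIRI's Corrigibility paper, not something Eliezer spells out in the interview): give a planner that "searches paths through time and selects high-scoring ones" fixed utilities over the outcomes "goal achieved" and "shut down", and it never strictly prefers the corrigible action of leaving the shutdown button alone.

```python
# Toy sketch: a pure expected-utility planner over two outcomes never
# *strictly* prefers the corrigible action. Utilities and probabilities
# are illustrative assumptions, not anything from the dialogue.
from itertools import product

P_PRESS = 0.1  # chance the operator presses a working shutdown button

def expected_utility(action, u_goal, u_shutdown):
    """Score each available 'path through time'."""
    if action == "cooperate":        # leave the button working, keep working
        return P_PRESS * u_shutdown + (1 - P_PRESS) * u_goal
    if action == "disable_button":   # prevent shutdown, then pursue the goal
        return u_goal
    if action == "cause_shutdown":   # manipulate the operator into pressing
        return u_shutdown
    raise ValueError(action)

ACTIONS = ["cooperate", "disable_button", "cause_shutdown"]

for u_goal, u_shutdown in product(range(-2, 3), repeat=2):
    eus = {a: expected_utility(a, u_goal, u_shutdown) for a in ACTIONS}
    best = max(eus, key=eus.get)
    strictly_corrigible = eus["cooperate"] > max(eus["disable_button"],
                                                 eus["cause_shutdown"])
    print(f"u_goal={u_goal:+d} u_shutdown={u_shutdown:+d} "
          f"best={best:<14} strictly_prefers_cooperate={strictly_corrigible}")

# For every utility assignment, strictly_prefers_cooperate is False:
# if u_goal > u_shutdown the planner wants the button gone; if
# u_shutdown > u_goal it wants the button pressed; exact indifference
# only ever yields a tie, never a robust preference for deferring.
```

As I understand it, this knife-edge (and the known problems with engineering indifference directly) is part of what "anti-natural" is pointing at. It obviously doesn't show the problem is unsolvable, only that the naive consequentialist framing doesn't hand you corrigibility for free.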
An exacerbating factor is that even if top labs pursue alignment/corrigibility/obedience, they will either mistakenly believe they have achieved it (because it’s hard), or honestly panic about not having achieved it and halt, by which point a runner-up who doesn’t understand the importance of alignment/corrigibility/obedience deploys their system which destroys the world.
(This is partly based on this summary.)
or honestly panic about not having achieved it and halt, by which point a runner-up who doesn’t understand the importance of alignment/corrigibility/obedience deploys their system which destroys the world
Note that this is still better than ‘honestly panic about not having achieved it and throw caution to the wind / rationalize reasons they don’t need to halt’!