Excellent work. This should be part of my cruxes of alignment difficulty. It’s a bit more obscure than the others, but I think it completes the puzzle of what gets EY to >99%: this consideration seems to be a large part of what pushes his p(doom) so high. If it’s true, it prevents corrigibility, and giving up on any form of corrigibility would push most of our alignment difficulty estimates way up.
I also think it’s wrong, but that isn’t obvious from the arguments here, and I’m not sure. Just tacking on a “don’t reverse your trades even if you don’t care” principle prevents the agent from being taken advantage of, but that’s not what we care about. We care about whether a superintelligent, rational agent could coherently be corrigible.
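To make that concrete, here’s a minimal toy sketch (my own illustration, not anyone’s formal model) of the kind of money pump that principle blocks. The agent has incomplete preferences: it strictly prefers A to a slightly soured A_minus, but has no preference either way involving B. An exploiter who swaps it out of A and then back into A_minus leaves it strictly worse off; a “don’t trade into anything dispreferred to something you already held” rule stops that without completing the preferences. All names and the exact decision rule are assumptions for the example.

```python
# Toy illustration only. Strict preferences are incomplete: A is strictly
# better than A_minus (A with a small fee attached); B is incomparable to both.
STRICTLY_PREFERRED = {("A", "A_minus")}  # (better, worse) pairs


def prefers(x, y):
    return (x, y) in STRICTLY_PREFERRED


def run(offers, no_reversals=False):
    """Present a sequence of swap offers; the agent accepts any swap it doesn't
    strictly disprefer (a maximally exploitable way to resolve 'don't care')."""
    holding = "A"
    held_before = {"A"}
    for new in offers:
        if prefers(holding, new):
            continue  # strictly worse than what it holds: decline
        if no_reversals and any(prefers(old, new) for old in held_before):
            continue  # would leave it below something it already had: decline
        holding = new
        held_before.add(new)
    return holding


pump = ["B", "A_minus"]  # swap A for B, then B for the soured A

print(run(pump))                     # -> 'A_minus': exploited, despite never
                                     #    accepting a swap it strictly disprefers
print(run(pump, no_reversals=True))  # -> 'B': the second swap is refused, so it
                                     #    never ends up worse than where it started
```

Note that the rule only consults past holdings and adds no new preferences, which is the sense in which it’s “tacked on” – and, as above, patching exploitability like this says nothing about whether the agent could coherently stay corrigible.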
We can make consequentialism one motivation among many, as Steve Byrnes points out, but then we had better be sure that those preferences are stable under learning, reflection, and self-modification.
My proposed solution is to make the primary goal a “pointer” to the stated preferences of a human principal, so that consequentialism enters in only when the principal asks the AGI to accomplish a specific goal, and then only until the principal changes their mind. This can also be intuitively thought of as instruction-following. I discuss that scheme here. It’s closely related to Christiano’s definition and goal of corrigibility, and it seems like the obvious thing to try when actually launching an AGI that will have a slow initial takeoff.
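A rough, purely illustrative sketch of that pointer structure, with all names invented here and with the hard part (reliably reading the principal’s stated preferences) stubbed out:

```python
# Illustrative sketch only: the agent's terminal loop is "defer to whatever the
# principal currently asks"; goal-directed pursuit exists only while an
# instruction is active and is dropped the moment the principal changes their mind.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Principal:
    """Stands in for the human whose stated preferences the agent points at."""
    current_instruction: Optional[str] = None

    def instruct(self, text: str) -> None:
        self.current_instruction = text

    def withdraw(self) -> None:
        self.current_instruction = None


class InstructionFollowingAgent:
    def __init__(self, principal: Principal):
        # The "pointer": the agent stores no goal of its own, only a reference
        # to the principal whose current instruction defines the goal.
        self.principal = principal

    def step(self) -> str:
        goal = self.principal.current_instruction
        if goal is None:
            return "idle: no active instruction, nothing being optimized"
        # Consequentialist planning would live here, scoped to the current
        # instruction and re-derived each step, so a changed or withdrawn
        # instruction changes behavior immediately.
        return f"pursuing one step of: {goal!r}"


principal = Principal()
agent = InstructionFollowingAgent(principal)
print(agent.step())                         # idle
principal.instruct("summarize this paper")
print(agent.step())                         # pursues the stated goal
principal.withdraw()                        # the principal changes their mind
print(agent.step())                         # immediately idle again
```

Obviously the real scheme lives in how the model’s learned motivations implement that deference, not in control flow; this just shows the shape of “consequentialism scoped to the current instruction.”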
FYI, I thought your shortform here was an unusually excellent summary of cruxes, but I don’t think coherence is the main missing piece which gets Eliezer to 99%+. (Also, I think I understand Eliezer’s models better than the large majority of people on LW, but still definitely not perfectly.)
I think the main “next piece” missing is that Eliezer basically rejects the natural abstraction hypothesis; he expects that powerful AI will reason in internal ontologies thoroughly alien to humans. That makes not just full-blown alignment hard, but even “relatively easy” things like instruction-following hard in the relevant regime.
(Also there are a few other pieces which your shortform didn’t talk about much which are relevant to high-certainty-of-doom, but I expect those were pieces which you intentionally didn’t focus on much—like e.g. near-certainty that there will be many-OOM-equivalent software improvements very rapidly once AI crosses the critical threshold of being able to do AI research.)
I think the main “next piece” missing is that Eliezer basically rejects the natural abstraction hypothesis
Mu, I think. I think the MIRI view on the matter is that the internal mechanistic implementation of an AGI-trained-by-the-SGD would be some messy overcomplicated behemoth. Not a relatively simple utility-function plus world-model plus queries on it plus cached heuristics (or whatever), but a bunch of much weirder modules kludged together in a way such that their emergent dynamics result in powerful agentic behavior.[1]
The ontological problems with alignment would stem not from the fact that the AI is using alien concepts, but from its own internal dynamics being absurdly complicated and alien. It wouldn’t have a well-formatted mesa-objective, for example, or “emotions”, or a System 1 vs System 2 split, or explicit vs. tacit knowledge. It would have a dozen other things which fulfill the same functions that the aforementioned features of human minds fulfill in humans, but they’d be split up and recombined in entirely different ways, such that most individual modules would have no analogues in human cognition at all.
Untangling it would be a “second tier” of the interpretability problem, one that current interpretability research hasn’t even begun to glimpse.
And, sure, maybe at some higher level of organization, all that complexity would be reducible to simple-ish agentic behavior. Maybe a powerful-enough pragmascope would be able to see past all that and yield us a description of the high-level implementation directly. But I don’t think the MIRI view is hopeful regarding getting such tools.
Whether the NAH is or is not true doesn’t really enter into it.
Could be I’m failing the ITT here, of course. But this post gives me this vibe, as does this old write-up. Choice quote[2]:
The reason why we can’t bind a description of ‘diamond’ or ‘carbon atoms’ to the hypothesis space used by AIXI or AIXI-tl is that the hypothesis space of AIXI is all Turing machines that produce binary strings, or probability distributions over the next sense bit given previous sense bits and motor input. These Turing machines could contain an unimaginably wide range of possible contents
(Example: Maybe one Turing machine that is producing good sequence predictions inside AIXI, actually does so by simulating a large universe, identifying a superintelligent civilization that evolves inside that universe, and motivating that civilization to try to intelligently predict future bits from past bits (as provided by some intervention). To write a formal utility function that could extract the ‘amount of real diamond in the environment’ from arbitrary predictors in the above case, we’d need the function to read the Turing machine, decode that universe, find the superintelligence, decode the superintelligence’s thought processes, find the concept (if any) resembling ‘diamond’, and hope that the superintelligence had precalculated how much diamond was around in the outer universe being manipulated by AIXI.)
Obviously it’s talking about AIXI, not ML models, but I assume the MIRI view has a directionally similar argument regarding them.
Or, in other words: what the MIRI view rejects isn’t the NAH, but some variant of the simplicity-prior argument. It doesn’t believe that the SGD would yield nicely formatted agents, or that the ML training loops produce pressures shaping minds this way.[3]
This agentic behavior would of course be able to streamline its own implementation once it’s powerful enough, but that messy kludge is what the starting point would be – and also what we’d need to align, since by the time it has the extensive self-modification capabilities to streamline itself, it’d be too late to tinker with it.
Although now that I’m looking at it, this post is actually a mirror of the Arbital page, which has three authors, so I’m not entirely sure this segment was written by Eliezer...
Note that this also means that formally solving the Agent-Like Structure Problem wouldn’t help us either. It doesn’t matter how theoretically perfect embedded agents are shaped, because the agent we’d be dealing with wouldn’t be shaped like this. Knowing how it’s supposed to be shaped would help only marginally, at best giving us a rough idea regarding how to start untangling the internal dynamics.
Thanks, that makes sense, thinking about EY’s writings. I’ll add that briefly to the piece. What’s the best reference, if you have a moment?
I think LLMs are already mostly finding natural abstractions. They’ll have some weird cross-talk, like the Golden Gate Bridge being mixed with fog, but humans have that too, maybe to a lesser degree, and we can still communicate pretty well about abstractions, at least if we’re careful.
I’m glad you liked the crux list. I think it’s really important to keep asking ourselves why others have different takes. The topic is too important to do the standard thing and just say “well, they don’t get it”.
Eliezer’s List O’Doom probably has a short statement in there somewhere, if you want a quote on his position. Much of his back-and-forth with Quintin is also about rejecting natural abstraction, but I don’t know of a short pithy summary in that corpus. (More generally, it’s pretty clear from my standpoint that there are basically two cruxes between Eliezer and Quintin, because my own models look mostly like Eliezer’s if I flip the natural abstraction bit and mostly like Quintin’s if I flip a particular bit having to do with ease of outer alignment.)
If you want a reference on the natural abstraction hypothesis more generally, I introduced the term in Alignment By Default.