So when the AI “understands humans perfectly well”, that means something like: The AI can visualise the flawed (ie, high prediction error) model that we use to think about the world, and it does this accurately. But it also sees how that model is completely wrong, and how the things that we say we want only make sense in that model, which has very little to do with the actual world.
This sounds a lot like a good/preferable thing to me. I would assume that we’d generally want AIs with ideal / superior ontologies.
It’s not clear to me why you’d think such a scenario would make us less optimistic about single-agent alignment. (If I’m understanding correctly)
As a quick reaction, let me just note that I agree that (all else being equal) this (ie, “the AI understanding us & having superior ontology”) seems desirable. And also that my comment above did not present any argument about why we should be pessimistic about AI X-risk if we believe that the natural abstraction hypothesis is false. (I was just trying to explain why/how “the AI has a different ontology” is compatible with “the AI understands our ontology”.)
As a longer reaction: I think my primary reason for pessimism, if the natural abstraction hypothesis is false, is that a bunch of existing proposals might work if the hypothesis were true, but don’t work if it is false. (EG, if the hypothesis is true, I can imagine that “do a lot of RLHF, and then ramp up the AI’s intelligence” could just work. Similarly for “just train the AI to not be deceptive”.)
If I had to gesture at an underlying principle, then perhaps it could be something like: Suppose we successfully code up an AI which is pretty good at optimising, or create a process which gives rise to such an AI. [Inference step missing here.] Then the goals and planning of this AI will be happening in some ontology which allows for low prediction error. But this will be completely alien to our ontology. [Inference step missing here.] And, therefore, things that score very highly with respect to these (“alien”) goals will have roughly no value[1] according to our preferences.
(I am not quite clear on this, but I think that if this paragraph were false, then you could come up with a way of falsifying my earlier description of what it looks like when the natural abstraction hypothesis is false.)
[1] IE, no positive value, but also no negative value. So no S-risk.
Thanks for that explanation.
Thanks, this makes sense to me.
Yeah, I guess I’m unsure about that ‘[Inference step missing here.]’. My guess is that such a system would be able to recognize situations where things that score highly with respect to its ontology would score poorly, or would be likely to score poorly, under a human ontology. For example, it would be able to simulate a human deliberating on this for a very long time and coming to some conclusion.
I imagine that the cases where this would be scary are some narrow ones (though perhaps likely ones) where the system is dramatically intelligent in specific ways but incredibly inept in others. This ineptness isn’t severe enough to stop it from taking over the world, but it is enough to stop it from being at all able to maximize goals. And, for some reason, it also doesn’t take basic risk measures like “just keep a bunch of humans around and chat to them a whole lot, when curious”, or “try to first make a better AI that doesn’t have these failures, before doing huge unilateralist actions”.
It’s very hard for me to imagine such an agent, but that doesn’t mean it isn’t possible, or perhaps even likely.
[I am confused about your response. I fully endorse your paragraph on “the AI with superior ontology would be able to predict how humans would react to things”. But then the follow-up, on when this would be scary, seems mostly irrelevant / wrong to me, which means that I am missing some implicit assumptions, misunderstanding how you view this, etc. I will try to react in a hopefully-helpful way, but I might be completely missing the mark here, in which case I apologise :).]
I think the problem is that there is a difference between:
(1) AI which can predict how things score in human ontology; and
(2) AI which has “select things that score high in human ontology” as part of its goal[1].
And then, in the worlds where the natural abstraction hypothesis is false: Most AIs achieve (1) as a by-product of the instrumental sub-goal of having low prediction error / being selected by our training processes / being able to manipulate humans. But us successfully achieving (2) for a powerful AI would require the natural abstraction hypothesis to hold[2].
And this leaves us two options. First, maybe we just have no write access to the AI’s utility function at all. (EG, my neighbour would be very happy if I gave him $10k, but he doesn’t have any way of making me (intrinsically) desire doing that.) Second, we might have write access to the AI’s utility function, but not in a way that will lead to predictable changes in goals or behaviour. (EG, if you give me full access to the weights of an LLM, it’s not like I know how to use that to turn that LLM into an actually-helpful assistant.)
(And both of these seem scary to me, because of the argument that “not-fully-aligned goal + extremely powerful optimisation ==> extinction”. Which I didn’t argue for here.)
[1] IE, not just instrumentally because it is pretending to be aligned while becoming more powerful, etc.
[2] More precisely: Damn, we need better terminology here. The way I understand things, the “natural abstraction hypothesis” is the claim that most AIs will converge to an ontology that is similar to ours. The negation of that is that a non-trivial portion of AIs will use an ontology that is different from ours. What I subscribe to is that “almost no powerful AIs will use an ontology that is similar to ours”. Let’s call that the “strong negation” of the natural abstraction hypothesis. So achieving (2) would be a counterexample to this strong negation.
Ironically, I believe the strong negation hypothesis because I expect that very powerful AIs will arrive at similar ways of modelling the world, and those are all different from how we model the world.