I think that alignment approaches with a heavy reliance on output evaluation are doomed, both on the grounds that humans can’t evaluate the effectiveness of a plan capable of ending the acute risk period, [...]
The way you say this, and the way Eliezer wrote Point 30 in AGI ruin, sounds like you think there is no AGI text output with which humans alone could execute a pivotal act.
This surprises me. For one thing, if the AGI outputs the textbook of the future on alignment, I’d say we could understand that sufficiently well to be sure that our AI will be aligned/corrigible. (And sure, there’s the possibility that we could be hacked through text so that we merely think it’s safe, but I’d expect that to be significantly harder to achieve than just outputting a correct solution to alignment.)
But even if we say humans would need to do a pivotal act without AGI, I’d intuitively guess an AGI could give us the tools (e.g. non-AGI algorithms we can understand) and relevant knowledge to do it ourselves.
To be clear, I do not think we can get an AGI that outputs the text we’d need for a weak pivotal act without the AGI destroying the world. And if we could do that, there may well be a safer way to let a corrigible AI do a pivotal act.
So I agree it’s not a strategy that could work in practice.
The way you phrase it, though, sounds like it’s not even possible in theory, which would be pretty surprising to me, so I want to ask whether you actually think that or just mean it’s not possible in practice:
1. Do you agree that in the unrealistic hypothetical case where we could build a safe AGI that outputs the textbook of the future on alignment (but we somehow don’t have the knowledge to build other aligned or corrigible AGI directly), we’d survive?
2. If future humans (say 50 years in the future) could transmit 10MB of instructions through a time machine, except they are not allowed to tell us how to build aligned or corrigible AGI or how to find out how to build aligned or corrigible AGI (and so on up the meta-levels), do you think they could still transmit information with which we would be able to execute a pivotal act ourselves?
I’m also curious about your answer on:
3. If we had a high-rank dath ilani keeper teleported into our world, but he is not allowed to build aligned or corrigible AI and cannot tell anyone how or make us find the solution to alignment etc, could he save the world without using AGI? By what margin? (Let’s assume all the pivotal-act specific knowledge from dath ilan is deleted from the keeper’s mind as he arrives here.)
(E.g. I’m currently at P(doom) > 85%, but P(doom | tomorrow such a keeper is teleported here) ≈ 12%. (Most of the uncertainty comes from the possibility that my model is wrong. So I think with ~80% probability that if we got such a keeper, he’d almost certainly be able to save the world, but in the remaining 20% where I misestimated something, it might still be pretty hard.))
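(To spell out the arithmetic behind those numbers: treating doom in the model-right branch as roughly 0, the stated figures imply roughly 60% doom in the model-wrong branch, an inferred value rather than one stated above:)

$$
P(\text{doom} \mid \text{keeper}) \;\approx\; \underbrace{0.8}_{\text{model right}} \times 0 \;+\; \underbrace{0.2}_{\text{model wrong}} \times 0.6 \;=\; 0.12
$$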
if the AGI outputs the textbook of the future on alignment, I’d say we could understand that sufficiently well to be sure that our AI will be aligned/corrigible
It’s at least not obvious to me that we would be able to tell apart “the textbook of the future on alignment” from “an artifact that purports to be the textbook of the future on alignment, but is in fact a manipulative trick”. At least, I think we wouldn’t be able to tell them apart in the limit of superintelligent AI.
I mean, it’s not like I’ve never found an argument compelling and then realized much later on that it was wrong. Hasn’t everyone?
I’d say there’s a big difference between fooling you into a “brilliant, that really looks plausible” reaction shortly after you read it, and fooling a group of smart humans who spend months trying to deeply understand the concepts and to make really sure there are no loopholes.
In fact, I’d expect that making us strongly but wrongly believe that everything works, even after months, is impossible even in the limit of superintelligence, though I do think a superintelligence could output some text that destroys/shapes the world as it’d like.
And generally, something smart enough to solve alignment will likely be smart enough to break out of the box and take over the world, as said.
But yeah, if the people with the AGI aren’t extremely cautious and just go ahead and quickly build AGI because it all looks correct, then that might go badly. But my point was that it is within the reach of human checkability.
I think Nate’s original argument holds, but might need a bit of elaboration:
Roughly speaking, I think that alignment approaches with a heavy reliance on output evaluation are doomed, [both] on the grounds that humans can’t evaluate the effectiveness of a plan capable of ending the acute risk period, and [...].
I think there are classes of pivotal plans + their descriptions such that, if the AI gave one of them to us, we could tell whether it is pivotal-and-good or pivotal-and-bad. But there are also other classes of pivotal plans + their descriptions such that you will think you can tell the difference, but you can’t.
So the observation “this seems good and I am super-convinced I could tell if it wasn’t” isn’t, by itself, enough. To trust the AI’s plan, you need some additional argument for why it is the kind of AI that wouldn’t try to deceive you, or why it wouldn’t search through dangerous plans, yada yada. But essentially that means you aren’t relying on the plan verification step anymore.
I’d guess we can likely reliably identify some classes of pivotal acts where we cannot be fooled easily, and would only accept suggestions from those classes, and I’d still intuitively expect that there are doable pivotal acts in those classes.