Would it be fair to summarize your view here as “Assuming no foom, we’ll be able to iterate, and that’s probably enough.”?
Hmm, I think I’d want to explicitly include two other points that are kind of included in that, but don’t get communicated well by that summary:
There may not be a problem at all; perhaps by default powerful AI systems are not goal-directed.
If there is a problem, we’ll get evidence of its existence before it’s too late, and coordination to not build problematic AI systems will buy us additional time.
Cool, just wanted to make sure I’m engaging with the main argument here. With that out of the way...
I generally buy the “no foom ⇒ iterate ⇒ probably ok” scenario. There are some caveats and qualifications, but broadly-defined “no foom” is a crux for me: I expect at least some kind of decisive strategic advantage for early AGI, but would find the “aligned by default” scenario plausible in a no-foom world.
I do not think that a lack of goal-directedness is particularly relevant here. If an AI has extreme capabilities, then a lack of goals doesn’t really make it any safer. At some point I’ll probably write a post about Don Norman’s fridge which talks about this in more depth, but the short version is: if we have an AI with extreme capabilities but a confusing interface, then there’s a high chance that we all die, goal-direction or not. In the “no foom” scenario, we’re assuming the AI won’t have those extreme capabilities, but it’s foom vs no foom which matters there, not goals vs no goals.
I also disagree that coordination has any hope whatsoever if there is a problem. There’s a huge unilateralist problem there, with millions of people each easily able to push the shiny red button. I think straight-up solving all of the technical alignment problems would be much easier than that coordination problem.
Looking at both the first and third point, I suspect that a sub-crux might be expectations about the resource requirements (i.e. compute & data) needed for AGI. I expect that, once we have the key concepts, human-level AGI will be able to run in realtime on an ordinary laptop. (Training might require more resources, at least early on. That would reduce the unilateralist problem, but increase the chance of decisive strategic advantage due to the higher barrier to entry.)
EDIT: to clarify, the latter two points are both conditioned on foom. Point being, the only thing which actually matters here is foom vs no foom:
if there’s no foom, then we can probably iterate, and then we’re probably fine anyway (regardless of goal-direction, coordination, etc).
if there’s foom, then a lack of goal-direction won’t help much, and coordination is unlikely to work.
the only thing which actually matters here is foom vs no foom
Yeah, I think I mostly agree with this.
if we have an AI with extreme capabilities but a confusing interface, then there’s a high chance that we all die
Yeah, I agree with that (assuming “extreme capabilities” = rearranging atoms however it sees fit, or something of that nature), but why must it have a confusing interface? Couldn’t you just talk to it, and it would know what you mean? So I do think the goal-directed point does matter.
I suspect that a sub-crux might be expectations about the resource requirements (i.e. compute & data) needed for AGI. I expect that, once we have the key concepts, human-level AGI will be able to run in realtime on an ordinary laptop.
I agree that this is a sub-crux. Note that I do believe human-level AGI will eventually be able to run on a laptop; I just expect it to be preceded by human-level AGIs that take more compute.
Training might require more resources, at least early on. That would reduce the unilateralist problem, but increase the chance of decisive strategic advantage due to the higher barrier to entry.
I tend to think that if problems arise, you’ve mostly lost already, so I’m actually happier about decisive strategic advantage because it reduces competitive pressure.
But to be clear, I broadly agree with all of your points, and do think that in FOOM worlds most of my arguments don’t work. (Though I continue to be confused about what exactly a FOOM world looks like.)
but why must it have a confusing interface? Couldn’t you just talk to it, and it would know what you mean?
That’s where the Don Norman part comes in. Interfaces to complicated systems are confusing by default. The general problem of systematically building non-confusing interfaces is, in my mind at least, roughly equivalent to the full technical problem of AI alignment. (Writing a program which knows what you mean is also, in my mind, roughly equivalent to the full technical problem of AI alignment.) A wording which makes it more obvious:
The main problem of AI alignment is to translate what a human wants into a format usable by a machine.
The main problem of user interface design is to help/allow a human to translate what they want into a format usable by a machine.
Something like e.g. tool AI puts more of the translation burden on the human, rather than on the AI, but that doesn’t make the translation itself any less difficult.
In a non-foomy world, the translation doesn’t have to be perfect—humanity won’t be wiped out if the AI doesn’t quite perfectly understand what we mean. Extreme capabilities make high-quality translation more important, not just because of Goodhart, but because the translation itself will break down in scenarios very different from what humans are used to. So if the AI has the capabilities to achieve scenarios very different from what humans are used to, then that translation needs to be quite good.
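A minimal toy sketch of that last point, with made-up value functions and numbers purely for illustration: a “translation” of what we want, fit to the range of situations humans are used to, can look fine under weak optimization and still fall apart when a more capable optimizer pushes into states far outside that range.

```python
import numpy as np

def true_value(x):
    # Stand-in for "what we actually want": good at moderate x, bad at extremes.
    return x * np.exp(-x / 10.0)

def proxy_value(x):
    # Stand-in for the "translated" objective: a linear fit to human judgments
    # over the familiar range x in [0, 5].
    xs = np.linspace(0.0, 5.0, 100)
    slope, intercept = np.polyfit(xs, true_value(xs), 1)
    return slope * x + intercept

# Weak optimizer: can only reach familiar states.
familiar = np.linspace(0.0, 5.0, 500)
best_familiar = max(familiar, key=proxy_value)

# Strong optimizer: can reach states far outside anything humans are used to.
extreme = np.linspace(0.0, 100.0, 500)
best_extreme = max(extreme, key=proxy_value)

print(true_value(best_familiar))  # ~3.0: the imperfect translation was fine here
print(true_value(best_extreme))   # ~0.005: the same translation breaks down badly
```

The fit is equally imperfect in both cases; only the optimizer’s reach changes.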
Do you agree that an AI with extreme capabilities should know what you mean, even if it doesn’t act in accordance with it? (This seems like an implication of “extreme capabilities”.)
No. The whole notion of a human “meaning things” presumes a certain level of abstraction. One could imagine an AI simply reasoning about molecules or fields (or at least individual neurons), without having any need for viewing certain chunks of matter as humans who mean things. In principle, no predictive power whatsoever would be lost in that view of the world.
That said, I do think that problem is less central/immediate than the problem of taking an AI which does know what we mean, and pointing at that AI’s concept-of-what-we-mean—i.e. in order to program the AI to do what we mean. Even if an AI learns a concept of human values, we still need to be able to point to that concept within the AI’s concept-space in order to actually align it—and that means translating between AI-notion-of-what-we-want and our-notion-of-what-we-want.
That’s the crux for me; I expect AI systems that we build to be capable of “knowing what you mean” (using the appropriate level of abstraction). They may also use other levels of abstraction, but I expect them to be capable of using that one.
Even if an AI learns a concept of human values, we still need to be able to point to that concept within the AI’s concept-space in order to actually align it
Yes, I would call that the central problem. (Though it would also be fine to build a pointer to a human and have the AI “help the human”, without necessarily pointing to human values.)
Yes, I would call that the central problem. (Though it would also be fine to build a pointer to a human and have the AI “help the human”, without necessarily pointing to human values.)
How would we do either of those things without a workable theory of embedded agency, abstraction, some idea of what kind of structure human values have, etc.?
If you want a provable guarantee before powerful AI systems are actually built, you probably can’t get one without the things you listed.
I’m claiming that as we get powerful AI systems, we could figure out techniques that work with those AI systems. They only initially need to work for AI systems that are around our level of intelligence, and then we can improve our techniques in tandem with the AI systems gaining intelligence. In that setting, I’m relatively optimistic about things like “just train the AI to follow your instructions”; while this will break down in exotic cases or as the AI scales up, those cases are rare and hard to find.
I’m not really thinking about provable guarantees per se. I’m just thinking about how to point to the AI’s concept of human values—directly point to it, not point to some proxy of it, because proxies break down etc.
(Rough heuristic here: it is not possible to point directly at an abstract object in the territory. Even though a territory often supports certain natural abstractions, which are instrumentally convergent to learn/use, we still can’t unambiguously point to that abstraction in the territory—only in the map.)
A proxy is probably good enough for a lot of applications with little scale and few corner cases. And if we’re doing something like “train the AI to follow your instructions”, then a proxy is exactly what we’ll get. But if you want, say, an AI which “tries to help”—as opposed to e.g. an AI which tries to look like it’s helping—then that means pointing directly to human values, not to a proxy.
Now, it is possible that we could train an AI against a proxy, and it would end up pointing to actual human values instead, simply due to imperfect optimization during training. I think that’s what you have in mind, and I do think it’s plausible, even if it sounds a bit crazy. Of course, without better theoretical tools, we still wouldn’t have a way to directly check even in hindsight whether the AI actually wound up pointing to human values or not. (Again, not talking about provable guarantees here; I just want to be able to look at the AI’s own internal data structures and figure out (a) whether it has a notion of human values, and (b) whether it’s actually trying to act in accordance with them, or just something correlated with them.)
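One way to make the weakest version of (a) concrete, as a sketch only: if we can read out a model’s internal activations and we have labels for some concept on the same inputs, a simple linear probe can at least tell us whether that concept is linearly decodable from the internals, compared against a shuffled-label baseline. Everything below (the activations, the concept labels) is a random stand-in rather than a real model, and decodability falls well short of knowing whether the system is actually acting on the concept, i.e. question (b).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

n, d = 2000, 128
# Pretend these are hidden-layer activations collected from some model.
activations = rng.normal(size=(n, d))
# Pretend this labels whether the concept of interest applies to each input;
# here it is (secretly) a linear function of a few activation dimensions.
concept = activations[:, :4].sum(axis=1) > 0

probe_acc = cross_val_score(
    LogisticRegression(max_iter=1000), activations, concept, cv=5).mean()
baseline_acc = cross_val_score(
    LogisticRegression(max_iter=1000), activations, rng.permutation(concept), cv=5).mean()

print(f"probe accuracy:    {probe_acc:.2f}")     # well above chance: concept is linearly present
print(f"shuffled baseline: {baseline_acc:.2f}")  # ~0.50
```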
it is possible that we could train an AI against a proxy, and it would end up pointing to actual human values instead, simply due to imperfect optimization during training. I think that’s what you have in mind
Kind of, but not exactly.
I think that whatever proxy is learned will not be a perfect pointer. I don’t know if there is such a thing as a “perfect pointer”, given that I don’t think there is a “right” answer to the question of what human values are, and consequently I don’t think there is a “right” answer to what is helpful vs. not helpful.
I think the learned proxy will be a good enough pointer that the agent will not be actively trying to kill us all, will let us correct it, and will generally do useful things. It seems likely that if the agent was magically scaled up a lot, then bad things could happen due to the errors in the pointer. But I’d hope that as the agent scales up, we improve and correct the pointer (where “we” doesn’t have to be just humans; it could also include other AI assistants).
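A toy sketch of that hope, reusing the made-up value function from before; nothing here is a real proposal, just an illustration of “correcting the pointer as the agent’s reach grows” versus freezing it:

```python
import numpy as np

def true_value(x):
    # Same made-up stand-in for what we actually want.
    return x * np.exp(-x / 10.0)

def frozen_pointer():
    # A proxy fit once, early on, to the familiar range [0, 5], then never corrected.
    xs = np.linspace(0.0, 5.0, 100)
    slope, intercept = np.polyfit(xs, true_value(xs), 1)
    return lambda x: slope * x + intercept

def corrected_pointer(reach):
    # A proxy re-elicited from (simulated) human judgments over everything
    # currently reachable, each time the agent's capabilities grow.
    xs = np.linspace(0.0, reach, 200)
    vals = true_value(xs)
    return lambda x: np.interp(x, xs, vals)

frozen = frozen_pointer()
reach = 5.0
for _ in range(5):
    candidates = np.linspace(0.0, reach, 1000)
    chosen_frozen = max(candidates, key=frozen)
    chosen_corrected = max(candidates, key=corrected_pointer(reach))
    print(f"reach={reach:7.1f}  frozen -> true value {true_value(chosen_frozen):.2f}"
          f"   corrected -> true value {true_value(chosen_corrected):.2f}")
    reach *= 4  # capabilities scale up
```

Whether anything like this works for the real thing is exactly what is in dispute above; the toy only shows what the claim is, not that it is true.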