Do you agree that an AI with extreme capabilities should know what you mean, even if it doesn’t act in accordance with it? (This seems like an implication of “extreme capabilities”.)
No. The whole notion of a human “meaning things” presumes a certain level of abstraction. One could imagine an AI simply reasoning about molecules or fields (or at least individual neurons), without having any need for viewing certain chunks of matter as humans who mean things. In principle, no predictive power whatsoever would be lost in that view of the world.
That said, I do think that problem is less central/immediate than the problem of taking an AI which does know what we mean, and pointing at that AI’s concept-of-what-we-mean—i.e. in order to program the AI to do what we mean. Even if an AI learns a concept of human values, we still need to be able to point to that concept within the AI’s concept-space in order to actually align it—and that means translating between AI-notion-of-what-we-want and our-notion-of-what-we-want.
That’s the crux for me; I expect AI systems that we build to be capable of “knowing what you mean” (using the appropriate level of abstraction). They may also use other levels of abstraction, but I expect them to be capable of using that one.
Even if an AI learns a concept of human values, we still need to be able to point to that concept within the AI’s concept-space in order to actually align it
Yes, I would call that the central problem. (Though it would also be fine to build a pointer to a human and have the AI “help the human”, without necessarily pointing to human values.)
Yes, I would call that the central problem. (Though it would also be fine to build a pointer to a human and have the AI “help the human”, without necessarily pointing to human values.)
How would we do either of those things without a workable theory of embedded agency, abstraction, some idea of what kind-of-structure human values have, etc.?
If you wanted a provable guarantee before powerful AI systems are actually built, you probably can’t do it without the things you listed.
I’m claiming that as we get powerful AI systems, we could figure out techniques that work with those AI systems. They only initially need to work for AI systems that are around our level of intelligence, and then we can improve our techniques in tandem with the AI systems gaining intelligence. In that setting, I’m relatively optimistic about things like “just train the AI to follow your instructions”; while this will break down in exotic cases or as the AI scales up, those cases are rare and hard to find.
I’m not really thinking about provable guarantees per se. I’m just thinking about how to point to the AI’s concept of human values—directly point to it, not point to some proxy of it, because proxies break down etc.
(Rough heuristic here: it is not possible to point directly at an abstract object in the territory. Even though a territory often supports certain natural abstractions, which are instrumentally convergent to learn/use, we still can’t unambiguously point to that abstraction in the territory—only in the map.)
A proxy is probably good enough for a lot of applications with little scale and few corner cases. And if we’re doing something like “train the AI to follow your instructions”, then a proxy is exactly what we’ll get. But if you want, say, an AI which “tries to help”—as opposed to e.g. an AI which tries to look like it’s helping—then that means pointing directly to human values, not to a proxy.
Now, it is possible that we could train an AI against a proxy, and it would end up pointing to actual human values instead, simply due to imperfect optimization during training. I think that’s what you have in mind, and I do think it’s plausible, even if it sounds a bit crazy. Of course, without better theoretical tools, we still wouldn’t have a way to directly check, even in hindsight, whether the AI actually wound up pointing to human values or not. (Again, I’m not talking about provable guarantees here; I just want to be able to look at the AI’s own internal data structures and figure out (a) whether it has a notion of human values, and (b) whether it’s actually trying to act in accordance with them, or just with something correlated with them.)
it is possible that we could train an AI against a proxy, and it would end up pointing to actual human values instead, simply due to imperfect optimization during training. I think that’s what you have in mind
Kind of, but not exactly.
I think that whatever proxy is learned will not be a perfect pointer. I don’t know if there is such a thing as a “perfect pointer”, given that I don’t think there is a “right” answer to the question of what human values are, and consequently I don’t think there is a “right” answer to what is helpful vs. not helpful.
I think the learned proxy will be a good enough pointer that the agent will not be actively trying to kill us all, will let us correct it, and will generally do useful things. It seems likely that if the agent were magically scaled up a lot, then bad things could happen due to the errors in the pointer. But I’d hope that as the agent scales up, we improve and correct the pointer (where “we” doesn’t have to be just humans; it could also include other AI assistants).
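To make the scaling worry concrete, here is a toy sketch in Python. It is not a model of any real training setup. Everything in it is a made-up assumption for illustration: `true_value` stands in for "what we actually want", `proxy` for an imperfect learned pointer that agrees with it only over a small range, and `reach` for how much optimization power the agent has.

```python
def true_value(x: float) -> float:
    """Hypothetical 'what we actually want': improves up to x = 5, then gets worse."""
    return x - 0.1 * x * x


def proxy(x: float) -> float:
    """Hypothetical learned proxy: ranks options like true_value only for small x."""
    return x


def best_by(score, reach: int) -> int:
    """Pick the option in 0..reach with the highest score.

    `reach` is a crude stand-in for optimization power: a larger reach
    means the agent can search further from familiar territory.
    """
    return max(range(reach + 1), key=score)


def regret(reach: int) -> float:
    """True value lost by optimizing the proxy instead of the true objective."""
    chosen = best_by(proxy, reach)
    ideal = best_by(true_value, reach)
    return true_value(ideal) - true_value(chosen)


if __name__ == "__main__":
    for reach in (2, 5, 10, 20):
        print(f"reach={reach:2d}  chosen={best_by(proxy, reach):2d}  "
              f"regret={regret(reach):.2f}")
```

With these made-up functions, the regret is zero at reach 2 and 5, then 2.5 at reach 10 and 22.5 at reach 20. The errors in the pointer cost nothing until the search extends past the region where proxy and target agree. Correcting the pointer as the agent scales would correspond to repairing `proxy` before `reach` grows.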