If you wanted a provable guarantee before powerful AI systems are actually built, you probably can’t do it without the things you listed.
I’m claiming that as we get powerful AI systems, we could figure out techniques that work with those AI systems. Initially they only need to work for AI systems that are around our level of intelligence; we can then improve our techniques in tandem as the AI systems gain intelligence. In that setting, I’m relatively optimistic about things like “just train the AI to follow your instructions”; while this will break down in exotic cases or as the AI scales up, those cases are rare and hard to find.
I’m not really thinking about provable guarantees per se. I’m just thinking about how to point to the AI’s concept of human values—directly point to it, not point to some proxy of it, because proxies break down etc.
(Rough heuristic here: it is not possible to point directly at an abstract object in the territory. Even though a territory often supports certain natural abstractions, which are instrumentally convergent to learn/use, we still can’t unambiguously point to that abstraction in the territory—only in the map.)
A proxy is probably good enough for a lot of applications with little scale and few corner cases. And if we’re doing something like “train the AI to follow your instructions”, then a proxy is exactly what we’ll get. But if you want, say, an AI which “tries to help”—as opposed to e.g. an AI which tries to look like it’s helping—then that means pointing directly to human values, not to a proxy.
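(To make the “tries to look like it’s helping” worry concrete, here is a toy numerical sketch, nothing specific to this exchange, of the standard regressional-Goodhart effect: if the proxy is the true value plus independent noise, then the harder you optimize the proxy, the more the option you end up picking overstates how helpful it actually is.)

```python
# Toy illustration (assumptions: proxy = true value + independent Gaussian noise,
# and "scaling up" is modeled as searching over more candidate options).
import numpy as np

rng = np.random.default_rng(0)

def pick_by_proxy(n_options: int, trials: int = 200):
    """Select the option with the highest proxy score; report the average
    proxy score vs. the average true value of the selected options."""
    proxy_scores, true_values = [], []
    for _ in range(trials):
        true = rng.normal(size=n_options)          # what we actually care about
        proxy = true + rng.normal(size=n_options)  # imperfect pointer to it
        i = int(np.argmax(proxy))                  # optimize against the proxy
        proxy_scores.append(proxy[i])
        true_values.append(true[i])
    return float(np.mean(proxy_scores)), float(np.mean(true_values))

for n in (10, 1_000, 100_000):
    looks_helpful, actually_helpful = pick_by_proxy(n)
    print(f"{n:>7} options: proxy says {looks_helpful:.2f}, "
          f"true value is {actually_helpful:.2f}")
```

The gap between what the proxy reports and what you actually get widens as the search gets stronger, which is the sense in which a proxy that is fine at small scale and few corner cases stops being fine under heavy optimization pressure.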
Now, it is possible that we could train an AI against a proxy, and it would end up pointing to actual human values instead, simply due to imperfect optimization during training. I think that’s what you have in mind, and I do think it’s plausible, even if it sounds a bit crazy. Of course, without better theoretical tools, we still wouldn’t have a way to directly check even in hindsight whether the AI actually wound up pointing to human values or not. (Again, not talking about provable guarantees here, I just want to be able to look at the AI’s own internal data structures and figure out (a) whether it has a notion of human values, and (b) whether it’s actually trying to act in accordance with them, or just something correlated with them.)
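(As a gesture at what the weakest version of check (a) might look like in practice: a linear probe over hidden activations can test whether some concept is even decodable from the AI’s internals. Everything below is hypothetical; the activations and labels are assumed to come from somewhere, and a probe at best finds a correlate of the concept in the “map” sense above. It says nothing about (b), i.e. whether the AI is actually acting on that representation.)

```python
# Hypothetical sketch, not a method anyone in this exchange is proposing:
# a linear probe as a crude check for whether a concept is linearly decodable
# from a model's internal activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_for_concept(activations: np.ndarray, labels: np.ndarray) -> float:
    """Fit a linear probe on hidden activations and return held-out accuracy.

    activations: (n_examples, hidden_dim) array taken from some layer.
    labels:      (n_examples,) binary labels for the concept of interest.
    High accuracy means the concept is decodable from the internals, NOT
    that the model is using that representation when it chooses actions.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        activations, labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return float(probe.score(X_test, y_test))
```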
it is possible that we could train an AI against a proxy, and it would end up pointing to actual human values instead, simply due to imperfect optimization during training. I think that’s what you have in mind
Kind of, but not exactly.
I think that whatever proxy is learned will not be a perfect pointer. I don’t know if there is such a thing as a “perfect pointer”, given that I don’t think there is a “right” answer to the question of what human values are, and consequently I don’t think there is a “right” answer to what is helpful vs. not helpful.
I think the learned proxy will be a good enough pointer that the agent will not be actively trying to kill us all, will let us correct it, and will generally do useful things. It seems likely that if the agent were magically scaled up a lot, then bad things could happen due to the errors in the pointer. But I’d hope that as the agent scales up, we improve and correct the pointer (where “we” doesn’t have to be just humans; it could also include other AI assistants).