This works great when you can recognize good things within the representation the AI uses to think about the world. But what if that’s not true?
Here’s the optimistic case:
Suppose you build a Go-playing AI that defers to you for its values, but the only things it represents are states of the Go board, and functions over states of the Go board. You want to tell it to win at Go, but it doesn’t represent that concept; you have to tell it what “win at Go” means in terms of a value function from states of the Go board to real numbers. If (like me) you have a hard time telling when you’re winning at Go, maybe you just generate as many obviously-winning positions as you can and label them all as high-value, everything else low-value. And this sort of works! The Go-playing AI tries to steer the gameboard into one of these obviously-winning states, and then it stops, and maybe it could win more games of Go if it also valued the less-obviously-winning positions, but that’s alright.
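To make the setup concrete, here is a minimal sketch of that hand-labeled value function. The `Board` type and the set of hand-picked positions are hypothetical stand-ins, not anything from the discussion itself; the point is just that value is only defined over states the human can confidently recognize.

```python
# Minimal sketch of the hand-labeled Go value function described above.
# "Board" and the winning-position set are hypothetical stand-ins.
from typing import FrozenSet, Tuple

Board = FrozenSet[Tuple[int, int, str]]  # e.g. (x, y, colour) stone placements

# Positions the human labeler is confident count as winning, generated by hand.
obviously_winning_positions: set = set()

def value(board: Board) -> float:
    """High value only for positions the human recognized as winning."""
    return 1.0 if board in obviously_winning_positions else 0.0

# The Go AI steers toward any state with value 1.0 and then stops, even
# though many unlabeled positions would also have led to wins.
```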
Why is that optimistic?
Because it doesn’t scale to the real world. An AI that learns about and acts in the real world doesn’t have a simple gameboard that we just need to find some obviously-good arrangements of. At the base level it has raw sensor feeds and motor outputs, which we are not smart enough to define success in terms of directly. And as it processes its sensory data it (by default) generates representations and internal states that are useful for it, but not simple for humans to understand, or good things to try to put value functions over. In fact, an entire intelligent system can operate without ever internally representing the things we want to put value functions over.
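For contrast, here is an equally hypothetical sketch of the real-world case: the analogue of the board state is now a learned latent vector produced by the agent’s own encoder (the shapes and names below are assumptions), and there is no human-recognizable set of “obviously good” points in that space to hand-label.

```python
import numpy as np

# Contrast with the Go sketch: a hypothetical real-world agent's "state"
# is a learned latent, not a legible gameboard. Shapes/names are illustrative.
rng = np.random.default_rng(0)
encoder_weights = rng.standard_normal((64 * 64 * 3, 512)) * 0.01  # stand-in for a learned encoder

def encode(sensor_frame: np.ndarray) -> np.ndarray:
    """Map a raw 64x64 RGB frame to a 512-dim internal representation.

    The dimensions mean whatever was useful to the agent's own learning,
    not anything a human labeler can read off.
    """
    return np.tanh(sensor_frame.reshape(-1) @ encoder_weights)

# The Go trick would now require a hand-curated set of "obviously good"
# latent vectors -- but humans can't recognize good outcomes by inspecting
# points in this space, so there is nothing to label.
obviously_good_latents: set = set()  # stays empty in practice

def value(latent: np.ndarray) -> float:
    return 1.0 if latent.tobytes() in obviously_good_latents else 0.0
```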
Here’s a nice post from the past: https://www.lesswrong.com/posts/Mizt7thg22iFiKERM/concept-safety-the-problem-of-alien-concepts
Sorry for taking a ridiculously long time to get back to you. I was dealing with some stuff.
Yes, that is correct. As I said in the article, a high degree of interpretability is necessary to use the idea.
It’s true that interpretability is required, but the key point of my scheme is this: interpretability is all you need for intent alignment, provided my scheme is correct. I don’t know of any other alignment strategies for which this is the case. So, my scheme, if correct, basically allows you to bypass what is plausibly the hardest part of AI safety: robust value-loading.
Of course, I could be wrong about this, but if the technique is correct, it seems like quite a promising AI safety technique to me.
Does this seem reasonable? I may very well just be misunderstanding or missing something.
My point was that you don’t just need interpretability, you need the AI to “meet you halfway” by already learning the right concept that you want to interpret. You might also need it to not learn “spurious” concepts that fit the data but generalize poorly. This doesn’t happen by default AFAICT; it needs to be designed for.
I hadn’t fully appreciated the difficulty that could result from AIs having alien concepts, so thanks for bringing it up.
However, it seems to me that this would not be a big problem, provided the AI is still interpretable. I’ll provide two ways to handle this.
For one, you could potentially translate the human concepts you care about into statements using the AI’s concepts. Even if the AI doesn’t use the same concepts people do, AIs are still incentivized to form a detailed model of the world. If you have access to all of the AI’s world model but still can’t figure out basic things like whether the model says the world gets destroyed or the AI takes over the world, then that model doesn’t seem very interpretable. So I’m skeptical that this would really be a problem.
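One way to picture that translation step, as a hedged sketch: fit a simple probe from read-outs of the AI’s internal world-model states to a human-labeled concept. The data below is synthetic and every name is an assumption; in reality the latents would come from whatever interpretability access we have to the model, and the labels from human judgment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for interpretability access: pretend we can read the AI's
# latent world-model state for each situation as a feature vector.
# These "latents" are synthetic; in reality they come from the model.
latents = rng.standard_normal((200, 64))

# Stand-in human labels for a concept we care about, e.g.
# "in this predicted outcome, humans are still in control".
# Synthetic labels: here the concept happens to be linearly readable.
true_direction = rng.standard_normal(64)
labels = (latents @ true_direction > 0).astype(int)

# The "translation": a simple probe from the AI's representation to the
# human concept, fit on the human-labeled examples.
probe = LogisticRegression(max_iter=1000).fit(latents, labels)

def concept_holds(latent_state: np.ndarray) -> bool:
    """Query the human concept in terms of the AI's own representation."""
    return bool(probe.predict(latent_state.reshape(1, -1))[0])

print(concept_holds(latents[0]))
```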
But, if it is, it seems to me that there’s a way to get the AI to have non-alien concepts.
In a comment thread with another person, I made a modification to the system: the people outputting utilities should be able to refuse to output one for a given query, for example because the situation is too complicated or too vague for humans to judge the desirability of. This could potentially let people keep the AI from ending up with very alien concepts.
To deal with alien concepts, you can just have the people refuse to provide an answer about the utility of a possible outcome if its description is too alien to understand. This way, the AI would need to come up with reasonably non-alien concepts before it could get any of its calls to its utility function to work.
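Here is a toy sketch of that refusal rule. The concept whitelist and the canned utilities are stand-ins I’m assuming purely for illustration; in the proposal itself the humans themselves make both judgments.

```python
from typing import Optional

# Toy sketch of the "refuse to answer" rule. The whitelist and canned
# utilities are illustrative stand-ins for human judgment.
HUMAN_LEGIBLE_CONCEPTS = {"people", "alive", "healthy", "in control", "paperclips"}

def humans_can_understand(description: set) -> bool:
    """Stand-in check: the outcome is described only in familiar concepts."""
    return description <= HUMAN_LEGIBLE_CONCEPTS

def human_utility_oracle(description: set) -> Optional[float]:
    """Return a utility for the described outcome, or None to refuse."""
    if not humans_can_understand(description):
        return None  # explicit refusal, not a low score
    # Stand-in for the humans actually judging the described outcome.
    return 1.0 if {"people", "alive", "in control"} <= description else 0.0

# A refused query gives the AI no usable feedback, so the only way to get
# utilities out of the oracle is to describe outcomes in concepts the
# humans can actually evaluate.
print(human_utility_oracle({"people", "alive", "in control"}))  # 1.0
print(human_utility_oracle({"xyzzy-latent-417", "people"}))     # None (refused)
```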