How do you know when you have solved the value extrapolation problem?
One hypothesis I have for what you might say is something like “a training scheme solves the value extrapolation problem when the sequence of inputs that will be seen in deployment by the AI produced by that training scheme leads to outputs which lead to positive outcomes by human lights,” though from what I can tell, that’s basically the same as having a training scheme that leads to an “impact aligned” AI*.
If it isn’t this, how is your answer different?
*[ETA: the definition of impact alignment that Evan gives in the linked post technically only refers to an AI “which doesn’t take actions that we would judge to be bad/problematic/dangerous/catastrophic,” but in my comment above, I meant to refer to what I think is the more relevant property for an AI to have, which I’ll call (impact aligned)_Jack: an agent is (impact aligned)_Jack to the degree that, by human lights, it doesn’t take bad actions and does take good actions. I think that this is more relevant because Evan’s definition doesn’t distinguish between a rock and an intuitively aligned AI.]
Knowing that we’ve solved the problem relies on knowing the innards of the algorithm we’ve designed, and proving theorems about it, rather than looking solely at its behaviour.
Oh I see—could you say more about what characteristics you want the innards to have?
Ping about my other comment—FYI, because I am currently concerned that you don’t have criteria for the innards in mind, I’m less excited about your agenda than other alignment theory agendas (though this lack of excitement is somewhat weak, e.g. since I haven’t tried to digest your work much yet).
Let me develop the idea a bit more. It is somewhat akin to answering, in 1968, the question “how do you know you’ve solved the moon landing problem?” In that case, NASA could point to having solved a host of related problems (getting into space, getting to the moon, module separation, module reconnection), knowing that their lander could theoretically land on the moon (via knowledge of the laws of physics and of their lander design), estimating that the pilots were capable of dealing with likely contingencies, trusting that their model of the lunar landing problem was correct and covered various likely contingencies, etc… and then putting it all together into a plan where they could say “successful lunar landing is likely”.
Note that various parts of these assumptions could be tested; engineers could probe at the plan with questions like “what if the conductivity of the lunar surface is unusual?”, and check whether the plan could cope with that.
Back to value extrapolation. We’d be confident that it is likely to work if, for example:
1. It works well in all situations where we can completely test it (e.g. we have a list of human moral principles, and we can have an AI successfully run a school using those as input).
2. It works well on testable subproblems of more complicated situations (e.g. we inspect the AI’s behaviour in specific situations).
3. We have models of how value extrapolation works in extreme situations, and strong theoretical arguments that those models are correct.
4. We have developed a much better theoretical understanding of value extrapolation, and are confident that it works.
5. We’ve studied the problem adversarially and failed to break the approach.
6. We have deployed interpretability methods to look inside the AI at certain places, and what we’ve seen is what we expect to see.
These are the sort of things that could make us confident that a new approach could work. Is this what you are thinking?
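To make the flavour of this concrete, here is a purely illustrative toy sketch (not any existing tool or method; every name, result, and weight in it is invented) of how several partly independent lines of evidence, like the six conditions above, might be aggregated into a rough confidence judgement:

```python
# Toy illustration only: every class, function, name, and weight here is
# invented for this comment; nothing corresponds to an existing tool.
from dataclasses import dataclass


@dataclass
class EvidenceItem:
    name: str       # which of the six conditions above this corresponds to
    passed: bool    # did this line of evidence come out favourably?
    weight: float   # how much it should move our overall confidence


def overall_confidence(evidence: list[EvidenceItem]) -> float:
    """Crude aggregation: weighted fraction of favourable evidence.

    A real assessment would not be a weighted average; this only shows the
    structure of combining several partly independent lines of evidence.
    """
    total = sum(item.weight for item in evidence)
    favourable = sum(item.weight for item in evidence if item.passed)
    return favourable / total if total > 0 else 0.0


# Hypothetical results for the six conditions listed above (all made up):
checklist = [
    EvidenceItem("works in fully testable situations", True, 1.0),
    EvidenceItem("works on testable subproblems", True, 1.0),
    EvidenceItem("models of extreme situations + theory", False, 2.0),
    EvidenceItem("general theoretical understanding", False, 2.0),
    EvidenceItem("survived adversarial probing", True, 1.5),
    EvidenceItem("interpretability checks match expectations", True, 1.5),
]
print(f"overall confidence: {overall_confidence(checklist):.2f}")
```

Obviously a weighted average is far too crude for the real thing; the point is only that confidence would come from several different kinds of evidence, none sufficient on its own.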
Thanks for this list!
Though the list still doesn’t strike me as very novel: it feels like most of these conditions are ones we’ve been shooting for anyway.
E.g. conditions 1, 2, and 5 are about selecting for behavior we approve of, and condition 6 is just inspection with interpretability tools.
If you feel you have traction on conditions 3 and 4, that does seem novel (side note: condition 4 seems to be a subset of condition 3). I’m skeptical, though, since value extrapolation seems about as hard a problem as understanding machine generalization in general, and “the way a thing behaves in a large class of cases” seems like such a complicated concept that you won’t be able to form confident beliefs about it or understand it. I don’t have a concrete argument for this, though.
Anyways, thanks for responding, and if you have any thoughts about the tractability of conditions 3 and 4, I’m pretty curious.
Yes, the list isn’t very novel—I was trying to think of the mix of theoretical and practical results that convince us, in the current world, that a new approach will work. Obviously we want a lot more rigour for something like AI alignment! But there is an urgency to get it fast, too :-(