This seems like an interesting idea for how to build an AI system in practice, along the same lines as corrigibility. We notice that value learning is not very robust: if you aren’t very good at value learning, then you can get very bad behavior, and human values are complex enough that you need to be very capable to learn them well. With (a particular kind of) corrigibility, we instead set the goal to be to make an AI system that is trying to help us, which seems more achievable even when the AI system is not very capable. Similarly, if we formalize or learn informed consent reasonably well (which seems easier to do, since informed consent is not as complex as “human values”), then our AI systems will likely have good behavior (though probably not the best possible behavior, since they are limited by having to respect informed consent).
However, this also feels different from corrigibility, in that it feels more like a limitation placed on the AI system, while corrigibility seems more like a property of the AI’s “motivational system”. This might be fine, since the AI might just not be goal-directed (and so wouldn’t try to work around the limitation). One other benefit of corrigibility is that if you are “somewhat” corrigible, then you would want to become more corrigible, since that is what the human would prefer; informed-consent-AI doesn’t seem to have an analogous benefit.