I think this is the key to corrigibility: the AI needs a decision/learning process that reflects the facts that a) its current value model is likely to be partially incorrect, b) what it is actually attempting to optimize is the value under the correct value model, which it doesn’t yet know and is trying to learn, and c) correction input from its programmers is a valuable source of information about inaccuracies in its current value model, which will help it improve that model and thus increase the expected true value of its future actions.
If an agent understands these facts, it will then need to divide its efforts between a) acting on its current value model and b) trying to (as safely and efficiently as possible) gather information with which to improve its current value model (using Bayesian logic, machine learning, or whatever). This ends up looking rather like the Multi-Armed Bandit problem: the optimum is usually to learn first and then gradually shift over towards acting more. Optimizing this shifting division of effort is going to require it to maintain some sort of internal model of how much confidence it currently has in its value model in different parts of the space of all possible actions/outcomes that it is optimizing over. That confidence model is also quite useful for evaluating the utility of actions like “Before taking an action that seems optimal to you but is in an area where your confidence in your current value model’s accuracy is low, ask your programmers what they think about it.”
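As a rough illustration of the “ask before acting when confidence is low” rule, here is a minimal Python sketch. Everything in it is an assumption made for illustration and not part of the original argument: the `Candidate` type, the `choose` function, and the `ask_cost` and `caution` parameters are hypothetical names, and the rule itself (ask whenever the scaled uncertainty of the apparent best action exceeds the cost of asking) is only one simple way such a tradeoff might be encoded.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    est_value: float  # value under the agent's current value model
    std_error: float  # agent's own uncertainty about that estimate


def choose(candidates, ask_cost=0.5, caution=1.0):
    """Return ('act', name) or ('ask', name) for the apparent best action.

    Hypothetical rule: if the uncertainty around the apparent optimum,
    scaled by `caution`, outweighs the cost of interrupting the
    programmers, ask them first; otherwise act.
    """
    best = max(candidates, key=lambda c: c.est_value)
    if caution * best.std_error > ask_cost:
        return ("ask", best.name)
    return ("act", best.name)


familiar = [Candidate("tidy desk", 1.0, 0.1), Candidate("file report", 1.2, 0.1)]
novel = [Candidate("tidy desk", 1.0, 0.1), Candidate("rebuild office", 3.0, 2.0)]

print(choose(familiar))
print(choose(novel))
```

In this toy setup the agent simply acts when its best option lies in a well-understood region, and defers to its programmers when the apparent optimum sits where its value model is poorly calibrated. A fuller treatment would presumably estimate the value of the information gained by asking, not just compare against a fixed threshold.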
To instill greater caution in the AI, it’s probably a good idea to give it Bayesian priors suggesting that, in regions where its current value model turns out to be incorrect, the model is more likely to be an overestimate than an underestimate. The reasoning is that humans are in general both pretty happy with the status quo that they’ve made for themselves and picky about how it gets changed, and they understand (and have thus already encoded into the AI’s program) how much they like the status quo and small changes to it better than they understand large changes from it. So inaccuracies in the value model are more likely in areas where the true value is low, and are more likely to cause overestimates than underestimates.
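One very simple way to picture such a prior is as a pessimistic correction that grows with distance from the status quo. The sketch below is only an assumption-laden toy: `corrected_value`, `distance_from_status_quo`, and `bias_rate` are hypothetical names, and the linear form of the correction is chosen for simplicity, not derived from anything above.

```python
def corrected_value(est_value, distance_from_status_quo, bias_rate=0.5):
    """Shade an estimate downward in proportion to how far the outcome
    lies from the well-understood status quo.

    Hypothetical model: the prior expected overestimate is
    bias_rate * distance, so familiar outcomes are left nearly untouched
    while unfamiliar ones are discounted.
    """
    return est_value - bias_rate * distance_from_status_quo


small_change = corrected_value(5.0, distance_from_status_quo=1.0)
large_change = corrected_value(6.0, distance_from_status_quo=4.0)

# The large change looks better on the raw estimate (6.0 vs 5.0),
# but the small change wins once the skewed prior is applied.
print(small_change, large_change)
```

The point of the toy is only that a prior skewed toward overestimation can reverse the ranking of a modest, well-understood change and a dramatic, poorly understood one.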
It would also be a good idea to encode awareness of the fact that its inbuilt maximization search of currently expected value over a large space of actions/outcomes is inherently much more likely to find an apparent-but-false maximum caused by a local value-overestimate than it is to find a true maximum where the AI is in fact also underestimating the true value. So there is a tendency for apparent optima to be points where the value function is overestimated rather than underestimated by the model used in the search (statistically/analytically quantifying this tendency, and how it depends on both the form of the value function and variations in its true and estimated accuracy, would be an interesting mathematical project, if it hasn’t already been done). There’s a kind of Look-Elsewhere effect here: the larger your search space, the more likely it is that your search includes a point where the estimated value is incorrect by an amount that is large compared to your standard-error estimate of the size of its local inaccuracy (this point may or may not end up being the apparent maximum as a result).
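The tendency for apparent optima to be overestimates can be checked with a small Monte Carlo experiment; this selection effect is often discussed under the name “optimizer’s curse”. The sketch below makes several assumptions that are not in the argument above: true values and estimation errors are both drawn from independent Gaussians, the errors are unbiased, and `selection_bias` is a hypothetical helper name.

```python
import random


def selection_bias(n_options, trials=2000, noise_sd=1.0, seed=0):
    """Average (estimated - true) value at the option that maximizes the
    noisy estimate, over many random trials.

    Assumed toy model: true values ~ N(0, 1), and each estimate is the
    true value plus independent N(0, noise_sd) error with zero mean.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        true_vals = [rng.gauss(0.0, 1.0) for _ in range(n_options)]
        est_vals = [t + rng.gauss(0.0, noise_sd) for t in true_vals]
        # Pick the apparent optimum, as a naive maximizer would.
        best = max(range(n_options), key=est_vals.__getitem__)
        total += est_vals[best] - true_vals[best]
    return total / trials


for n in (1, 10, 100):
    print(n, round(selection_bias(n), 2))
```

Even though each individual estimate is unbiased in this model, the estimate at the chosen point is biased upward, and the bias grows as the search space gets larger, which is the Look-Elsewhere flavor of the effect. With only one option there is no selection, so the average error should be close to zero.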