Come to think of it, couldn’t this be applied to model corrigibility itself?
Have an AI that’s constantly coming up with predictive models of human preferences, generating an ensemble of plans for satisfying human preferences according to each model. Then break those plans into landmarks and look for clusters in goal-space.
Each cluster could then form a candidate basin of attraction of goals for the AI to pursue. The center of each basin would represent a “robust bottleneck” that would be helpful across predictive models; the breadth of each basin would account for the variance in landmark features; and the depth/attractiveness of each basin would be proportional to the number of predictive models that have landmarks in that cluster.
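To make the center/breadth/depth idea a bit more concrete, here is a minimal sketch of how basins might be read off a set of landmarks. Everything in it is an assumption for illustration: landmarks are taken to be plain coordinate tuples in some goal-space, each tagged with a hypothetical model ID, and `find_basins`, `Basin`, and the `radius` threshold are made-up names. The greedy radius-based grouping is just a stand-in for whatever clustering method would actually be appropriate.

```python
from dataclasses import dataclass
from math import dist, sqrt


@dataclass
class Basin:
    center: tuple   # mean position of the landmarks in the cluster
    breadth: float  # RMS distance of the landmarks from the center
    depth: int      # number of distinct models with a landmark here


def _mean(points):
    """Component-wise mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def find_basins(landmarks, radius):
    """Group (model_id, point) landmarks into candidate basins.

    Greedy clustering (an illustrative choice, not a recommendation):
    each landmark joins the first cluster whose current centroid is
    within `radius`, otherwise it starts a new cluster.
    """
    clusters = []  # each cluster is a list of (model_id, point) pairs
    for model_id, point in landmarks:
        for members in clusters:
            centroid = _mean([p for _, p in members])
            if dist(centroid, point) <= radius:
                members.append((model_id, point))
                break
        else:
            clusters.append([(model_id, point)])

    basins = []
    for members in clusters:
        points = [p for _, p in members]
        center = _mean(points)
        breadth = sqrt(sum(dist(center, p) ** 2 for p in points) / len(points))
        depth = len({model_id for model_id, _ in members})
        basins.append(Basin(center, breadth, depth))

    # Deepest basins first: these are supported by the most models.
    return sorted(basins, key=lambda b: b.depth, reverse=True)


# Hypothetical landmarks from three preference models in a 2-D goal-space.
landmarks = [
    ("model_a", (0.0, 0.0)),
    ("model_b", (0.2, 0.0)),
    ("model_c", (0.1, 0.3)),
    ("model_a", (5.0, 5.0)),
    ("model_b", (5.2, 5.0)),
    ("model_c", (9.0, 1.0)),
]
basins = find_basins(landmarks, radius=1.0)
for basin in basins:
    print(basin)
```

With this toy data, the basin near the origin is the only one all three models agree on, so it comes out deepest. Counting distinct models rather than raw landmarks is one way to read "depth"; weighting each model by its predictive performance would be another.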
Ideally, the distribution of these basins would update continuously as each model in the ensemble becomes more predictive of human preferences (both stated and revealed), based on what the AGI learns as it interacts with humans in the real world. Any plan, including an AGI's, should stay open to revision in light of new information, so the landmarks and their clusters would necessarily shift around as well.
Assuming this is the right approach, the questions that remain would be how to structure those models of human preferences, how to measure their predictive performance, how to update the models on new information, how to use those models to generate plans, how to represent landmarks along plan paths in goal-space, how to convert a vector in goal-space into actionable behavior for the AI to pursue, etc., etc., etc. Okay, yeah, there would still be a lot of work left to do.