My current belief on this is that the greatest difficulty is going to be finding the “human values” in the AI’s model of the world. Any AI smart enough to deceive humans will have a predictive model of humans which almost trivially must contain something that looks like “human values”. The biggest problems I see are:
1: “Human values” may not form a tight abstracted cluster in a model of the world at all. This isn’t so much a conceptual issue, since in theory we could just draw a more complex boundary around them, but it does make things practically more difficult.
2: It’s currently impossible to see what the hell is going on inside most large ML systems. Interpretability work might allow us to find the right subsection of a model (see the toy probing sketch after this list).
3: Any pointer we build to the human values in a model also needs to be stable to the model updating. If that knowledge gets moved around as parameters change, the computational tool/mathematical object which points to it needs to be able to keep track of that. This could include sudden shifts, slow drift, or a model breaking up into smaller separate models.
(I haven’t defined knowledge here. I’m not very confused about what it means to say “knowledge of X is in a particular location in the model”, but I don’t have space here to write it all up.)
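As a very loose illustration of points 2 and 3 (my own toy construction, not something from the argument above): a linear probe can locate a concept in a model’s activations, but a “pointer” built that way is fragile to the model re-representing things. The sketch below uses synthetic activations, sklearn’s LogisticRegression as the probe, and a random rotation as a stand-in for a parameter update; all of those choices are assumptions for illustration only.

```python
# Toy sketch: locate a "concept" direction in synthetic activations with a
# linear probe, then show the probe typically degrades when the representation
# is "updated" (simulated here by a random rotation) unless it is re-derived.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 2000, 64

# Synthetic "activations": one hidden direction carries a binary concept.
concept = rng.integers(0, 2, size=n)          # stand-in for a value-laden label
direction = rng.standard_normal(d)
direction /= np.linalg.norm(direction)
acts = rng.standard_normal((n, d)) + 2.0 * np.outer(concept, direction)

# "Finding the concept": a linear probe recovers roughly where it lives.
probe = LogisticRegression(max_iter=1000).fit(acts, concept)
print("probe accuracy before update:", probe.score(acts, concept))

# "Model update": the network re-represents everything (a random rotation here).
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
acts_updated = acts @ Q

# The old pointer (the fitted probe) usually no longer tracks the concept...
print("old probe on updated acts:", probe.score(acts_updated, concept))

# ...so a real pointer would need to be re-derived or tracked through the update.
probe2 = LogisticRegression(max_iter=1000).fit(acts_updated, concept)
print("re-fitted probe:", probe2.score(acts_updated, concept))
```

None of this addresses the hard part (an actual network’s representation won’t be a single clean direction, and updates won’t be tidy rotations), it just makes concrete what “a pointer that survives the model updating” would have to do.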