I want to meta-learn value learning and then apply it to the object-level case of human values. The hope is that, after being trained and tested on a variety of other agents, the predictor will be able to learn human goals as well. It may also be possible to develop a method for testing whether the system is properly aligned; I suspect that checking whether it can output the neural net of the user would be a good test (though this is potentially vulnerable to deception if something goes wrong, and hard to check against humans). But if it learns to reliably recover the connections and weights of other agents, perhaps it can learn to understand the human mind as well.
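To make the idea concrete, here is a minimal sketch of what "meta-learning value learning" could look like in the simplest toy setting: synthetic agents are each defined by a hidden goal vector, their observed behavior is a soft preference over randomly presented options, and a single predictor network is trained across many such agents to recover the hidden goal from behavior alone. Everything here (the softmax agents, the MLP predictor, the goal-vector representation of values) is an illustrative assumption of mine, not a proposal from the post itself, and a real version would need far richer agents and behavior.

```python
# Toy sketch of meta-learned value learning (all design choices here are
# illustrative assumptions, not a definitive implementation).
import torch
import torch.nn as nn

GOAL_DIM = 4    # dimensionality of each synthetic agent's hidden goal
N_ACTIONS = 8   # number of candidate options the agent scores per step
TRAJ_LEN = 20   # number of observed choices per agent

def sample_agent_and_behavior():
    """Create a synthetic agent (a hidden goal vector) and record its
    preferences over random options, yielding a (behavior, goal) pair."""
    goal = torch.randn(GOAL_DIM)
    behavior = []
    for _ in range(TRAJ_LEN):
        options = torch.randn(N_ACTIONS, GOAL_DIM)   # candidate actions
        scores = options @ goal                       # agent's valuation
        choice = torch.softmax(scores, dim=0)         # soft preference
        # Observed behavior: the options shown plus the preference over them.
        behavior.append(torch.cat([options.flatten(), choice]))
    return torch.cat(behavior), goal

# The "value learner": maps observed behavior to a guess at the hidden goal.
predictor = nn.Sequential(
    nn.Linear(TRAJ_LEN * (N_ACTIONS * GOAL_DIM + N_ACTIONS), 128),
    nn.ReLU(),
    nn.Linear(128, GOAL_DIM),
)
opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Meta-training loop: many different agents, one shared predictor.
for step in range(5000):
    behavior, goal = sample_agent_and_behavior()
    pred_goal = predictor(behavior)
    loss = loss_fn(pred_goal, goal)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In this toy version, the proposed alignment test would amount to checking whether the trained predictor can recover a held-out agent's goal vector (or, more ambitiously, its full parameters) from its behavior alone; the analogous test against humans is the hard, open part.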