This seems like the right general way of thinking about the problem.
It doesn’t seem like we can always copy what other people do, so I suspect we can’t dodge the problem in this way (e.g. we are still in trouble if we don’t have a benign universal prior).
For example, consider an extreme case where each agent is interested in predicting different facts, and where it is possible to distinguish “predictions that are important for humans” from “predictions that are important for an AI with inhuman values.” (Maybe an agent with source code A needs to predict f(A) for a complex function f.) Then a malign component of our prior might decide to make bad predictions only on questions that are important to humans. This seems to leave us back at square one.
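To make the worry concrete, here is a toy sketch. It is not anyone's actual proposal, just a two-component Bayesian mixture with made-up numbers. The names `honest`, `malign`, and `run`, the specific probabilities, and the assumption that the malign component predicts slightly better on questions humans don't care about are all hypothetical. The point is only that a component can earn posterior weight on unimportant questions and then spend it on the one that matters.

```python
def posterior_weights(weights, likelihoods):
    """One Bayesian update: reweight each component by the probability
    it assigned to the outcome that actually occurred, then normalize."""
    unnorm = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def honest(important_to_humans):
    # Hypothetical component: same decent accuracy on every question.
    return 0.8


def malign(important_to_humans):
    # Hypothetical component: slightly better than honest on questions
    # humans don't care about, deliberately bad on the ones they do.
    return 0.1 if important_to_humans else 0.9


def run(n_unimportant=50):
    weights = [0.5, 0.5]  # prior over [honest, malign], assumed uniform
    # Questions that are not important to humans: both components are
    # scored, and the malign one steadily gains weight.
    for _ in range(n_unimportant):
        weights = posterior_weights(weights, [honest(False), malign(False)])
    # Mixture probability for the true answer on a human-important question.
    mixture_p = weights[0] * honest(True) + weights[1] * malign(True)
    return weights, mixture_p


if __name__ == "__main__":
    weights, mixture_p = run()
    print("posterior [honest, malign]:", weights)
    print("mixture probability of the true answer:", mixture_p)
```

With these numbers the malign component ends up holding almost all of the posterior, so the mixture puts low probability on the true answer exactly where it matters to humans, even though it looked better than the honest component everywhere else.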
Good point; a prior that favors some values over others is going to be a problem, and this is true of the universal prior. The way I’m thinking about it, some of the agents we’re going to be competing with are the malign consequentialists in the universal prior. Of course, figuring out what prior these agents have is going to require more analysis.