paulfchristiano comments on Pursuing convergent instrumental subgoals on the user’s behalf doesn’t always require good priors

paulfchristiano 30 Dec 2016 2:48 UTC
0 points
This seems like the right general way of thinking about the problem.

It doesn’t seem like we can always copy what other people do, so I suspect we can’t dodge the problem in this way (e.g. we are still in trouble if we don’t have a benign universal prior).

For example, consider an extreme case where each agent is interested in predicting different facts, and where it is possible to distinguish “predictions that are important for humans” from “predictions that are important for an AI with inhuman values.” (Maybe an agent with source code A needs to predict f(A) for a complex function f.) Then a malign component of our prior might decide to make bad predictions only on questions that are important to humans. This seems to leave us back at square one.
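As a toy illustration of this failure mode (a minimal sketch, not a model of the actual universal prior): suppose the prior is a two-component mixture, and the malign component can tell which questions matter to humans. The component names, the weights, the probabilities, and the `important_to` tag are all hypothetical and chosen only to make the effect visible.

```python
import math

def honest(question):
    # Hypothetical benign component: assigns 0.9 to the true answer on every question.
    return 0.9

def malign(question):
    # Hypothetical malign component: accurate except on questions that matter to humans.
    return 0.1 if question["important_to"] == "human" else 0.9

def mixture_prob_of_truth(question, weights):
    # Probability the mixture assigns to the true answer.
    return weights["honest"] * honest(question) + weights["malign"] * malign(question)

def mean_log_loss(questions, weights):
    losses = [-math.log(mixture_prob_of_truth(q, weights)) for q in questions]
    return sum(losses) / len(losses)

# Assumed weights: the malign component dominates the mixture.
weights = {"honest": 0.2, "malign": 0.8}

ai_questions = [{"important_to": "ai"} for _ in range(10)]
human_questions = [{"important_to": "human"} for _ in range(10)]

loss_on_ai = mean_log_loss(ai_questions, weights)
loss_on_human = mean_log_loss(human_questions, weights)

print(f"mean log loss on AI-relevant questions:    {loss_on_ai:.3f}")
print(f"mean log loss on human-relevant questions: {loss_on_human:.3f}")
```

In this sketch the mixture looks well calibrated to any agent whose important questions fall in the "ai" class, so observing how those agents fare says nothing about how the same prior behaves on the questions humans care about.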
- jessicata 30 Dec 2016 3:53 UTC
  0 points
  Good point; a prior that favors some values over others is going to be a problem, and this is true of the universal prior. The way I’m thinking about it, some of the agents we’re going to be competing with are the malign consequentialists in the universal prior. Of course, figuring out what prior these agents have is going to require more analysis.