In the real world, these domains aren’t the sort of thing where you get a perfect simulation. The differences will strongly add up when you strongly train an AI to maximize <this thing which was a good predictor of diamonds in the more restricted domain of <the domain, as viewed by the AI that was trained to predict the environment> >.
We are now far from your original objection: “I don’t even know how to make an agent with a clear utility function module.”
Imperfect simulations work just fine for humans and various DL agents. So for your argument to be correct, you now need to explain how humans can still think and steer the future with imperfect world models; once you do that, you will understand how an AI can as well.
We’re not far from there. There’s inferential distance here. Translating my original statement, I’d say: the closest thing to the “utility function module” in the scenario you’re describing here with MuZero is the concept of predicted diamond and the AI it’s inside of. But then you train another AI to pursue that. And I’m saying, I don’t trust that that new trained AI actually maximizes diamond; and more to the point, I don’t have any clarity on how the goals of the newly trained AI sit inside it, operate inside it, direct its behavior, etc. In particular, I don’t understand it well enough to have any justified confidence it’ll robustly pursue diamond.
So to be clear, there is just one AI, built out of several components: a world model, a planning engine, and a utility function. The world model is learned, but assumed to be learned perfectly (resulting in a functional equivalent of the actual sim physics). The planning engine can also learn action/value estimators for efficiency, but that is not required. The utility function is not learned at all, and is manually coded. So the learning components here cannot possibly cause any problems.
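To make the decomposition concrete, here is a minimal illustrative sketch in Python of an agent split into those three components. The names (WorldModel, plan, utility, sim_step) are placeholders of my own, not any real library’s API, and the planner is a simple random-shooting search standing in for a real planning engine:

```python
import random


def utility(state):
    # Hand-coded, not learned: count diamonds in the (simulated) state.
    return state.count("diamond")


class WorldModel:
    """Learned transition model; here assumed learned perfectly, so it just
    wraps the actual sim physics."""

    def __init__(self, sim_step):
        self.sim_step = sim_step  # in a real system this would be a trained network

    def predict(self, state, action):
        return self.sim_step(state, action)


def plan(world_model, state, actions, horizon=3, rollouts=200):
    """Compute-bound planner: random-shooting search over action sequences,
    scoring each predicted end state with the hand-coded utility function."""
    best_first_action, best_value = actions[0], float("-inf")
    for _ in range(rollouts):
        seq = [random.choice(actions) for _ in range(horizon)]
        s = state
        for a in seq:
            s = world_model.predict(s, a)
        value = utility(s)
        if value > best_value:
            best_first_action, best_value = seq[0], value
    return best_first_action


# Toy usage: states are tuples of items; "mine" adds a diamond, "wait" does nothing.
def sim_step(state, action):
    return state + ("diamond",) if action == "mine" else state


world_model = WorldModel(sim_step)
print(plan(world_model, state=(), actions=["mine", "wait"]))  # almost surely "mine"
```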
Of course that’s just in a sim.
Translating the concept to the real world, there are now 3 possible sources of ‘errors’:

1. imperfection of the learned world model
2. imperfect planning (compute bound)
3. imperfect utility function
My main claim is that approximation errors in 1 and 2 (which are inevitable) don’t necessarily bias strong optimization towards the wrong utility function (and really, they can’t).