I still feel like there’s a big hole in this scheme. Maybe I’m just not getting something. Here’s a summary of my current model of this scheme, and what I find confusing:
We train an agent (the Predictor) in a large number of different environments, with a large number of different utility/reward functions. (And some of those reward functions may or may not be easier to learn by interacting with certain parts of the environment which we humans think of as “the User agent”.)
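For concreteness, here is a toy sketch of how I picture that training phase. Every name in it (make_env, sample_reward_fn, Predictor, and so on) is my own hypothetical placeholder, not something from the post:

```python
# Toy sketch (my own construction, not from the post) of the training phase:
# many environments, each paired with a reward function sampled at training
# time, none of which involve real humans.
import random

NUM_ENVIRONMENTS = 100

def make_env(seed):
    """Stand-in for one of many procedurally generated training environments."""
    return {"state": random.Random(seed).random()}

def sample_reward_fn(seed):
    """Stand-in for one of many possible utility/reward functions. In the
    scheme, some of these are easiest to learn by interacting with an
    agent-like part of the environment (the 'User agent')."""
    target = random.Random(seed + 1).random()
    return lambda state, action: -abs(state + action - target)

class Predictor:
    """Stand-in learner; the real scheme envisions a far more capable agent."""
    def act(self, state):
        return random.uniform(-1.0, 1.0)

    def update(self, state, action, reward):
        pass  # learning step omitted in this sketch

def train(predictor, episodes=1000, steps=10):
    for _ in range(episodes):
        seed = random.randrange(NUM_ENVIRONMENTS)
        env = make_env(seed)
        reward_fn = sample_reward_fn(seed)  # the training setup supplies the score
        for _ in range(steps):
            action = predictor.act(env["state"])
            env["state"] += 0.1 * action
            predictor.update(env["state"], action, reward_fn(env["state"], action))

train(Predictor())
```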
Presumably/hopefully, this leads to the Predictor learning things like caution/low impact, or heuristics like “seek to non-destructively observe other agent-like things in the environment”. (And if so, that could be very useful!)
But: We cannot use humans in the training environments. And: We cannot train the Predictor to superhuman levels in the training environments (for the obvious reason that it wouldn’t be aligned).
When we put the Predictor in the real world (to learn human values), it can no longer be scored the way it was in the training environments, because the real world has no built-in reward function we could plug into the Predictor.
Thus: When deployed, the Predictor would be sub-human, and also not have access to a reward function.
And so, question: How do we train the Predictor to learn human values, and become highly capable, once it is in the real world? What do we use as a reward signal?
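To make that gap concrete, here is the same kind of loop at deployment time, again with purely hypothetical names of my own; the point is just that nothing fills the reward slot:

```python
# Sketch of the deployment-time gap (hypothetical names, mine not the post's):
# the same update loop, except the real world supplies no reward function.

def deploy(predictor, real_world):
    obs = real_world.observe()
    while True:
        action = predictor.act(obs)
        obs = real_world.step(action)
        reward = None  # nothing in the real world fills this slot;
                       # this missing piece is what my question is about
        predictor.update(obs, action, reward)
```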
The closest thing to an answer I found was this:
"And even if it's object level model of human values was static, it could still continue learning about the world with that model providing the scoring needed."
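In code terms, my reading of that sentence is roughly the following (V, deploy_with_static_value_model, and the rest are my own hypothetical names, not anything the post specifies):

```python
# My reading of the quoted passage, as a sketch: a fixed, imperfect learned
# model V of human values is frozen and then used as the reward function for
# open-ended further learning. All names here are hypothetical.

def deploy_with_static_value_model(predictor, real_world, V):
    """V maps (observation, action) -> estimated 'human value'. It is never
    updated, however wrong it turns out to be; only the Predictor's
    world-model and capabilities keep improving against it."""
    obs = real_world.observe()
    while True:
        action = predictor.act(obs)
        obs = real_world.step(action)
        predictor.update(obs, action, reward=V(obs, action))  # V provides all scoring
```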
But IIUC this is suggesting that we could/should build a powerful AI with an imperfect/incorrect and static (fixed/unchangeable) model V of human values, and then let that AI bootstrap to superintelligence, with V as its reward function? But that seems like an obviously terrible idea, so maybe I’ve misunderstood something? What?