I found this post to be interesting. I feel it might be waving at something important. Thanks for writing it.
I’m confused by a bunch of things here, notably point (4.) in the five-point list at the end of the post. I think it’s doing a lot of work, and likely hiding some difficult problems. Below are some notes and questions. If OP (or anyone else) feels like answering the questions, or pointing out where I’m wrong/confused, that’d be really neat.
(i)
If the Predictor is sufficiently intelligent to model human-level Users, it would not be safe to train/run.(?) If, on the other hand, the Predictor is too dumb to model human-level Users, then it’s useless for learning “human values”.
If so, then: The Predictor would need to be kept dumb during training (before it has learned human values); e.g. by limiting the number of parameters in the Predictor model. But also, when deployed, it would need to be able to gain further capabilities—at least up till “human level”.
And yet, when deployed in the real world, the Predictor would lose access to the reward function built into the training environment.
Thus, questions:
How does the Predictor learn/gain capabilities during deployment? How is it scored/rewarded?
How, concretely, do we safely move from the training regime to the deployment regime? (Use different-sized models in training and deployment, and somehow (how?) transfer the smaller (trained) model’s capabilities to the larger one, and then… ???)
(ii)
If the Predictor’s reward function is of type UserState → Reward, then the Predictor would likely end up learning to deceive/hack the User. I’m assuming the Predictor’s reward function is of type EnvironmentState → Reward.(?)
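To pin down the type distinction I have in mind, here is a rough sketch in Python type hints (UserState, EnvironmentState, and the alias names are placeholders I’m inventing for illustration, not anything from the post):

```python
from dataclasses import dataclass
from typing import Callable

# Placeholder state types, just to pin down the two signatures.
@dataclass
class UserState:
    """Internal state of the User agent (beliefs, satisfaction, ...)."""

@dataclass
class EnvironmentState:
    """State of the training environment itself."""

# Variant (a): reward read off the User's internals.
# This is the variant I'd expect to incentivize deceiving/hacking the User.
RewardFromUserState = Callable[[UserState], float]

# Variant (b): reward computed from the environment state.
# This is what I'm assuming the scheme intends.
RewardFromEnvironmentState = Callable[[EnvironmentState], float]
```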
But then: It seems the main benefit of the proposed scheme would come from training the Predictor to interact with other agents in order to learn their goals. (If not, then how does this scheme differ meaningfully from one where the Predictor-agent is alone in an environment, without a User-agent?)
If that’s so, then: The usefulness of this approach would depend critically on things like
the interface / “interaction channels” between User and Predictor
the User’s utility function being very hard to learn without interacting with the User
the User’s utility function also depending on the User’s internal state
Is that correct?
How does the Predictor learn/gain capabilities during deployment? How is it scored/rewarded?
How, concretely, do we safely move from the training regime to the deployment regime? (Use different-sized models in training and deployment, and somehow (how?) transfer the smaller (trained) model’s capabilities to the larger one, and then… ???)
If it has properly aligned meta-learning of values, it might be possible to allow that to direct continuing learning on the object level. And even if its object-level model of human values were static, it could still continue learning about the world, with that model providing the scoring needed.
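As a very rough sketch of what I mean by a static value model providing the scoring (every name below is a stand-in I’m making up here, not a concrete design):

```python
from typing import Protocol

class WorldModel(Protocol):
    def predict(self, observation: object, action: int) -> object: ...
    def update(self, observation: object, action: int, outcome: object) -> None: ...

class ValueModel(Protocol):
    """Learned model of the user's values; possibly frozen after training."""
    def evaluate(self, outcome: object) -> float: ...

class Policy(Protocol):
    def choose_action(self, observation: object) -> int: ...
    def update(self, observation: object, action: int, score: float) -> None: ...

def deployment_step(world: WorldModel, values: ValueModel,
                    policy: Policy, observation: object) -> int:
    """One step of continued object-level learning after deployment.

    The learned value model supplies the score that the training
    environment's built-in reward function used to supply, so learning
    about the world can continue even if the value model itself is static."""
    action = policy.choose_action(observation)
    predicted_outcome = world.predict(observation, action)
    score = values.evaluate(predicted_outcome)   # scoring comes from the value model
    policy.update(observation, action, score)    # object-level learning continues
    world.update(observation, action, predicted_outcome)
    return action
```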
If the Predictor’s reward function is of type UserState → Reward, then the Predictor would likely end up learning to deceive/hack the User. I’m assuming the Predictor’s reward function is of type EnvironmentState → Reward.(?)
But then: It seems the main benefit of the proposed scheme would come from training the Predictor to interact with other agents in order to learn their goals. (If not, then how does this scheme differ meaningfully from one where the Predictor-agent is alone in an environment, without a User-agent?)
If that’s so, then: The usefulness of this approach would depend critically on things like
the interface / “interaction channels” between User and Predictor
the User’s utility function being very hard to learn without interacting with the User
the User’s utility function also depending on the User’s internal state
Is that correct?
The predictor’s reward function comes from being able to fulfill the user’s values in training. It’s deliberately not based on the user’s state (for exactly the reason you state; such a reward would incentivize hacking the user!) or on the object-level environmental state (which is hard to specify in a way that doesn’t result in catastrophe). The hope is that by training an intelligent agent to learn utility functions superhumanly well, it might be able to specify our values to an acceptable level, even though we ourselves cannot articulate them well enough (which is why EnvironmentState → Reward is hard to do right).

The main benefit comes from that prediction, which may come from interaction with other agents or passive observation, or more likely a mixture of the two. You are correct that this depends critically on having a sufficiently clear channel between the user and predictor for it to do this. However, if the user’s utility function isn’t hard to learn without interacting with it, so much the better! The goal is an AI that’s good at aligning to a given user’s values, whether or not it has to interact to do so. Or are you asking whether such a user would be good enough for training, given that you expect human values to be unlearnable without interacting with us? That’s a valid concern, but one I hope adversarial training should solve.

Why would the user’s utility function need to depend on its internal state? Is that because our utility functions do, and you’re worried that a predictor that hadn’t seen that before wouldn’t be able to learn them? If so, again valid, and hopefully again resolvable by adversarial training.
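For concreteness, a toy sketch of the kind of training loop I have in mind (everything in it, make_environment, user_utility, and so on, is an invented placeholder rather than a real implementation):

```python
def train_predictor(predictor, make_environment, num_episodes: int) -> None:
    """Toy sketch of the training regime.

    Each episode draws a fresh environment containing a user whose utility
    function is randomly generated and never shown to the predictor. The
    episode's reward is that hidden utility applied to what actually
    happened -- not a hand-written environment-level objective, and not a
    function of the user's internal state."""
    for _ in range(num_episodes):
        env = make_environment()        # random environment + random user utility
        observation = env.reset()
        done = False
        while not done:
            # The predictor may observe the user, interact with it, or both.
            action = predictor.act(observation)
            observation, done = env.step(action)
        # Score the episode by how well the user's values were fulfilled.
        reward = env.user_utility(env.outcome())
        predictor.learn(reward)
```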
I still feel like there’s a big hole in this scheme. Maybe I’m just not getting something. Here’s a summary of my current model of this scheme, and what I find confusing:
We train an agent (the Predictor) in a large number of different environments, with a large number of different utility/reward functions. (And some of those reward functions may or may not be easier to learn by interacting with certain parts of the environment which we humans think of as “the User agent”.)
Presumably/hopefully, this leads to the Predictor learning things like caution/low impact, or heuristics like “seek to non-destructively observe other agent-like things in the environment”. (And if so, that could be very useful!)
But: We cannot use humans in the training environments. And: We cannot train the Predictor to superhuman levels in the training environments (for the obvious reason that it wouldn’t be aligned).
When we put the Predictor in the real world (to learn human values), it could no longer be scored the same way as in the training environments—because the real world does not have a built-in reward function we could plug into the Predictor.
Thus: When deployed, the Predictor would be sub-human, and also not have access to a reward function.
And so, question: How do we train the Predictor to learn human values, and become highly capable, once it is in the real world? What do we use as a reward signal?
The closest thing to an answer I found was this:
And even if its object-level model of human values were static, it could still continue learning about the world, with that model providing the scoring needed.
But IIUC this is suggesting that we could/should build a powerful AI with an imperfect/incorrect and static (fixed/unchangeable) model V of human values, and then let that AI bootstrap to superintelligence, with V as its reward function? That seems like an obviously terrible idea, so maybe I’ve misunderstood something? What?
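To spell out why that worries me, here is a tiny toy example of my own (not something from the post): a static proxy V that is roughly right on the outcomes it was fit to, yet catastrophic for the true values when optimized hard.

```python
def true_utility(x: float) -> float:
    # What the user actually wants: the best outcome is at x = 1.
    return -(x - 1.0) ** 2

def learned_proxy_v(x: float) -> float:
    # A static, imperfect model of the above, fit in a region where
    # "more is better" looked right; it never saw the cost of overshooting.
    return float(x)

# A weak optimizer, confined to the region where V is roughly accurate:
mild = max(range(0, 2), key=learned_proxy_v)        # picks x = 1
# A strong optimizer, pushing V as far as it will go:
extreme = max(range(0, 1000), key=learned_proxy_v)  # picks x = 999

print(true_utility(mild), true_utility(extreme))    # 0.0 -996004.0
```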