Sounds like you’ve independently re-invented inverse reinforcement learning. See also approval-directed agents.
IRL defines a model of humans in the environment and then fills in that model by observation. This post’s approach would use a model of agents in the environment to train a NN, but wouldn’t rely on such a model in deployment.
This evades one of the issues with IRL, which is interpreting what choices the human made from the AI’s sensory data.
Wouldn’t the agent have to learn a model of human values at some point? I thought the point of the virtual environment was to provide a place to empirically test a bunch of possible approaches to value learning. I assumed the actual deployment would consist of the predictor interacting with humans in place of the user agents (although re-reading, I notice it doesn’t explicitly say that anywhere, so I may have misunderstood).
Yes, it at least tries to learn the model used in constructing the training data (having to specify a good model is definitely an issue it shares with IRL).
An analogy might be how OpenAI trained a robot hand controller in a set of simulations with diverse physical parameters. It learned the general skill of operating in a wide variety of situations, so it could then be used directly in the real world.
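For concreteness, here is a minimal sketch of that kind of domain randomization in a toy setting: a 1-D "push the block to the origin" task where mass and friction are randomized per environment. The task, the parameter ranges, and names like `rollout` and `randomized_score` are all made up for illustration; nothing here comes from the actual OpenAI hand work.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(gain, mass, friction, steps=50):
    """Run a toy 1-D 'push the block to the origin' task; return total cost."""
    pos, vel, cost = 1.0, 0.0, 0.0
    for _ in range(steps):
        force = -gain * pos                      # simple proportional controller
        vel += (force - friction * vel) / mass   # dynamics depend on hidden params
        pos += vel
        cost += pos ** 2
    return cost

# Domain randomization: a pool of simulated environments whose physical
# parameters are drawn from broad ranges (the ranges here are arbitrary).
masses = rng.uniform(0.5, 2.0, size=200)
frictions = rng.uniform(0.1, 1.0, size=200)

def randomized_score(gain):
    return np.mean([rollout(gain, m, f) for m, f in zip(masses, frictions)])

# Pick the controller that does best on average across the whole pool,
# rather than tuning it to any single simulator.
candidates = np.linspace(0.05, 1.0, 20)
best_gain = min(candidates, key=randomized_score)

print("chosen gain:", best_gain)
print("cost in an unseen 'real' environment:", rollout(best_gain, mass=1.3, friction=0.4))
```

The controller is never tuned to any single simulator and never sees the held-out "real" parameters during selection; the hope, as in the hand result, is that being forced to handle a wide spread of dynamics is what transfers.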
That’s an excellent analogy.
The point of the virtual environment is to train an agent with a generalized ability to learn values. Eventually it would interact with humans (or perhaps human writing, depending on what works), align to our values and then be deployed. It should be fully aligned before trying to optimize the real world!
I see, so you essentially want to meta-learn value learning. Fair enough, although you then have the problem that your meta-learned value-learner might not generalize to the human-value-learning case.
I want to meta-learn value learning and then apply it to the object-level case of human values. Hopefully, after being tested on a variety of other agents, the predictor will be able to learn human goals as well. It’s also possible that a method could be developed to test whether the system was properly aligned; I suspect that seeing whether it could output the neural net of the user would be a good test (though potentially vulnerable to deception if something goes wrong, and hard to check against humans). But if it learns to reliably understand the connections and weights of other agents, perhaps it can learn to understand the human mind as well.
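As a concrete (and heavily simplified) sketch of that inner loop, assume simulated "users" with hidden linear utilities over features, and a predictor trained across many such users to infer each one's preferences from a handful of observed choices. The user model, the architecture, and names like `Predictor` are illustrative stand-ins rather than anything specified in the post:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
DIM, N_OBS = 8, 16   # feature dimension; observed choices per simulated user

def sample_user_batch(batch):
    """Each simulated 'user' has a hidden utility vector w and, between two
    options, picks the one with higher w.x.  Returns its observed choices
    (as chosen-minus-rejected feature differences) plus a fresh query pair."""
    w = torch.randn(batch, DIM)
    a, b = torch.randn(batch, N_OBS, DIM), torch.randn(batch, N_OBS, DIM)
    prefer_a = (torch.einsum('bd,bnd->bn', w, a - b) > 0).unsqueeze(-1)
    diffs = torch.where(prefer_a, a - b, b - a)          # chosen minus rejected
    qa, qb = torch.randn(batch, DIM), torch.randn(batch, DIM)
    label = (torch.einsum('bd,bd->b', w, qa - qb) > 0).float()
    return diffs, qa - qb, label

class Predictor(nn.Module):
    """Shared across all users, so what it learns is a general procedure for
    inferring a user's utility from observed behaviour, not any one utility."""
    def __init__(self):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(DIM, 64), nn.ReLU(), nn.Linear(64, DIM))
    def forward(self, diffs, query_diff):
        w_hat = self.encode(diffs).mean(dim=1)           # inferred utility direction
        return (w_hat * query_diff).sum(dim=-1)          # logit: "prefers first option"

model, loss_fn = Predictor(), nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(2000):                                 # outer loop over fresh users
    diffs, qdiff, label = sample_user_batch(64)
    loss = loss_fn(model(diffs, qdiff), label)
    opt.zero_grad(); loss.backward(); opt.step()

# Held-out users the predictor has never seen: accuracy here measures whether
# the inference procedure itself generalizes.
with torch.no_grad():
    diffs, qdiff, label = sample_user_batch(2000)
    acc = ((model(diffs, qdiff) > 0).float() == label).float().mean()
print(f"accuracy on held-out users: {acc.item():.2%}")
```

The predictor's parameters are shared across all users, so what gets trained is the value-learning procedure itself; testing on freshly sampled users is the toy analogue of hoping it will later work on the human case.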
The idea is very close to approval-directed agents, but with the process automated to provide more training than would be feasible with humans supplying the feedback. It could also bring other benefits, such as adversarial learning (which a human could not necessarily provide as well as a purpose-trained AI) and learning to reproduce the neural net comprising the user (much easier to try with an AI user than with a human, and potentially a more rigorous test of whether the predictor is actually becoming aligned versus developing harmful mesa-optimizers).
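On the "reproduce the neural net comprising the user" test, one way such a check could be scored, assuming both networks are available to run, is to compare their behaviour on random probe inputs rather than their raw weights (which are confounded by permutation and scaling symmetries). A sketch with hypothetical names:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def functional_mismatch(user_net, reconstructed_net, in_dim, n_probes=10_000):
    """Score a candidate reconstruction by how closely it matches the user's
    behaviour on random probe inputs, rather than by comparing raw weights."""
    with torch.no_grad():
        x = torch.randn(n_probes, in_dim)
        return torch.mean((user_net(x) - reconstructed_net(x)) ** 2).item()

# Illustrative stand-ins: a toy "user" network and a candidate copy that the
# predictor might output.  In the proposal these would be the actual user
# agent and the predictor's attempted reconstruction of it.
user = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 4))
candidate = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 4))

print("mismatch (lower is better):", functional_mismatch(user, candidate, in_dim=10))
```

A behavioural check like this is still blind to differences that only show up outside the probe distribution, which is essentially the deception caveat above.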
Inverse reinforcement learning, if I understand correctly, involves a human and an AI working together. While that might be helpful, it seems unlikely that purely human-supervised learning would work as well as having the option to train the system without human supervision. Certainly that would not have been enough for AlphaGo!
I think IRL just refers to the general setup of trying to infer an agent’s goals from its actions (and possibly from communication or interaction with the agent). So you wouldn’t need to learn the human utility function purely from human feedback. Although I don’t think relying on human feedback would necessarily be a deal-breaker: it seems like most of the work of making a powerful AI comes from giving it a good general world model, capabilities, etc., and it’s okay if the data specifying human utility is relatively sparse (although still large in objective terms, perhaps many, many books long) compared to all the rest of the data the model is being trained on. In the AlphaGo example, this would be kinda like learning the goal state from direct feedback, but getting good at the game through self-play.
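For what it's worth, here is a minimal sketch of that general setup, assuming a Boltzmann-rational demonstrator whose hidden reward is linear in action features. It is a toy maximum-likelihood flavour of IRL, not a faithful implementation of any particular published algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)
D, N_ACTIONS, N_DEMOS = 5, 4, 500

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# Demonstrator: hidden reward linear in action features, actions chosen
# Boltzmann-rationally (softmax over rewards).
true_w = rng.normal(size=D)
features = rng.normal(size=(N_DEMOS, N_ACTIONS, D))     # each situation offers 4 actions
choices = np.array([rng.choice(N_ACTIONS, p=p) for p in softmax(features @ true_w)])

# "IRL" here = fit the reward weights that best explain the observed choices,
# by gradient ascent on the log-likelihood of the softmax choice model.
w_hat = np.zeros(D)
for _ in range(1000):
    p = softmax(features @ w_hat)                       # (N_DEMOS, N_ACTIONS)
    chosen = features[np.arange(N_DEMOS), choices]      # features of chosen actions
    expected = np.einsum('na,nad->nd', p, features)     # model's expected features
    w_hat += 0.2 * (chosen - expected).mean(axis=0)     # d(log-likelihood)/dw

cos = w_hat @ true_w / (np.linalg.norm(w_hat) * np.linalg.norm(true_w))
print(f"cosine similarity between true and inferred reward weights: {cos:.3f}")
```

The toy hides the parts that make IRL hard in practice: demonstrator irrationality, ambiguity between rewards that explain the same behaviour, and working from raw sensory data rather than clean features.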