Wouldn’t the agent have to learn a model of human values at some point? I thought the point of the virtual environment was to provide a place to empirically test a bunch of possible approaches to value learning. I assumed the actual deployment would consist of the predictor interacting with humans in place of the user agents (although, re-reading, I notice it doesn’t explicitly say that anywhere; I may have misunderstood).
Yes, it at least tries to learn the model used in constructing the training data (having to specify a good model is definitely an issue it shares with inverse reinforcement learning (IRL)).
“I thought the point of the virtual environment was to provide a place to empirically test a bunch of possible approaches to value learning”
An analogy might be how OpenAI trained a robot hand controller in a set of simulations with diverse physical parameters (domain randomization). It learned the general skill of operating in a wide variety of situations, and so could be used directly in the real world.
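A toy sketch of the domain-randomization idea (nothing here is OpenAI’s actual setup; the 1-D system and the gain search are invented for illustration): instead of optimizing a controller against one fixed dynamics, evaluate it against many randomly drawn dynamics, so the result works on an unseen system from the same family.

```python
import random

random.seed(0)

def rollout(a, k, steps=20):
    """Simulate x' = a*x + u with feedback u = -k*x; return total squared error."""
    x, cost = 1.0, 0.0
    for _ in range(steps):
        u = -k * x
        x = x + 0.1 * (a * x + u)  # Euler step, dt = 0.1
        cost += x * x
    return cost

def train_randomized(n_dynamics=50):
    """Domain randomization: score each candidate gain across many randomly
    drawn dynamics parameters a, not a single fixed one."""
    candidates = [0.5 + 0.25 * i for i in range(20)]
    domains = [random.uniform(0.5, 2.0) for _ in range(n_dynamics)]
    return min(candidates, key=lambda k: sum(rollout(a, k) for a in domains))

k = train_randomized()
# The learned gain should also stabilize an unseen system from the same family.
unseen_cost = rollout(1.7, k)
```

The point mirrors the analogy: the controller never saw the “real” parameter 1.7 during training, but because it was selected to work across the whole randomized family, it transfers.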
The point of the virtual environment is to train an agent with a generalized ability to learn values. Eventually it would interact with humans (or perhaps human writing, depending on what works), align to our values, and then be deployed. It should be fully aligned before trying to optimize the real world!
I see, so you essentially want to meta-learn value learning. Fair enough, although you then have the problem that your meta-learned value-learner might not generalize to the human-value-learning case.
I want to meta-learn value learning and then apply it to the object-level case of human values. Hopefully, after being tested on a variety of other agents, the predictor will be able to learn human goals as well. It’s also possible that a method could be developed to test whether the system was properly aligned; I suspect that seeing whether it could output the neural net of the user would be a good test (though potentially vulnerable to deception if something goes wrong, and hard to check against humans). But if it learns to reliably understand the connections and weights of other agents, perhaps it can learn to understand the human mind as well.
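The meta-learning proposal can be sketched with invented toy pieces (none of this is the actual scheme under discussion): synthetic “user agents,” each with a random hidden goal, behave greedily toward that goal; a predictor is tuned across many such agents and then applied to a held-out agent it never saw.

```python
import random

random.seed(1)

def user_trajectory(goal, start=0.0, step=0.5, steps=30):
    """A synthetic 'user agent' greedily walks toward its hidden goal."""
    s, traj = start, []
    for _ in range(steps):
        a = 1.0 if goal > s else -1.0
        traj.append((s, a))
        s += step * a
    return traj

def predict_goal(traj, w, step=0.5):
    """Predictor: guess the hidden goal from the last observed state and action."""
    s, a = traj[-1]
    return s + w * a * step

# Meta-train the predictor's parameter w across many agents with random goals...
train_goals = [random.uniform(0.0, 10.0) for _ in range(200)]
best_w = min(
    (i * 0.1 for i in range(21)),
    key=lambda w: sum(abs(predict_goal(user_trajectory(g), w) - g)
                      for g in train_goals),
)

# ...then apply it, object-level, to an agent it was never trained on.
held_out_goal = 7.3
guess = predict_goal(user_trajectory(held_out_goal), best_w)
```

The analogue of “deployment” is the last two lines: the predictor’s goal-inference rule, shaped only by the diverse training agents, is asked to read off the goal of a new agent from behavior alone.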
That’s an excellent analogy.