Rohin Shah comments on A Gym Gridworld Environment for the Treacherous Turn

Rohin Shah 2 Aug 2018 18:30 UTC
2 points
I’d like to register an intuition that I could come up with a (toy, unrealistic) continual learning scenario that looks like a treacherous turn with today’s ML. I might do this by restricting the policies that the agent can learn, giving it a strong inductive bias that lets it learn the environment and the supervisor’s preferences quickly and accurately, and making it model-based. It would look something like Stuart Armstrong’s toy version of the AI alignment problem, but with a learned environment model (though perhaps one learned from a very strong prior rather than a neural net).
This is just an intuition, not a strong belief, but it would be enough for me to work on this if I had the time.