To address the example you gave: doing some optimization without introducing misalignment is necessary to perform as well as the RL techniques we are discussing, so that optimization is in scope.
There may be other optimizations or heuristics that an RL agent (or an aligned human) would eventually use in order to perform well, e.g. using a certain kind of external aid. That's out of scope, because we aren't trying to compete with all of the things that an RL agent will eventually do (as you say, a powerful RL agent will eventually learn to do everything); we are trying to compete with the RL algorithm itself.
We need an aligned version of the optimization done by the RL algorithm, not all optimization that the RL agent will eventually decide to do.