Charlie Steiner comments on Reinforcement Learning Goal Misgeneralization: Can we guess what kind of goals are selected by default?

Charlie Steiner 26 Oct 2022 1:41 UTC
4 points
My suspicion is that it might be better to think about kinetics rather than energetics. That is, the order in which things get learned seems important.
So it might be interesting to mathematically investigate questions like “given small random initialization, what determines the relative gradients towards different heuristics?” I would guess there’s some literature on this already. The only thing I can think of off the top of my head is infinite-width stuff that’s not super relevant, but probably someone has made other simplifying assumptions, like treating heuristics as fixed circuits with simple effects on the loss, and seen what happens.
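To make the "fixed circuits with simple effects on the loss" idea concrete, here is a minimal sketch of one possible toy setup. Everything in it is my own assumption rather than something taken from the literature. Each heuristic i is a fixed feature that is worth using at strength c_i, the network controls it through a product of two weights u_i * v_i (both small at initialization), and the loss is 0.5 * sum_i (c_i - u_i * v_i)^2. The function name `learning_order` and all parameter values are hypothetical.

```python
def learning_order(strengths, init=1e-3, lr=0.05, steps=2000, threshold=0.5):
    """Toy kinetics: gradient descent on 0.5 * sum_i (c_i - u_i * v_i)**2.

    strengths: assumed "usefulness" c_i of each fixed heuristic.
    Returns the gradients w.r.t. u at step 0 and, for each heuristic,
    the first step at which u_i * v_i reaches threshold * c_i
    (None if it never does within `steps`).
    """
    n = len(strengths)
    u = [init] * n  # small initialization, identical for every heuristic
    v = [init] * n
    first_grads = []
    learned_at = [None] * n
    for t in range(steps):
        for i, c in enumerate(strengths):
            err = c - u[i] * v[i]
            # Both gradients are computed before either weight is updated.
            grad_u = -err * v[i]
            grad_v = -err * u[i]
            if t == 0:
                first_grads.append(grad_u)
            u[i] -= lr * grad_u
            v[i] -= lr * grad_v
            if learned_at[i] is None and u[i] * v[i] >= threshold * c:
                learned_at[i] = t
    return first_grads, learned_at


if __name__ == "__main__":
    grads, order = learning_order([2.0, 1.0, 0.5])
    print("initial gradients:", grads)
    print("step at which each heuristic is half-learned:", order)
```

In this toy, the initial gradient on each heuristic is roughly proportional to c_i times the initialization scale, and the heuristics then switch on one after another in order of c_i, even though all of them end up fully learned. That is the sense in which the endpoint alone doesn't tell you about the order. Whether anything like this carries over to real networks, where heuristics share weights and interfere with each other, is exactly the open question.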