For 1: In humans, there's a distinction between evolution-as-a-learning-algorithm and within-lifetime learning. There's some difference of opinion about which of those two slots will be occupied by the PyTorch code comprising our future AGI: the RFLO model says that this code will be doing something analogous to evolution, whereas I say it will be doing something analogous to within-lifetime learning; see my discussion here.
My impression (from their writings) is that Nate & Eliezer are firmly in the former RFLO/evolution camp. If that's your picture, then within-lifetime learning is a thing that happens inside a learned black box, and thus it's a big step removed from the gradient descent itself (imagine: the outer-loop, evolution-like gradient descent tweaks the weights, then the trained model thinks and acts and learns and grows and plans for a billion subjective seconds, then the outer-loop gradient descent tweaks the weights again, then the trained model thinks and acts and learns and grows and plans for another billion subjective seconds, and so on). On that picture, a "sharp left turn" could happen between gradient-descent steps, for example.
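To make that picture concrete, here is a minimal toy sketch (my own illustration, not anything from the RFLO paper) of what "evolution in the outer loop" looks like in code: the outer loop's gradient descent plays the role of evolution, while any within-lifetime learning has to happen implicitly inside the black box, here in an RNN's hidden state over one long inner episode. The `LearnedLearner` class, the fake environment, and the stand-in "score" are all made up for illustration.

```python
import torch
import torch.nn as nn

class LearnedLearner(nn.Module):
    """Black-box agent whose hidden state carries within-lifetime learning."""
    def __init__(self, obs_dim=8, act_dim=4, hidden=64):
        super().__init__()
        self.rnn = nn.GRUCell(obs_dim, hidden)
        self.policy = nn.Linear(hidden, act_dim)

    def forward(self, obs, h):
        h = self.rnn(obs, h)               # "lifetime" memory/learning lives here
        return self.policy(h), h

agent = LearnedLearner()
outer_opt = torch.optim.SGD(agent.parameters(), lr=1e-3)   # plays the role of evolution

for outer_step in range(3):                # outer-loop gradient-descent steps
    h = torch.zeros(1, 64)
    obs = torch.randn(1, 8)                # stand-in for environment observations
    total_score = torch.tensor(0.0)
    for t in range(200):                   # one long "lifetime" between weight updates
        logits, h = agent(obs, h)
        total_score = total_score + logits.max()   # stand-in for "performance"
        obs = torch.randn(1, 8)            # stand-in for environment dynamics
    # Only now does the outer loop get a gradient signal; everything that
    # happened inside the lifetime (including any "sharp left turn") fell
    # in between gradient-descent steps.
    loss = -total_score
    outer_opt.zero_grad()
    loss.backward()
    outer_opt.step()
```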
In my model, the human-written AGI PyTorch code is instead analogous to within-lifetime learning in humans, and it looks kinda like actor-critic model-based RL. There's still some gradient descent, but the loss function is not directly "performance"; instead it's things like self-supervised learning, and there are also non-gradient-descent things like TD learning. "Sharp left turns" don't show up in my picture, at least not the same way. Or I guess, maybe instead of just one "sharp left turn", the training process would have millions of "sharp left turns" as it keeps learning new things about the world (e.g. learning object permanence, learning that it's an AGI running on a computer, learning physics, etc.), and each of these is almost guaranteed to help capabilities but can potentially screw up alignment.
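And here is the corresponding toy sketch of my picture (again my own illustration, with made-up dimensions, a random stand-in environment, and toy update rules): the only gradient descent is on a self-supervised next-observation prediction loss for a learned world model, while the critic is updated by a direct TD delta rule rather than by backprop on a "performance" loss.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 4

# World model trained by gradient descent on a self-supervised loss
# (predict the next observation), not on task performance.
world_model = nn.Sequential(nn.Linear(obs_dim + act_dim, 64),
                            nn.ReLU(), nn.Linear(64, obs_dim))
model_opt = torch.optim.Adam(world_model.parameters(), lr=1e-3)

critic_w = torch.zeros(obs_dim)            # linear value function, updated by TD
gamma, td_lr = 0.9, 0.1

obs = torch.randn(obs_dim)
for t in range(1000):                      # the agent's single ongoing "lifetime"
    action = torch.zeros(act_dim)
    action[torch.randint(act_dim, (1,))] = 1.0    # stand-in for the actor

    next_obs = torch.randn(obs_dim)        # stand-in for environment dynamics
    reward = next_obs.sum().item()         # stand-in for a reward signal

    # Gradient descent, but on self-supervised prediction error.
    pred = world_model(torch.cat([obs, action]))
    ss_loss = (pred - next_obs).pow(2).mean()
    model_opt.zero_grad()
    ss_loss.backward()
    model_opt.step()

    # TD update for the critic: a direct delta rule, no optimizer involved.
    with torch.no_grad():
        v, v_next = critic_w @ obs, critic_w @ next_obs
        td_error = reward + gamma * v_next - v
        critic_w += td_lr * td_error * obs

    obs = next_obs
```

In this picture there is no outer loop that steps back and tweaks the whole agent between long lifetimes; the learning updates are interleaved with the agent's ongoing thinking, which is why I'd expect many small "left turns" rather than one sharp one.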