I think this is very important, probably roughly the way to go for top-level alignment strategies, and we should start hammering out its mechanistic details as soon as it’s at all feasible.
Do you already have any ideas for experimentally verifying parts of this, and refining/formalising it further?
For example, do you think we could look at current RL models and trace how a pattern of behaviour that was reinforced in early training led to things connected to that behaviour becoming the system’s target even in later stages of training, when the model should theoretically be capable enough to get more reward by trying something else?
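To make that concrete, here is a minimal toy I’d start with (my own construction, not something from the post): a softmax policy trained with REINFORCE on a two-armed bandit, where only arm 0 pays off in an early phase and arm 1 pays strictly more afterwards. Measuring how often the policy still pulls arm 0 late in training gives a crude trace of whether an early-reinforced behaviour persists once the reward structure changes; the phase length and learning rate are arbitrary knobs to vary.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(arm: int, step: int, switch_at: int = 5000) -> float:
    """Early phase: only arm 0 pays. Late phase: arm 1 pays strictly more."""
    if step < switch_at:
        return 1.0 if arm == 0 else 0.0
    return 0.5 if arm == 0 else 1.0

def run(lr: float = 0.1, steps: int = 10000) -> np.ndarray:
    logits = np.zeros(2)                    # softmax policy over the two arms
    pulls = np.zeros(steps, dtype=int)
    for t in range(steps):
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        arm = rng.choice(2, p=probs)
        r = reward(arm, t)
        # REINFORCE without a baseline: reward reinforces whatever was done,
        # so even the now-suboptimal arm keeps getting entrenched when chosen.
        grad = -probs
        grad[arm] += 1.0
        logits += lr * r * grad
        pulls[t] = arm
    return pulls

pulls = run()
late = pulls[5000:]
print("fraction of late-phase pulls on the early-reinforced arm:",
      float(np.mean(late == 0)))
```

Obviously this is far below the capability regime the question is about, but it gives a quantitative handle: you can vary the length of the early phase, the learning rate, or add a baseline, and see under which conditions the early-reinforced behaviour persists versus washes out.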
Can we start nailing down a bit more what desires a given reward signal and training data set are likely to produce? “Something that leads to reward in early training if optimised for, or something that leads to reward if optimised for and that could be stumbled on through behaviours likely to be learned in early training” is still very vague and incomplete. I don’t think our general understanding of DL selection dynamics is yet at the point where we can expect to work this out properly, but maybe a bit more specificity in our qualitative guesses is possible?
Can we set some concrete conditions on the outer optimisation algorithm that must hold for this dynamic to occur, or that would strengthen or weaken it? Locality and path dependence seem important, but can we get more of an idea of how they shake out quantitatively? Are there particular features we should be on the lookout for in hypothetical future replacements of GD/Adam, because they’d make desires more unstable, or change the rules for how to select for a particular desire we want?
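For path dependence specifically, one crude probe that is already runnable today (a sketch assuming PyTorch; the network, data, and “disagreement” metric are placeholders I’m making up) would be to train several copies of the same small model from a shared initialisation, changing only the order in which the data is presented, measure how much the resulting models disagree on held-out inputs, and then repeat with the optimiser swapped out:

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy binary classification data; the first five features determine the label.
X = torch.randn(512, 10)
y = (X[:, :5].sum(dim=1) > 0).float()
X_test = torch.randn(512, 10)

def make_net() -> nn.Module:
    return nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

base = make_net()  # shared initialisation for all runs

def train(order_seed: int, opt_name: str = "sgd") -> nn.Module:
    """Train a copy of the base net; only the data ordering differs across seeds."""
    net = copy.deepcopy(base)
    opt = (torch.optim.SGD(net.parameters(), lr=0.1) if opt_name == "sgd"
           else torch.optim.Adam(net.parameters(), lr=1e-3))
    loss_fn = nn.BCEWithLogitsLoss()
    g = torch.Generator().manual_seed(order_seed)
    for _ in range(20):  # epochs
        for idx in torch.randperm(len(X), generator=g).split(32):
            opt.zero_grad()
            loss = loss_fn(net(X[idx]).squeeze(-1), y[idx])
            loss.backward()
            opt.step()
    return net

def disagreement(a: nn.Module, b: nn.Module) -> float:
    """Fraction of held-out points on which the two models' predictions differ."""
    with torch.no_grad():
        pa = a(X_test).squeeze(-1) > 0
        pb = b(X_test).squeeze(-1) > 0
    return (pa != pb).float().mean().item()

for opt_name in ("sgd", "adam"):
    nets = [train(order_seed=s, opt_name=opt_name) for s in range(3)]
    d = [disagreement(nets[i], nets[j])
         for i in range(3) for j in range(i + 1, 3)]
    print(opt_name, "mean pairwise disagreement:", sum(d) / len(d))
```

Disagreement under data reordering is obviously not the same thing as desires being path-dependent, but tracking a number like this across optimisers (SGD vs Adam vs whatever comes next) might be one way to start making the locality question quantitative.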
It seems like we want our systems not to get stuck in local optima when it comes to capabilities, and to design our training processes accordingly: there should be smooth transitions in the loss landscape from trying to solve problems in one, inefficient way to trying to solve them in another, more efficient way. But when it comes to desires, we instead want our systems to “get stuck”, so they don’t go off becoming reward-maximisers, or start wanting other counter-intuitive things we didn’t prepare for. How do you make a training setup that does both of these things simultaneously?