My view is that you conduct better research when you have a fairly detailed, tentative path to impact in mind—otherwise, it’s too easy to accidentally sweep a core problem under the rug by not really looking closely at that section of your plan until you get there.
With that in mind, I personally have been convinced of all four of the approaches to alignment above. GEM lets us open the black box, both hugely informing shard theory and giving us a fine-grained lever to wield in training. Models trained i.i.d. (as opposed to models trained such that their outputs strongly influence their future inputs) won’t be as strongly incentivized to be coherent in the respects we care about, and so will stay considerably less dangerous for longer. Instilling a single stopgap target value means jumping only one impossible hurdle instead of many, and so we should do that if a stopgap-aligned model will help us bootstrap to more thoroughly aligned models.
I’m happy to be shown to be wrong on any of these claims, though, and would then appropriately update my working alignment scheme.
In terms of detailed plans: What about, for example, figuring out enough details of shard theory to make preregistered predictions about the test-time behaviors and internal circuits you will find in an agent after training in a novel toy environment, based on attributes of the training trajectories? Success at that would represent a real win within the field, with a lot of potential payoff further downstream.
Re: the rest, even if all four of the approaches you listed are individually promising (and I’m inclined to agree with you that they are), the conjunction of them might be much less likely to work out. I personally consider them separate bets that can stand or fall on their own, and hope that if multiple pan out, their benefits will stack.