There would seem to be many possible alignment schemes that would be unlocked if one had “master[ed] the theory of value formation in trained intelligences”, which is what I understand the primary goal of shard theory to be. My understanding is that we’re still working out a lot of the details of that theory, and that they need to be proved out in basic RL setups with small models.
If that’s the case, is there a reason to be singling out this particular ML technique (chain-of-thought prompting on LMs) + training goal (single value formation + punting the rest to alignment researchers) as “the shard theory alignment scheme” this early in the game?
My view is that you conduct better research when you have a fairly detailed, tentative path to impact in mind—otherwise, it's too easy to accidentally sweep a core problem under the rug by not looking closely at that section of your plan until you actually get there.
With that in mind, I personally have been convinced of all four of the above approaches to alignment. GEM lets us open the black box, both hugely informing shard theory and giving us a fine-grained lever to wield in training. Models trained i.i.d. (as opposed to models trained such that their outputs strongly influence their future inputs) won't be as strongly incentivized to be coherent in the respects we care about, and so will stay considerably less dangerous for longer. Instilling a single stopgap target value means jumping one impossible hurdle instead of many impossible hurdles, and so we should do that if a stopgap aligned model will help us bootstrap to more thoroughly aligned models.
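To make the i.i.d.-vs-feedback distinction concrete, here's a toy sketch (my own illustration, not anyone's actual training setup): in the i.i.d. loop each batch is drawn independently of the model's behavior, while in the feedback loop the model's output at one step becomes part of its input at the next, which is the kind of coupling that rewards cross-step coherence.

```python
import random

def train_iid(model_update, dataset, steps):
    """I.i.d. training: each batch is sampled independently of the
    model's own past outputs, so there is no feedback loop."""
    for _ in range(steps):
        batch = random.choice(dataset)  # independent of model behavior
        model_update(batch)

def train_with_feedback(model_update, model_act, initial_input, steps):
    """Non-i.i.d. training: the model's output at each step becomes
    part of its next input, so earlier behavior shapes later data."""
    x = initial_input
    for _ in range(steps):
        y = model_act(x)          # the model acts on its current input
        model_update((x, y))      # training data now depends on the model
        x = y                     # ...and the output feeds the next step
```

The `model_update` and `model_act` callables here are hypothetical placeholders; the only point is the data-flow difference between the two loops.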
I’m happy to be shown to be wrong on any of these claims, though, and would then appropriately update my working alignment scheme.
In terms of detailed plans: What about, for example, figuring out enough details about shard theory to make preregistered predictions about the test-time behaviors and internal circuits you will find in an agent after training in a novel toy environment, based on attributes of the training trajectories? Success at that would represent a real win within the field, with a lot of potential work downstream of it.
Re: the rest, even if all of those 4 approaches you listed are individually promising (which I’m inclined to agree with you on), the conjunction of them might be much less likely to work out. I personally consider them as separate bets that can stand or fall on their own, and hope that if multiple pan out then their benefits may stack.