Or a model could directly reason about which new values would best systematize its current values, with the intention of having its conclusions distilled into its weights; this would be an example of gradient hacking.
Quick clarifying question: figuring out in which direction in weight space an update should be applied in order to modify a neural net's values seems like it would require a super strong understanding of mechanistic interpretability, something far past current human levels. Is this an underlying assumption for a model that is able to direct how its values will be systematised?
The ability to do so in general probably requires a super strong understanding. The ability to do so in specific limited cases probably doesn’t. For example, suppose I decide to think about strawberries all day every day. It seems reasonable to infer that, after some period of doing this, my values will end up somehow more strawberry-related than they used to be. That’s roughly analogous to what I’m suggesting in the section you quote.
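To make the analogy a bit more concrete, here is a minimal toy sketch (my own illustration, not anything from the original post; the softmax "preference" model, topic names, and hyperparameters are all hypothetical). The point it illustrates is just that an agent which cannot read or edit its own weights, but which can bias which topics dominate the experiences its gradient updates are computed from, still ends up with preferences tilted toward the topic it dwells on.

```python
import numpy as np

rng = np.random.default_rng(0)

TOPICS = ["strawberries", "weather", "chess"]
n_topics = len(TOPICS)

# Toy "value" model: a softmax over topics, parameterized by logits w.
# w stands in for the network weights that encode what the agent cares about.
w = np.zeros(n_topics)

def preference(w):
    e = np.exp(w - w.max())
    return e / e.sum()

# The agent never inspects or edits w directly (no interpretability needed).
# It only chooses what it "thinks about", i.e. which topic dominates the
# batch that the next gradient update is computed from.
def training_batch(focus_idx, size=32, focus_prob=0.9):
    """Sample topic indices, heavily biased toward the chosen focus topic."""
    probs = np.full(n_topics, (1 - focus_prob) / (n_topics - 1))
    probs[focus_idx] = focus_prob
    return rng.choice(n_topics, size=size, p=probs)

def grad_step(w, batch, lr=0.1):
    """One cross-entropy gradient step toward the batch's topic distribution."""
    target = np.bincount(batch, minlength=n_topics) / len(batch)
    return w + lr * (target - preference(w))

print("before:", dict(zip(TOPICS, preference(w).round(2).tolist())))

# "Think about strawberries all day every day" for a while.
for _ in range(200):
    w = grad_step(w, training_batch(focus_idx=0))

print("after: ", dict(zip(TOPICS, preference(w).round(2).tolist())))
# The learned preferences drift toward strawberries, even though the agent
# never reasoned about which direction in weight space the update points.
```

The design choice doing the work here is that the agent's only lever is the training distribution, not the weights: that is the "specific limited case" in which steering one's own value systematization plausibly doesn't require strong mechanistic interpretability.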