Or a model could directly reason about which new values would best systematize its current values, with the intention of having its conclusions distilled into its weights; this would be an example of gradient hacking.
Quick clarifying question: figuring out in which direction in weight space an update should be applied in order to modify a neural net's values seems like it would require a super strong understanding of mechanistic interpretability, something far past current human levels. Is this an underlying assumption for a model that is able to direct how its values will be systematised?
The ability to do so in general probably requires a super strong understanding. The ability to do so in specific limited cases probably doesn’t. For example, suppose I decide to think about strawberries all day every day. It seems reasonable to infer that, after some period of doing this, my values will end up somehow more strawberry-related than they used to be. That’s roughly analogous to what I’m suggesting in the section you quote.
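To make the analogy a bit more concrete, here is a minimal toy sketch (my own illustration, not anything from the original post; the softmax "preference" model, topic names, and hyperparameters are all hypothetical). The point it illustrates is just that an agent which cannot read or edit its own weights, but which can bias which topics dominate the experiences its gradient updates are computed from, still ends up with preferences tilted toward the topic it dwells on.

```python
import numpy as np

rng = np.random.default_rng(0)

TOPICS = ["strawberries", "weather", "chess"]
n_topics = len(TOPICS)

# Toy "value" model: a softmax over topics, parameterized by logits w.
# w stands in for the network weights that encode what the agent cares about.
w = np.zeros(n_topics)

def preference(w):
    e = np.exp(w - w.max())
    return e / e.sum()

# The agent never inspects or edits w directly (no interpretability needed).
# It only chooses what it "thinks about", i.e. which topic dominates the
# batch that the next gradient update is computed from.
def training_batch(focus_idx, size=32, focus_prob=0.9):
    """Sample topic indices, heavily biased toward the chosen focus topic."""
    probs = np.full(n_topics, (1 - focus_prob) / (n_topics - 1))
    probs[focus_idx] = focus_prob
    return rng.choice(n_topics, size=size, p=probs)

def grad_step(w, batch, lr=0.1):
    """One cross-entropy gradient step toward the batch's topic distribution."""
    target = np.bincount(batch, minlength=n_topics) / len(batch)
    return w + lr * (target - preference(w))

print("before:", dict(zip(TOPICS, preference(w).round(2).tolist())))

# "Think about strawberries all day every day" for a while.
for _ in range(200):
    w = grad_step(w, training_batch(focus_idx=0))

print("after: ", dict(zip(TOPICS, preference(w).round(2).tolist())))
# The learned preferences drift toward strawberries, even though the agent
# never reasoned about which direction in weight space the update points.
```

The design choice doing the work here is that the agent's only lever is the training distribution, not the weights: that is the "specific limited case" in which steering one's own value systematization plausibly doesn't require strong mechanistic interpretability.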