Thanks for the question!
Yeah, the story is something like: structuring model internals gives us more control over how models generalize from limited supervision. For example, maybe we can factor apart how a model represents humans from how it represents math concepts, then localize RLHF updates on math research to the math-concept region. This kind of learning update would plausibly reduce the extent to which the model learns (or learns to exploit) human biases, increasing the odds that it generalizes in an intended way from misspecified feedback.
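To make the "localize updates" idea concrete, here's a minimal sketch, assuming (hypothetically) that interpretability work has already handed us a set of parameters corresponding to the math-concept region; the toy model, the choice of region, and the reward-weighted loss are all placeholders rather than anything from an existing method:

```python
import torch
import torch.nn as nn

# Toy stand-in for a model: pretend interpretability work has told us that
# the layer at index 2 carries "math concept" features and the layer at
# index 0 carries "human modeling" features. (Entirely hypothetical labels.)
model = nn.Sequential(
    nn.Linear(16, 32),   # index 0: "human modeling" region, left frozen
    nn.ReLU(),
    nn.Linear(32, 16),   # index 2: "math concept" region, the only part updated
)

# Parameters that RLHF-on-math-research updates are allowed to touch.
math_region = [p for name, p in model.named_parameters() if name.startswith("2.")]

# Freeze everything else, so feedback on math can't reshape how the model
# represents humans (the hoped-for effect of localizing the update).
for name, p in model.named_parameters():
    if not name.startswith("2."):
        p.requires_grad_(False)

optimizer = torch.optim.Adam(math_region, lr=1e-3)

# Stand-in for one reward-weighted RLHF-style update on a batch of feedback.
x = torch.randn(8, 16)
rewards = torch.randn(8)
loss = -(rewards * model(x).sum(dim=-1)).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In practice you'd presumably localize to feature directions or adapters rather than whole layers, but the basic move (restricting where the gradient is allowed to flow) is the same.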
Another angle is: if we create models with selective incapacities (e.g. lack of situational awareness), the models might lack the concepts required to misgeneralize from our feedback. For example, suppose a situationally unaware model explores a trajectory in which it subversively manipulates its environment and receives higher-than-average reward; the model will then be updated towards that behavior. However, since the model lacks the concepts required to internalize the behavioral tendency "gain control over my environment," it won't learn that tendency. Instead, the trajectory might simply serve as noise.
Under “improving reward,” another option that seems promising is using models to iteratively improve your training data, like in Iterative Label Refinement.
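To gesture at the shape of that loop (a hedged sketch of my reading, not the paper's exact procedure; `train_sft`, `propose_label`, and `compare` are hypothetical stand-ins for your SFT routine, the model's proposed replacement label, and the unreliable comparison feedback):

```python
def iterative_label_refinement(dataset, train_sft, propose_label, compare, n_rounds=3):
    """Sketch of an iterative-label-refinement loop: train on the current labels,
    let the trained model propose replacement labels, and keep whichever label a
    (possibly unreliable) comparator prefers. Repeat for a few rounds."""
    labels = dict(dataset)  # prompt -> current label
    for _ in range(n_rounds):
        model = train_sft(list(labels.items()))        # SFT on the current labels
        for prompt, old_label in list(labels.items()):
            proposed = propose_label(model, prompt)    # model-written alternative
            if compare(prompt, proposed, old_label):   # noisy preference judgment
                labels[prompt] = proposed              # refine the dataset
    return list(labels.items())
```

The appeal, as I read it, is that the unreliable feedback only gets to swap out labels rather than directly shape the policy objective, which seems harder to exploit.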