Why expect goals to be somehow localized inside RL models? Well, fine-tuning only changes a small & localized part of an LLM, and goal locality was also found when interpreting a maze solver trained from scratch. Certainly the goal must be interpreted in the context of the rest of the model, but based on these results, plus last year's unpublished results from applying ROME to the values of open-source LLMs, I'm confident (though not certain) in this inference.
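To make "localized" concrete, here is a toy sketch (PyTorch, purely illustrative, not the unpublished experiment mentioned above) of the kind of edit ROME performs: a rank-one update to a single layer's weight matrix, with every other parameter left untouched. The dimensions and tensors are made up for illustration.

```python
import torch

# Stand-in for one MLP projection matrix in a single transformer layer.
d_hidden, d_mlp = 1024, 4096
W = torch.randn(d_mlp, d_hidden)

# ROME-style rank-one edit: W' = W + u v^T, where v selects when the edit
# fires ("key" direction) and u is what gets written ("value" direction).
u = torch.randn(d_mlp, 1)
v = torch.randn(1, d_hidden)
W_edited = W + u @ v

# The change is confined to one matrix and has rank 1.
print(torch.linalg.matrix_rank(W_edited - W))  # -> tensor(1)
```

If something like a goal or value can be moved by an edit this small and this local, that is at least weak evidence the thing being edited was stored in a localized way to begin with.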