Why expect goals to be somehow localized inside RL models? Well, fine-tuning only changes a small & localized part of an LLM, and goal locality was also found when interpreting a maze solver trained from scratch. Certainly the goal must be interpreted in the context of the rest of the model, but based on these results, plus last year's unpublished results from applying ROME to the values of open-source LLMs, I'm confident (though not certain) in this inference.
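To make "localized" concrete, here is a toy sketch (PyTorch, purely illustrative, not the unpublished experiment mentioned above) of the kind of edit ROME performs: a rank-one update to a single layer's weight matrix, with every other parameter left untouched. The dimensions and tensors are made up for illustration.

```python
import torch

# Stand-in for one MLP projection matrix in a single transformer layer.
d_hidden, d_mlp = 1024, 4096
W = torch.randn(d_mlp, d_hidden)

# ROME-style rank-one edit: W' = W + u v^T, where v selects when the edit
# fires ("key" direction) and u is what gets written ("value" direction).
u = torch.randn(d_mlp, 1)
v = torch.randn(1, d_hidden)
W_edited = W + u @ v

# The change is confined to one matrix and has rank 1.
print(torch.linalg.matrix_rank(W_edited - W))  # -> tensor(1)
```

If something like a goal or value can be moved by an edit this small and this local, that is at least weak evidence the thing being edited was stored in a localized way to begin with.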