I am eager to see how these topics connect in the end. So far this reads like the first few chapters of a book, where you get the backstories of characters who have yet to meet.
On the interpretability side, I’m curious: how do you do causal mediation analysis on anything resembling “values”? The ROME paper's framework shows where in the computation graph the model recalls “properties of an object”, but it’s a long way from that to editing reward proxies out of the model.