How can we combine behavioural experiments with mechanistic interpretability to infer an agent’s subjective causal model? The next post will say more about this.
There is no next post. Can I read about it somewhere anyway?
Sorry, this post got stuck on the backburner for a while. But the content will largely be drawn from "Robust Agents Learn Causal World Models".