How can we combine behavioural experiments with mechanistic interpretability to infer an agent’s subjective causal model? The next post will say more about this.
There is no next post. Can I read about it somewhere anyway?
Sorry, this post got stuck on the backburner for a while. But the content will largely be drawn from "Robust Agents Learn Causal World Models".