A slightly sideways argument for interpretability: It’s a really good way to introduce the importance and tractability of alignment research
In my experience it’s very easy to explain to someone with no technical background that:
Image classifiers have gotten much, much better (like in 10 years they went from being basically impossible to something you can run on your laptop)
We actually don’t really understand why they do what they do (like we don’t know why the classifier says this is an image of a cat, even when it’s right)
But, thanks to dedicated research, we have begun to understand a bit of what’s going on in the black box (like we know it has learned what a curve is, and we can tell when it thinks it sees one)
Then you say ‘this is the same thing that big companies are using to maximise your engagement on social media and sell you stuff, and look at how that’s going. And by the way, did you notice how AIs keep getting bigger and stronger?’
At this point, in my experience, it’s very easy for people to understand why alignment matters, and also what kind of thing you can actually do about it.
Compare this to trying to explain why people are worried about mesa-optimisers, boxed oracles, or even the ELK problem: it’s all a lot less concrete. People seem to approach it much more like a thought experiment and less like an ongoing problem, and it’s harder to grasp why ‘developing better regularisers’ might be a meaningful goal.
But interpretability gives people a non-technical story for how alignment affects their lives, the scale of the problem, and how progress can be made. IMO no other approach to alignment is anywhere near as good for this.
Thanks! Yeah this isn’t in the paper, it’s just a thing I’m fairly sure of which probably deserves a more thorough treatment elsewhere. In the meantime, some rough intuitions would be:
delusions are a result of causal confounders, which must be hidden upstream variables
if you actually simulate, and therefore specify, an entire Markov blanket, it will screen off all other upstream variables, including all possible confounders
this is ludicrously difficult for agents with a long history (like a human), but if the STF story is correct, it’s sufficient, and crucially, you don’t even need to know the full causal structure of reality, just a complete Markov blanket
any holes in the Markov blanket/boundary represent ways for unintended causal pathways to leak through, separating the predictor’s predictions about the effect of an action from the actual causal effect of the action and making the agent appear ‘delusional’ (see the toy sketch below)
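To make the confounder/screening-off intuition concrete, here’s a minimal toy sketch (my own illustration, not anything from the paper, and the whole setup, U, A, Y, the 0.9, is made up): a hidden state U drives both a demonstrator’s action A and the outcome Y, so a predictor conditioning on A alone is confounded, while also conditioning on U (i.e. closing that hole in the blanket) matches the effect of actually intervening on A.

```python
# Toy illustration (mine, not from the paper): a hidden confounder U drives both
# the demonstrator's action A and the outcome Y. Conditioning on A alone gives a
# "delusional" prediction; also conditioning on U (completing the Markov blanket
# around the action) screens the confounder off and matches the true effect of
# intervening on A.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

u = rng.integers(0, 2, size=n)          # hidden upstream state (the confounder)
knows_u = rng.random(n) < 0.9           # the demonstrator usually observes U...
a = np.where(knows_u, u, 1 - u)         # ...and picks the action that matches it
y = (a == u).astype(int)                # the action only succeeds if it matches U

# Purely predictive estimate: P(Y=1 | A=1) in the observational data (~0.9).
p_obs = y[a == 1].mean()

# True causal effect: set A=1 regardless of U, i.e. P(Y=1 | do(A=1)) (~0.5).
p_do = (u == 1).mean()

# Blanket-adjusted estimate: condition on U as well, then average over P(U)
# (back-door adjustment). This recovers ~0.5 from observational data alone.
p_adj = sum((u == v).mean() * y[(a == 1) & (u == v)].mean() for v in (0, 1))

print(f"P(Y=1 | A=1)      ~ {p_obs:.2f}  (confounded prediction)")
print(f"P(Y=1 | do(A=1))  ~ {p_do:.2f}  (actual effect of acting)")
print(f"adjusted estimate ~ {p_adj:.2f}  (confounder screened off)")
```

The gap between the first two numbers is the ‘delusion’; the third shows that you don’t need the full causal structure of the world, just enough of the blanket to block the path through U.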
I hope we’ll have a proper writeup soon; in the meantime let me know if this doesn’t make sense.