I was once discussing Aumann’s agreement theorem with a rationalist who wanted to use it to reduce disagreements about AI safety. We were thinking about why it doesn’t work in that context, even though (I would argue) Aumann’s agreement theorem applies robustly in lots of other contexts.
I think I can find out (or maybe even already know) why it doesn’t apply in the context of AI safety, but to test it I would benefit from having a bunch of cases of disagreements about AI safety to investigate.
So if you have a disagreement with anyone about AI safety, feel encouraged to post it here.
Some say mechanistic interpretability is really unlikely to bear any fruit, for one or both of the following reasons:
1. Neural networks are fundamentally impossible (or just very, very hard) for humans to understand, like the systems studied in neuroscience or economics. Fully understanding complex systems is not something humans can do.
2. Neural networks are not doing anything you would want to understand: mostly shallow pattern matching and absurd statistical correlations, and it seems impossible to explain how that pattern matching occurs or to tease apart what the root causes of those correlations are.
I think 1 is wrong, because mechanistic interpretability seems to have very fast feedback loops, and we are able to run a shit-ton of experiments. Humans are empirically great at understanding even the most complex of systems if they’re able to run a shit-ton of experiments.
For 2, I think the claim that neural networks rely on shallow pattern matching & absurd statistical correlations is true, and may continue to be true for really scary systems, but I’m still optimistic we’ll be able to understand why they use the correlations they do. We have access to the causal system which produced the network (the gradient descent process), and it doesn’t seem too far a step to go from understanding raw networks to tracing the parts you don’t quite understand, or suspect of shallow pattern matching, back to the gradients which built them in and the datapoints which produced those gradients.
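To make the “trace it back to the gradients and the data” idea a bit more concrete, here’s a minimal sketch, assuming PyTorch. `ToyNet`, the toy data, and the crude dot-product score are all made up for illustration; they stand in for real attribution methods like influence functions or TracIn.

```python
# A minimal sketch of the "trace it back to the data" idea: for one weight you
# don't understand, score each training example by how strongly its gradient
# pushed that weight toward where it ended up. ToyNet and the toy data are
# hypothetical stand-ins, not a real training setup.
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 8)
        self.fc2 = nn.Linear(8, 2)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = ToyNet()
loss_fn = nn.CrossEntropyLoss()

# Toy "training set" standing in for the real data that built the network.
xs = torch.randn(32, 4)
ys = torch.randint(0, 2, (32,))

# Suppose fc1's weights are the part of the network we don't quite understand.
target_param = model.fc1.weight
direction = target_param.detach().clone()  # where the weight ended up pointing

scores = []
for x, y in zip(xs, ys):
    model.zero_grad()
    loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
    loss.backward()
    # A datapoint whose negative gradient aligns with the final weight value
    # is (crudely) one that pushed the weight toward its current direction.
    scores.append(-(target_param.grad * direction).sum().item())

# Toy datapoints most "responsible" for this weight, under this crude score.
top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:5]
print("most influential toy datapoints:", top)
```

Real attribution methods are much more careful than this single dot product (summing contributions over training checkpoints, using curvature information, etc.), but the shape of the loop, going from a piece of the network back to the datapoints that pushed it there, is the step I’m gesturing at.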
The most common concern is some agentic AI going rogue and taking over. I think this is wrong, for two reasons:
1. Going rogue and taking over is a human behavior we’ve been dealing with for ages; it may be the one danger our species is best at fighting.
2. There’s a much more pressing issue: the rise of Moloch, i.e. AI tools are already disrupting our ability to coordinate, and even our desire to do so.