How this perspective could reduce the probability of catastrophes
I want to emphasize that I think the general research direction is good and will be useful, and I want more people to work on it (it makes the first, second, fifth and sixth bullet points above more effective); I only disagree with the story you’ve presented for how it reduces x-risk.
To be clear: the way I imagine this research direction working is that somebody comes up with a theory of how to build aligned AI, roughly does that, and then uses some kind of transparency to check whether or not they succeeded. A big part of the attraction for me is that it doesn’t really depend on exactly how aligned AI gets built, as long as it’s built using methods roughly similar to modern neural network training. That being said, if it’s as hard as you think it will be, I don’t understand how it could usefully contribute to the dot points you mention.
Taking each of the bullet points I mentioned in turn:
Regulations / laws to not build powerful AI
You could imagine a law “we will not build AI systems that use >X amount of compute unless they are mechanistically transparent”. Then research on mechanistic transparency reduces the cost of such a law, making it more palatable to implement.
Increasing AI researcher paranoia, so all AI researchers are very careful with powerful AI systems
The most obvious way to do this is to demonstrate that powerful AI systems are dangerous. One very compelling demonstration would be to train an AI system that we expect to be deceptive (that isn’t powerful enough to take over), make it mechanistically transparent, and show that it is deceptive.
Here, the mechanistic transparency would make the demonstration much more compelling, relative to a demonstration that only shows deceptive behavior and could be dismissed as a weird bug in that particular scenario.
Safety benchmarks (set of tests looking for common problems, updated as we encounter new problems) (“all the potentially dangerous AI systems we could have built failed one of the benchmark tests”)
Mechanistic transparency opens up the possibility of safety tests of the form “train an AI system on this environment, and then use mechanistic transparency to check whether it has learned <prohibited cognition>”. (You could imagine that the environment is small, or the models trained are small, and that’s why the cost of mechanistic transparency isn’t prohibitive.) A toy sketch of such a test appears after this list.
Any of the AI alignment methods, e.g. value learning or iterated amplification (“we don’t build dangerous AI systems because we build aligned AI systems instead”)
Informed oversight can be solved via universality or interpretability; worst-case optimization currently relies on “magic” interpretability techniques. Even if full mechanistic transparency is too hard to achieve, I would expect that insights gained along the way would be helpful. For example, perhaps in adversarial training, if the adversary shares weights with the agent, the adversary already “knows” what the agent is “thinking”, and might only need to use mechanistic transparency on the final layer to understand what that part is doing (sketched below).
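To make the weight-sharing idea concrete, here is a minimal sketch in PyTorch. Everything in it is illustrative rather than taken from the discussion above: the module names, the sizes, and the choice to have the adversary emit a scalar “failure score” from the shared final-layer representation are placeholders of mine, not an existing method or API.

```python
import torch
import torch.nn as nn

class SharedTrunkAgentAdversary(nn.Module):
    """Toy sketch: the agent and the adversary share a trunk, so the
    adversary computes with the same intermediate representation the
    agent acts on; only the agent's final head would still need to be
    inspected with transparency tools."""

    def __init__(self, obs_dim: int = 16, hidden_dim: int = 64, n_actions: int = 4):
        super().__init__()
        # Shared trunk: by construction, the adversary "sees" the same
        # computation the agent performs up to the last layer.
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.agent_head = nn.Linear(hidden_dim, n_actions)
        # Hypothetical adversary head: maps the shared final-layer
        # activations to a scalar "how likely is failure from here" score.
        self.adversary_head = nn.Linear(hidden_dim, 1)

    def forward(self, obs: torch.Tensor):
        h = self.trunk(obs)                      # shared computation
        action_logits = self.agent_head(h)       # agent's policy output
        failure_score = self.adversary_head(h)   # adversary's judgment
        return action_logits, failure_score

# Example forward pass on a batch of random observations.
obs = torch.randn(8, 16)
logits, score = SharedTrunkAgentAdversary()(obs)
```

The design point is simply that the adversary consumes the same representation the agent uses, so the only part it has to understand “from the outside” is the agent’s final head.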
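Returning to the safety-benchmark idea in the third bullet point: the following is a deliberately toy sketch in which a linear model is trained on a synthetic task and “mechanistic transparency” is stood in for by simply reading off the learned weights to check for reliance on a prohibited feature. The task, the prohibited feature, and the threshold are all invented for illustration; transparency for realistic models would of course be far harder than inspecting three weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "environment": three features, where feature 2 stands in for
# information the model is prohibited from relying on. It is deliberately
# predictive, so an unconstrained learner will pick it up.
n = 2000
X = rng.normal(size=(n, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + 2.0 * X[:, 2] > 0).astype(float)

# Train a small logistic-regression "agent" by gradient descent.
w = np.zeros(3)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w -= 0.1 * (X.T @ (p - y)) / n

# "Transparency check": inspect the learned mechanism (here, just the
# weights) for reliance on the prohibited feature.
PROHIBITED_FEATURE = 2
THRESHOLD = 0.1  # arbitrary for this toy example
passes = abs(w[PROHIBITED_FEATURE]) < THRESHOLD
print(f"learned weights: {w.round(2)}; passes benchmark: {passes}")
```

In this toy the check fails, which is the intended behavior: the training setup rewards the prohibited cognition, and the inspection step catches it.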
You could imagine a law “we will not build AI systems that use >X amount of compute unless they are mechanistically transparent”. Then research on mechanistic transparency reduces the cost of such a law, making it more palatable to implement it.
If mechanistic transparency barely works and/or is extremely expensive, then presumably this law doesn’t look very good compared to other potential laws that would prevent the building of powerful AI, so you’d expect marginal improvements in how good we are at mechanistic transparency to do basically nothing (unless you’re hoping to ‘cross the threshold’ to the point where this law becomes the most viable such law).
The other bullet points make sense though.