(in which case you don’t deploy your AI system, and someone else destroys the world instead).
Can you explain your reasoning behind this a bit more?
Are you saying someone else destroys the world because a capable lab wants to destroy the world, and so as soon as the route to misaligned AGI is possible then someone will do it? Or are you saying that a capable lab would accidentally destroy the world because they would be trying the same approach but either not have those interpretability tools or not be careful enough to use them to check their trained model as well? (Or something else?...)
Or are you saying that a capable lab would accidentally destroy the world because they would be trying the same approach but either not have those interpretability tools or not be careful enough to use them to check their trained model as well?
Ok, I think there’s a plausible success story for interpretability though where transparency tools become broadly available. Every major AI lab is equipped to use them and has incorporated them into their development processes.
I also think it’s plausible that either 1) one AI lab eventually gains a considerable lead/advantage over the others so that they’d have time to iterate after their model fails audit, or 2) if one lab communicated that their audits show a certain architecture/training approach keeps producing models that are clearly unsafe, then the other major labs would take that seriously.
This is why “auditing a trained model” still seems like a useful ability to me.
Update: Perhaps I was reading Rohin’s original comment as more critical of audits than he intended. I thought he was arguing that audits will be useless. But re-reading it, I see him saying that the conjunctiveness of the coordination story makes him “more excited” about interpretability for training, and that it’s “not an either-or”.
Can you explain your reasoning behind this a bit more?
Are you saying someone else destroys the world because a capable lab wants to destroy the world, and so as soon as the route to misaligned AGI is possible then someone will do it? Or are you saying that a capable lab would accidentally destroy the world because they would be trying the same approach but either not have those interpretability tools or not be careful enough to use them to check their trained model as well? (Or something else?...)
This one.
Ok, I think there’s a plausible success story for interpretability though where transparency tools become broadly available. Every major AI lab is equipped to use them and has incorporated them into their development processes.
I also think it’s plausible that either 1) one AI lab eventually gains a considerable lead/advantage over the others so that they’d have time to iterate after their model fails audit, or 2) if one lab communicated that their audits show a certain architecture/training approach keeps producing models that are clearly unsafe, then the other major labs would take that seriously.
This is why “auditing a trained model” still seems like a useful ability to me.
Update: Perhaps I was reading Rohin’s original comment as more critical of audits than he intended. I thought he was arguing that audits will be useless. But re-reading it, I see him saying that the conjunctiveness of the coordination story makes him “more excited” about interpretability for training, and that it’s “not an either-or”.
Yeah I think I agree with all of that. Thanks for rereading my original comment and noticing a misunderstanding :)