The most interesting substantive disagreement I found in the discussion was that I was comparatively much more excited about using interpretability to audit a trained model, and skeptical of interpretability tools being something that could be directly used in a training process without the resulting optimisation pressure breaking the tool, while other people had the reverse view.
Fwiw, I do have the reverse view, but my reason is more that “auditing a trained model” does not have a great story for wins. Like, either you find that the model is fine (in which case it would have been fine if you skipped the auditing) or you find that the model will kill you (in which case you don’t deploy your AI system, and someone else destroys the world instead).
There’s a path to impact where you (a) see that your model is going to kill you and (b) convince everyone else of this, thereby buying you time (or even solving the problem altogether if we then have global coordination to not build AGI since clearly it would destroy us). I feel skeptical about global coordination (especially as it becomes cheaper and cheaper to build AGI over time) but agree that it could buy you time which then allows alignment to “catch up” and solve the problem. However, this pathway seems pretty conjunctive (it makes a difference in worlds where (a) people were uncertain about AGI risk, (b) your interpretability tools successfully revealed evidence that convinced most of them, and (c) the resulting increase in time made the difference).
In contrast, using interpretability tools is impactful if (a) not using the interpretability tools leads to deception (also required in the previous story), and (b) using the interpretability tools gets rid of that deception.
(Obviously “level of conjunctiveness” isn’t the only thing that matters—you also need probabilities for each of the conjuncts—but this feels like the highest-level bit of why I’m more excited about putting tools in the training loop.)
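To make that concrete, here is a toy calculation with entirely made-up probabilities (these are not estimates from anyone in this discussion); it only illustrates that both the number of conjuncts and the probability of each conjunct matter:

```python
# Purely illustrative: treats the two theories of impact as bare conjunctions
# of independent events, with made-up probabilities. Nothing here is a real
# estimate of any of these events.

audit_path = {
    "people were uncertain about AGI risk": 0.5,
    "the audit produced evidence that convinced most of them": 0.3,
    "the resulting extra time made the difference": 0.4,
}
training_loop_path = {
    "training without the tool leads to deception": 0.5,
    "training with the tool removes that deception": 0.4,
}

def p_success(conjuncts: dict) -> float:
    """Probability that every conjunct in the story holds, assuming independence."""
    p = 1.0
    for prob in conjuncts.values():
        p *= prob
    return p

print("audit path        :", p_success(audit_path))          # 0.5 * 0.3 * 0.4 = 0.06
print("training-loop path:", p_success(training_loop_path))  # 0.5 * 0.4       = 0.20
```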
(It’s also not an either-or, e.g. you could use ELK inside of your training loop, and then do Circuits-style mechanistic interpretability as an audit at the end. But if I were forced to go all-in on one of the two options, it would be the training loop one.)
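For concreteness, here is a purely schematic sketch of the structural difference between the two options. Everything in it is a placeholder: deception_score stands in for whatever ELK-style probe or circuits-level check you have in mind, and the "model" and "training" are toys, not a real setup.

```python
# Schematic contrast between "interpretability signal inside the training loop"
# and "the same signal used once as a post-hoc audit". All names here are
# hypothetical stand-ins, not real tools or APIs.

def deception_score(model: dict) -> float:
    """Hypothetical probe: how deceptive the trained artefact looks to the tool."""
    return model["deceptiveness"]

def train(tool_in_loop: bool, steps: int = 50) -> dict:
    model = {"task_loss": 1.0, "deceptiveness": 0.8}
    for _ in range(steps):
        model["task_loss"] = max(0.0, model["task_loss"] - 0.05)
        if tool_in_loop:
            # Optimisation pressure against the tool's signal -- exactly the
            # pressure the quoted passage worries could break the tool in
            # realistic (non-toy) settings.
            model["deceptiveness"] = max(0.0, model["deceptiveness"] - 0.05)
    return model

def audit(model: dict, threshold: float = 0.1) -> bool:
    """Post-hoc use: run the tool once after training and gate deployment on it."""
    return deception_score(model) <= threshold

audit_only = train(tool_in_loop=False)
print("audit-only model passes audit?", audit(audit_only))  # False: problem found only at the end
in_loop = train(tool_in_loop=True)
print("tool-in-loop model passes audit?", audit(in_loop))   # True in this toy setup
```

The toy obviously assumes the in-loop penalty genuinely reduces deceptiveness rather than just hiding it from the probe, which is exactly the assumption under dispute.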
EDIT (March 26, 2023): Coming back to this comment a year later, I think it undersells the “auditing” theory of impact; there are also effects like “if people know you are auditing your models deeply they are less worried that you’ll deploy something risky and so are less likely to race to beat you”. I don’t have a strong opinion on how those effects play out but they do seem important.
Fwiw, I do have the reverse view, but my reason is more that “auditing a trained model” does not have a great story for wins. Like, either you find that the model is fine (in which case it would have been fine if you skipped the auditing) or you find that the model will kill you (in which case you don’t deploy your AI system, and someone else destroys the world instead).
The way I’d put something-like-this is that in order for auditing the model to help (directly), you have to actually be pretty confident in your ability to understand and fix your mistakes if you find one. It’s not like getting a coin to land Heads by flipping it again if it lands Tails: different AGI projects are not independent random variables, and if you don’t get good results the first time, you won’t get good results the next time unless you understand what happened. This means that auditing trained models isn’t really appropriate for the middle of the skill curve.
Instead, it seems like something you could use after already being confident you’re doing good stuff, as quality control. This sharply limits the amount you expect it to save you, but might increase some other benefits of having an audit, like convincing people you know what you’re doing and aren’t trying to play Defect.
(in which case you don’t deploy your AI system, and someone else destroys the world instead).
Can you explain your reasoning behind this a bit more?
Are you saying someone else destroys the world because a capable lab wants to destroy the world, and so as soon as the route to misaligned AGI is possible then someone will do it? Or are you saying that a capable lab would accidentally destroy the world because they would be trying the same approach but either not have those interpretability tools or not be careful enough to use them to check their trained model as well? (Or something else?...)
Or are you saying that a capable lab would accidentally destroy the world because they would be trying the same approach but either not have those interpretability tools or not be careful enough to use them to check their trained model as well?
This one.
Ok, I think there’s a plausible success story for interpretability, though, where transparency tools become broadly available: every major AI lab is equipped to use them and has incorporated them into its development processes.
I also think it’s plausible that either 1) one AI lab eventually gains a considerable lead/advantage over the others, so that they’d have time to iterate after their model fails an audit, or 2) if one lab communicated that their audits show a certain architecture/training approach keeps producing models that are clearly unsafe, then the other major labs would take that seriously.
This is why “auditing a trained model” still seems like a useful ability to me.
Update: Perhaps I was reading Rohin’s original comment as more critical of audits than he intended. I thought he was arguing that audits will be useless. But re-reading it, I see him saying that the conjunctiveness of the coordination story makes him “more excited” about interpretability for training, and that it’s “not an either-or”.
Yeah I think I agree with all of that. Thanks for rereading my original comment and noticing a misunderstanding :)