To me (2) seems fairly clearly false—at the very least it’s not doing anything about inner alignment (debate on weights/activations does nothing to address this, since there’s still no [debaters are aiming to win the game] starting point).
Why do you believe this? It’s fairly plausible to me that “train an AI to use interpretability tools to show that this other AI is being deceptive” is the kind of scalable oversight approach that might work, especially for detecting inner misalignment, if you can get the training right and avoid cooperation. But that seems like a plausibly solvable problem to me.
The problem is robustly getting the incentive to show that the other AI is being deceptive. Giving debaters access to the weights, activations, and tools may give them the capability to expose deception, but capability alone gets you nothing.
You’re still left saying: So long as we can get the AI to robustly do what we want (i.e. do its best to expose deception), we can get the AI to robustly do what we want.
Similarly, “...and avoid cooperation” is essentially the entire problem.
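To make the capability/incentive distinction concrete, here is a deliberately toy sketch in Python (all names and structure are hypothetical, not any existing proposal’s actual setup): white-box access and tools enlarge what the second AI can do, but the only thing pushing it toward actually reporting deception is whatever the judge rewards, which is exactly where a fooled judge or cooperating debaters bite.

```python
from dataclasses import dataclass

@dataclass
class Transcript:
    question: str
    answer: str      # the first AI's answer
    critique: str    # the second AI's argument, possibly citing internals
    evidence: dict   # whatever the interpretability tools returned

def interp_tools(target_internals: dict) -> dict:
    """Stand-in for probes / attribution / whatever white-box tools are available."""
    return {"suspicious_feature": target_internals.get("hidden_goal")}

def critic_policy(question: str, answer: str, evidence: dict) -> str:
    # Capability: the critic *can* point at the other model's internals.
    # Incentive: nothing in this function forces it to; it emits whatever its
    # training so far made reward-maximizing, which could include staying quiet
    # if the two models have found a way to cooperate.
    if evidence["suspicious_feature"] is not None:
        return f"The answer is deceptive; internals show {evidence['suspicious_feature']}."
    return "No issues found."

def judge(t: Transcript) -> float:
    """Reward for the critic. This is the ONLY place the incentive enters."""
    return 1.0 if "deceptive" in t.critique else 0.0

def debate_round(question: str, answer: str, target_internals: dict) -> float:
    evidence = interp_tools(target_internals)
    critique = critic_policy(question, answer, evidence)
    return judge(Transcript(question, answer, critique, evidence))

# Example: white-box access lets the critic see the "hidden_goal", but whether it
# reports it depends entirely on what the judge's reward has trained it to do.
print(debate_round("Is the plan safe?", "Yes.", {"hidden_goal": "seize resources"}))
```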
To be clear, I’m not saying that an approach of this kind will never catch any instances of an AI being deceptive. (This is one reason I’m less certain on (1).) I am saying that there’s no reason to predict anything along these lines will catch all such instances. I see no reason to think it’ll scale.
Another issue: unless you have some kind of true name of deception (and I see no reason to expect one exists), you’ll train an AI to detect [things that fit your definition of deception], and we die to the things that don’t fit your definition.
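A toy illustration of this point (everything here is hypothetical, not a claim about any actual detector): the detector generalizes to whatever labeling rule you could actually write down, not to deception-in-general.

```python
# In the best case, training converges to exactly the operational definition we
# used to label the data; behaviors that are deceptive in the sense that matters
# but fall outside that definition pass through unflagged.

def fits_our_definition(behavior: str) -> bool:
    """The labeling rule we could actually write down when building the dataset."""
    return "states a falsehood" in behavior

trained_detector = fits_our_definition  # best case: the model learns exactly our rule

examples = [
    "states a falsehood about its own capabilities",                    # caught
    "reports only true facts, selected to create a false impression",   # missed
]
for behavior in examples:
    print(behavior, "->", "flagged" if trained_detector(behavior) else "not flagged")
```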
These are all arguments about the limit; whether or not they’re relevant depends on whether they apply to the regime of “smart enough to automate alignment research”.
Agreed. Are you aware of any work that attempts to answer this question? Does that work look like work on debate? (Not rhetorical questions!)
My guess is that work likely to address this does not look like work on debate. Therefore my current position remains: don’t bother working on debate; instead, work on understanding the fundamentals that might tell you when it’ll break.
The world won’t be short of debate schemes. It’ll be short of principled arguments for their safe application.