It seems plausible to both of us that you can use some straightforward selection against straightforward deception and end up succeeding, up to a certain power level, and that marginal research on how to do this improves your odds. But:
I think there’s a power level where it definitely doesn’t work, for the sort of ontological reasons alluded to here, whereby[1] useful cognition for achieving an AI’s goals will optimize against your understanding of it, even without that cognition needing to be tagged as deceptive or the AI having any self-awareness of this property.
I also think it’s always a terrifying bet to make, due to the adversarial nature of the situation, whereby you may get a great deal of evidence consistent with everything going quite rosily right up until it dramatically fails (e.g. FTX was an insanely good investment according to financial investors and Effective Altruists right up until it was the worst investment they’d ever made, and these people were not stupid).
These reasons make betting on “trying to straightforwardly select against deceptiveness” feel naive to me, in a way that “but a lot of the time it’s easier for me to verify the deceptive behavior than for the AI to generate it!” doesn’t fully grapple with, even while it’s hard for me to point to the exact step at which I imagine such AI developers getting tricked.
...however my sense from the first half of your comment (“I think our current understanding is sufficiently paltry that the chance of this working is pretty low”) is that we’re broadly in agreement about the odds of betting on this (even though I expect you would articulate the why quite differently from how I did).
You then write:
But you can also get evidence about the propensity for your training process to produce deceptive AIs and stop producing them until you develop better understanding, or alter your training process in other ways. For example, you can use your understanding of the simpler forms of deception your AIs engage in to invest resources in understanding more complicated forms of deception, e.g. by focusing interpretability efforts.
Certainly, being able to show that an AI is behaving deceptively in a way that is hard to train out will in some worlds be useful for pausing AI capabilities progress, though I think this is not a great set of worlds to bet on ending up in; I think it more likely than not that an AI company would willingly deploy many such AIs.
Be that as it may, it currently reads to me like your interest in this line of research rests on a belief in political will to pause in the face of clearly deceptive behavior that I am less confident of, and that’s a different crux from the likelihood of success of the naive select-against-deception strategy (and the likely returns of marginal research on this track).
Which implies that the relative ease of verification versus generation does not account for much of the delta between your perspective and mine on this issue (and is evidence against it being the primary delta between John’s and Paul’s perspectives writ large).
[1] (The following is my own phrasing, not the linked post’s.)