I’m hearing you say “If there are many different types of strategic deception, and we can easily verify the presence (or absence) of a wide variety of them, this probably gives us a good shot at selecting against all strategically deceptive AIs in our selection process”.
And I’m hearing John’s position as “At a sufficient power level, if a single one of them gets through your training process, you’re screwed. And some of the types will be very hard to verify the presence of.”
And then I’m left with an open question as to whether the former is sufficient to prevent the latter, on which my model of Mark is optimistic (i.e. gives it >30% chance of working) and my model of John is pessimistic (i.e. gives it <5% chance of working).
If you’re committed to producing a powerful AI then the thing that matters is the probability there exists something you can’t find that will kill you. I think our current understanding is sufficiently paltry that the chance of this working is pretty low (the value added by doing selection on non-deceptive behavior is probably very small, but I think there’s a decent chance you just won’t get that much deception). But you can also get evidence about the propensity for your training process to produce deceptive AIs and stop producing them until you develop better understanding, or alter your training process in other ways. For example, you can use your understanding of the simpler forms of deception your AIs engage in to invest resources in understanding more complicated forms of deception, e.g. by focusing interpretability efforts.
It seems plausible to both of us that you can use some straightforward selection against straightforward deception and end up succeeding, up to a certain power level, and that marginal research on how to do this improves your odds. But:
I think there’s a power level where it definitely doesn’t work, for the sort of ontological reasons alluded to here, whereby[1] useful cognition for achieving an AI’s goals will optimize against you understanding it, even without that cognition needing to be tagged as deceptive or the AI having any self-awareness of this property.
I also think it’s always a terrifying bet to make due to the adversarial nature of the situation, whereby you may get a great deal of evidence consistent with everything going quite rosily right up until it dramatically fails (e.g. FTX was an insanely good investment according to financial investors and Effective Altruists right up until it was the worst investment they’d ever made, and these people were not stupid).
These reasons make betting on “trying to straightforwardly select against deceptiveness” feel naive to me in a way that “but a lot of the time it’s easier for me to verify the deceptive behavior than for the AI to generate it!” doesn’t fully grapple with, even though it’s hard for me to point to the exact step at which I imagine such AI developers getting tricked.
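To gesture at why the per-type ease of verification doesn’t reassure me, here is a toy sketch (my own illustration, not something from your comment or the linked post): if each kind of deception is caught independently with some fixed reliability, the chance that at least one kind slips through still grows quickly with the number of kinds you have to cover. The 95% catch rate and the 20 types below are made-up numbers.

```python
# Toy illustration only (my numbers and my independence assumption, not the
# original comment's): probability that at least one deception type evades
# detection when each type is caught independently with a fixed reliability.

def p_at_least_one_slips_through(catch_rates):
    """Return the probability that at least one deception type goes undetected."""
    p_all_caught = 1.0
    for p_catch in catch_rates:
        p_all_caught *= p_catch
    return 1.0 - p_all_caught

# 20 hypothetical deception types, each verified with 95% reliability:
# the chance of missing at least one is roughly 0.64.
print(p_at_least_one_slips_through([0.95] * 20))
```

The toy numbers are only meant to show that “each individual check is easy” and “the whole portfolio of checks is reliable” can come apart, which is the gap I don’t think the verification/generation asymmetry closes.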
...however my sense from the first half of your comment (“I think our current understanding is sufficiently paltry that the chance of this working is pretty low”) is that we’re broadly in agreement about the odds of betting on this (even though I kind of expect you would articulate why quite differently from how I did).
You then write:
But you can also get evidence about the propensity for your training process to produce deceptive AIs and stop producing them until you develop better understanding, or alter your training process in other ways. For example, you can use your understanding of the simpler forms of deception your AIs engage in to invest resources in understanding more complicated forms of deception, e.g. by focusing interpretability efforts.
Certainly, being able to show that an AI is behaving deceptively in a way that is hard to train out will in some worlds be useful for pausing AI capabilities progress, though I think this is not a great set of worlds to be betting on ending up in; I think it more likely than not that an AI company would willingly deploy many such AIs.
Be that as it may, it currently reads to me like your interest in this line of research rests on some belief in a political will to pause in the face of clearly deceptive behavior, a political will I am less confident in, and that’s a different crux from the likelihood of success of the naive select-against-deception strategy (and the likely returns of marginal research on this track).
Which implies that the relative ease of verification versus generation does not account for much of the delta between your perspective and mine on this issue (and is evidence against it being the primary delta between John’s and Paul’s positions writ large).
[1] (The following is my own phrasing, not the linked post’s.)