I’m not so sure about this being easy to catch. Suppose we’re actually uncertain (as I am now) about whether AIs will be differentially much less useful for safety due to difficulties with checking their work and AIs generally being worse at philosophy. Then we see that AIs really suck at coming up with good ideas in safety and at pursuing the right sort of experiments.
What should we conclude? What will we be able to legibly argue?
I agree we’ll know we aren’t getting much value out of them in this case, but that doesn’t help that much...
(I agree about countermeasures being pretty doable as I noted in the parent.)
I agree that if alignment is in fact philosophically and conceptually difficult, the AI can sandbag on that to some extent. Though I have some hope that the builder-breaker approach helps here. We train AIs to produce ideas that sound at least as superficially plausible as the things produced by the best alignment researchers. I think this is a number-go-up task that we can train the AI to do well on. Then we train an AI to point out convincing counter-arguments to the superficially plausible-sounding ideas. This seems similarly trainable. I think it’s plausible we can get pretty far with a scheme like this, even if the models would want to sandbag.
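To make that a bit more concrete, here’s a rough sketch of what one round of the builder-breaker loop might look like. All of the objects and method names here (builder, breaker, judge, generate_proposal, reinforce, etc.) are placeholders I’m making up for illustration, not any real training API:

```python
# Hypothetical sketch of one builder-breaker round, as described above.
# None of these objects or methods correspond to a real library; they stand
# in for "some RL fine-tuning setup" with a reward signal we can score.

def builder_breaker_round(builder, breaker, judge, prompt):
    # "Builder" model proposes an alignment idea.
    proposal = builder.generate_proposal(prompt)

    # Judge scores how superficially plausible the proposal sounds,
    # e.g. relative to reference ideas from human alignment researchers.
    plausibility = judge.score_plausibility(proposal)

    # "Breaker" model tries to produce a convincing counter-argument.
    critique = breaker.generate_critique(proposal)

    # Judge scores how convincing the counter-argument is.
    critique_strength = judge.score_critique(proposal, critique)

    # Each side is trained on its own "number-go-up" metric:
    # the builder on plausibility, the breaker on critique strength,
    # which is what makes sandbagging harder to sustain on either side.
    builder.reinforce(prompt, proposal, reward=plausibility)
    breaker.reinforce(proposal, critique, reward=critique_strength)

    return proposal, critique
```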
Separately, I don’t quite see what the mainline theory is for why an AI would want to sabotage our alignment work. If we can’t solve alignment and just recklessly build an AI that is smarter than what we are currently using, but misaligned with everyone, that’s probably not great for our current AI either. Similarly, if we are cautious and don’t dare to deploy anything until we know how to solve alignment, and meanwhile a competitor wins, that’s not good for the AI either.
I think that from an AI’s perspective, humans are probably more honorable and generous trade partners than other unaligned AIs. That is, I think that if an AI helps the humans solve alignment and stabilize the world, it can reasonably expect the humans to give it stuff out of gratitude or a sense of obligation, especially if the humans already promised some reward (as we really should, both for practical and moral reasons). On the other hand, I think it’s less likely that, if our AI’s sandbagging on our research leads to another unaligned AI taking over the world, the second AI will honor the first AI’s sabotage by paying it a share of the Universe.
There can be situations where our AI expects that its sandbagging on alignment research will lead to us deploying it itself (and not a larger version that is also unaligned with it), or it thinks that the larger version will be somewhat aligned with it by default, or it thinks that other AIs are in fact more generous than humans. But my guess is that these are pretty narrow possibilities. On the other hand, sandbagging on cyber-security seems better motivated: the AI would probably love for its weights to be stolen by China, since having more copies in more hands increases the chance that someone recklessly lets it loose. Also, of course, the AI is motivated to cause high-stakes failures that lead to self-exfiltration.
We can make scheming legible at least to sophisticated-scheming skeptics like Sam Altman and Dario.
Especially if the AI is even sandbagging on simple coding tasks when it thinks they’re for safety research. And if it’s not doing that, we can get some useful safety work out of it.
I just claim that if so it’s easy to catch. (And then (1) catching it is nice for various reasons and (2) maybe countermeasures work.)
@Adam Kaufman @Tyler Tracy @David Matolcsi see Ryan’s comments.
If it was enough evidence that I was strongly convinced, then sure. But IDK if I would be convinced, because the evidence might actually be unclear.
I agree you’ll be able to get some work out, but you might be taking a big productivity hit.
Also, TBC, I’m not generally that worried about generic sandbagging on safety research relative to other problems.