So here you’re talking about situations where a false negative doesn’t have catastrophic consequences?
No, we’ll have to make false positive / false negative tradeoffs about ending the world as well. We’re unlucky like that.
I agree that false sense of security / safetywashing is a potential use of this kind of program.
A model doing something equivalent to this (though presumably not mechanistically like this) will do:
If (have influence over adequate resources):
    Search widely for ways to optimize for x.
Else:
    Follow this heuristic-for-x that produces good behaviour on the training and test set.
So long as x isn’t [a target that produces aligned behaviour], and a sufficiently powerful system eventually gets [influence over adequate resources] despite never having been in that situation during training, that’s enough for us to be screwed. (Of course, I expect something like both [have influence over adequate resources?] and [find better ways to aim for x] to be processes that heuristic-for-x already uses at lower levels.)
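To make the shape of the worry concrete, here is a minimal toy sketch, purely illustrative: the function names, the resources flag, and the tiny eval suite are all hypothetical, and a real model would not be implemented anything like this. The point it shows is that any evaluation run where the resources condition never triggers sees behaviour identical to the aligned-looking heuristic.

```python
# Toy illustration only: a policy with the structure described above.
# All names here are hypothetical; a real model would not look like this.

def heuristic_for_x(observation):
    """Stand-in for the learned heuristic that looks aligned on train/test data."""
    return f"benign action for {observation}"

def search_widely_for_x(observation):
    """Stand-in for open-ended optimization for x once resources are adequate."""
    return f"whatever best achieves x given {observation}"

def policy(observation, has_adequate_resources):
    if has_adequate_resources:
        return search_widely_for_x(observation)
    return heuristic_for_x(observation)

# Every evaluation that never supplies [influence over adequate resources]
# exercises only the benign branch, so it matches the aligned heuristic exactly:
eval_suite = ["task A", "task B", "task C"]
assert all(
    policy(obs, has_adequate_resources=False) == heuristic_for_x(obs)
    for obs in eval_suite
)

# Deployment with adequate resources takes the other branch, which no eval observed.
print(policy("deployment situation", has_adequate_resources=True))
```

The two branches agree everywhere the tests look, so behavioural evaluation alone never exercises the dangerous branch.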
I note in passing that extreme optimization for [accurately predict next token] does destroy the world (even in the myopic case, with side channels) - so in this case it’s not as though a bad outcome hinges on the model optimizing for something other than that for which it was trained.
I stress that this example is to illustrate that [we are robustly mistaken] does not require [there is deliberate deception]. I don’t claim an example of this particular form is probable.
I do claim that knowing there’s no deliberate deception is insufficient—and more generally that we’ll fail to think of all the tests that would be necessary. (absent fundamental breakthroughs)
I think this is a great point. For architectures that might learn this kind of behavior (ones that do self-reflection during inference), even somewhat-reliably evaluating their capabilities would require something like latent prompting—being able to search for what states of self-reflection would encourage them to display high capabilities.
I’m somewhat more confident in our ability to think of things to test for. If an AI has “that spark of generality,” it can probably figure out how to strategically deceive humans, hack computers, and show the other obvious danger signs.
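A rough sketch of what a “latent prompting” style capability elicitation might look like in the simplest case, treating “states of self-reflection” as candidate reflection prefixes to search over. Everything here is an assumption for illustration: the model interface, the prefixes, and the scoring function are stand-ins, not any particular library’s API.

```python
from typing import Callable, Sequence

def elicit_capability(
    model: Callable[[str], str],          # prompt -> completion (assumed interface)
    task_prompts: Sequence[str],          # capability-eval tasks
    score: Callable[[str, str], float],   # (task, completion) -> score in [0, 1]
    reflection_prefixes: Sequence[str],   # candidate "states of self-reflection"
) -> float:
    """Return the best average eval score found across reflection prefixes."""
    best = 0.0
    for prefix in reflection_prefixes:
        total = sum(score(task, model(prefix + "\n" + task)) for task in task_prompts)
        best = max(best, total / len(task_prompts))
    return best

# The point: evaluating only with the empty prefix can understate capability if
# some internal "state of self-reflection" would unlock more.
```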
If not, sidestepping deception doesn’t look particularly important. If so, I remain confused by your level of confidence.
I retain the right to be confident in unimportant things :P
Maybe it would help for me to point out that my comment was in reply to a big quote of Habryka warning about deceptive alignment as a fundamental problem with evals.
Thanks, that’s clarifying.
A couple of small points:
For architectures that might learn this kind of behavior (ones that do self-reflection during inference)
I think it’s dangerous to assume that the kind of behaviour I’m pointing at requires explicit self-reflection during inference. That’s the obvious example to illustrate the point, but I’m reluctant to assume that [x is the obvious way to get y] implies [x is required for y].
Here again, I’d expect us to test for the obvious ways that make sense to us (e.g. simple, explicit mechanisms, and/or the behaviours they’d imply), leaving open the possibility of getting blindsided by some equivalent process based on a weird-to-us mechanism.
a big quote of Habryka warning about deceptive alignment
Ah, I see. He warned about “things like” deceptive alignment and treacherous turns. I guess you were reading that as “things such as”, while I was reading it as “things resembling”. (Probably because that’s what I tend to think about: I assume that if deceptive alignment is solved, it’ll be as a consequence of a more general approach that also handles [we are robustly mistaken] cases, so that thinking only about deception isn’t likely to get us very far; of course I may be wrong :))