To be clear, I agree that unknown unknowns are in some sense the biggest problem in AI safety—as I talk about in the very first paragraph here.
However, I nevertheless think that focusing on deceptive alignment specifically makes a lot of sense. If we define deceptive alignment relatively broadly as any situation where “the reason the model looks aligned is because it is actively trying to game the training signal for the purpose of achieving some ulterior goal” (where training signal doesn’t necessarily mean the literal loss, just anything we’re trying to get it to do), then I think most (though not all) AI existential risk scenarios that aren’t solved by iterative design/standard safety engineering/etc. include that as a component. Certainly, I expect that all of my guesses for exactly how deceptive alignment might be developed, what it might look like internally, etc. are likely to be wrong—and this is one of the places where I think unknown unknowns really become a problem—but I still expect that if we’re capable of looking back and judging “was deceptive alignment part of the problem here?” in situations where things go badly, we’ll end up concluding yes (I’d probably put ~60% on that).
Furthermore, I think there’s a lot of value in taking the most concerning concrete problems that we can yet come up with and tackling them directly. Having as concrete as possible a failure mode to work with is, in my opinion, a really important part of being able to do good research—and for obvious reasons I think it’s most valuable to start with the most concerning concrete failure modes we’re aware of. It’s extremely hard to do good work on unknown unknowns directly—and additionally I think our modal guess for what such unknown unknowns might look like is some variation of the sorts of problems that already seem the most damning. Even for transparency and interpretability, perhaps the most obvious “work on the unknown unknowns directly” sort of research, I think it’s pretty important to have some idea of what we might want to use those sorts of tools for when developing them, and working on concrete failure modes is extremely important to that.
That’s “relatively broad”??? What notion of “deceptive alignment” is narrower than that? Roughly that definition is usually my stock example of a notion of deception which is way too narrow to focus on and misses a bunch of the interesting/probable/less-correlated failure modes (like e.g. the sort of stuff in Worlds Where Iterative Design Fails).
Having as concrete as possible a failure mode to work with is, in my opinion, a really important part of being able to do good research … Even for transparency and interpretability, perhaps the most obvious “work on the unknown unknowns directly” sort of research, I think it’s pretty important to have some idea of what we might want to use those sorts of tools for when developing them, and working on concrete failure modes is extremely important to that.
This I agree with, but I think it doesn’t go far enough. In my software engineering days, one of the main heuristics I recommended was: when building a library, you should have a minimum of three use cases in mind. And make them as different as possible, because the library will inevitably end up being shit for any use case way out of the distribution your three use cases covered.
Same applies to research: minimum of three use cases, and make them as different as possible.
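To make the heuristic above concrete, here is a toy sketch (my own hypothetical example, not from the comment): suppose you are designing a small retry helper. Before fixing the API, you write down three deliberately dissimilar use cases as tests, and let all three shape the interface; the `retry` function and its parameters below are invented purely for illustration.

```python
import time

def retry(fn, attempts=3, delay=0.0, should_retry=lambda exc: True):
    """Call fn, retrying on failure up to `attempts` times total."""
    for i in range(attempts):
        try:
            return fn()
        except Exception as exc:
            # Give up on the last attempt, or when the caller says
            # this kind of error is not worth retrying.
            if i == attempts - 1 or not should_retry(exc):
                raise
            time.sleep(delay)

# Use case 1: a flaky call that succeeds on the second try.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 2:
        raise IOError("transient")
    return "ok"
assert retry(flaky) == "ok"

# Use case 2: a permanent failure should surface immediately,
# which is what forces the `should_retry` hook into the API.
def fatal():
    raise ValueError("bad input")
try:
    retry(fatal, should_retry=lambda e: not isinstance(e, ValueError))
    raised = False
except ValueError:
    raised = True
assert raised

# Use case 3: an immediate success should need no configuration at all.
assert retry(lambda: 42) == 42
```

The point of the third, boring use case is exactly the one made above: without it, it is easy to end up with an API where even the trivial case demands ceremony.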
Any definition that makes mention of the specific structure/internals of the model.