johnswentworth comments on Why deceptive alignment matters for AGI safety

johnswentworth 16 Sep 2022 4:27 UTC
LW: 7 AF: 5
If we define deceptive alignment relatively broadly as any situation where “the reason the model looks aligned is because it is actively trying to game the training signal for the purpose of achieving some ulterior goal”...
That’s “relatively broad”??? What notion of “deceptive alignment” is narrower than that? Something close to that definition is my usual stock example of a notion of deception that is way too narrow to focus on, one that misses a bunch of the interesting/probable/less-correlated failure modes (e.g. the sort of stuff in Worlds Where Iterative Design Fails).
Having as concrete as possible a failure mode to work with is, in my opinion, a really important part of being able to do good research … Even for transparency and interpretability, perhaps the most obvious “work on the unknown unknowns directly” sort of research, I think it’s pretty important to have some idea of what we might want to use those sorts of tools for when developing, and working on concrete failure modes is extremely important to that.
This I agree with, but I think it doesn’t go far enough. In my software engineering days, one of the main heuristics I recommended was: when building a library, you should have a minimum of three use cases in mind. And make them as different as possible, because the library will inevitably end up being shit for any use case way out of the distribution your three use cases covered.
Same applies to research: minimum of three use cases, and make them as different as possible.
- evhub 16 Sep 2022 4:46 UTC
  LW: 11 AF: 4
  
  What notion of “deceptive alignment” is narrower than that?
  
  Any definition that makes mention of the specific structure/internals of the model.