‘Playing the training game’ at all corresponds to an alignment difficulty level of 4 because better than human behavioral feedback and oversight can reveal it and you don’t need interpretability.
(Situationally aware) Deception by default corresponds to a difficulty level of 6 because if it’s sufficiently capable no behavioral feedback will work and you need interpretability-based oversight
Gradient hacking by default corresponds to a difficulty level of 7 because the system will also fight interpretability based oversight and you need to think of something clever probably through new research at ‘crunch time’.
I think that, on the categorization I provided,
‘Playing the training game’ at all corresponds to an alignment difficulty level of 4 because better than human behavioral feedback and oversight can reveal it and you don’t need interpretability.
(Situationally aware) Deception by default corresponds to a difficulty level of 6 because if it’s sufficiently capable no behavioral feedback will work and you need interpretability-based oversight
Gradient hacking by default corresponds to a difficulty level of 7 because the system will also fight interpretability based oversight and you need to think of something clever probably through new research at ‘crunch time’.