I agree that this is a risk. I also foresee some related problems that occur at lower levels of superintelligence (and thus would likely become problems first). For instance, the model sandbagging its capabilities in a testing environment, preparing for a future deceptive turn.
I have been thinking about possible solutions. My current best idea is that we can always start out testing the model in a severely handicapped state where it is totally non-functional. Over subsequent evaluation runs, the handicapping is incrementally relaxed. Careful analysis of the extrapolated performance improvements vs measured improvements should highlight dangerous conditions such as sandbagging before the model’s handicap has been relaxed all the way to superintelligence.
I agree that this is a risk. I also foresee some related problems that occur at lower levels of superintelligence (and thus would likely become problems first). For instance, the model sandbagging its capabilities in a testing environment, preparing for a future deceptive turn. I have been thinking about possible solutions. My current best idea is that we can always start out testing the model in a severely handicapped state where it is totally non-functional. Over subsequent evaluation runs, the handicapping is incrementally relaxed. Careful analysis of the extrapolated performance improvements vs measured improvements should highlight dangerous conditions such as sandbagging before the model’s handicap has been relaxed all the way to superintelligence.