Do we think there is a non-negligible chance that an AI that doesn’t have hidden memory, only English-language CoT, will be able to evade our monitoring and execute a rogue deployment?
Yes, I do think this. I think the situation looks worse if the AI has hidden memory, but I don’t think we’re fine if the model doesn’t have it, and I don’t think we’re doomed if it does.