When thinking about future dangerous models, I always assume they have access to some sort of black-box memory. Do we think there is a non-negligible chance that an AI without hidden memory, with only an English-language CoT, will be able to evade our monitoring and execute a rogue deployment? (Not a rhetorical question; there might be a way I haven’t thought of.)
So I think that assuming the AI is stateless when thinking about future risk is a mistake, since the vast majority of the risk comes from AIs for which this assumption does not hold. This makes the post’s claim significantly weaker, as the untrusted AI can simply learn which of its attacks get replaced and improve its attacks based on that data.
It still needs to avoid landing in the top 1% of suspiciousness scores and getting audited, though, since it would be dangerous for the AI to ever be caught writing text that clearly intends to jailbreak the monitor. So I agree that even an AI with black-box memory will still be severely constrained in how many and how egregious jailbreak attempts it can make.
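To make that concrete, here is a minimal sketch of the kind of audit/replace policy I have in mind (the thresholds, names, and numbers are my own illustrative assumptions, not something specified in the post): outputs above an audit cutoff get human review, while a larger slice merely gets replaced by a trusted model, so an attack attempt is only fatal for the AI if it lands in the audited tail.

```python
import numpy as np

# Illustrative sketch of a monitor-based audit/replace policy.
# All parameters here are hypothetical; a real deployment would set them.
AUDIT_FRACTION = 0.01     # audit the top 1% most suspicious outputs
REPLACE_FRACTION = 0.10   # replace a larger fraction without auditing

def run_policy(suspiciousness: np.ndarray):
    """Given monitor suspiciousness scores for a batch of outputs, return
    which indices get audited (human review, potential incrimination) and
    which merely get replaced by a trusted model's output."""
    audit_cutoff = np.quantile(suspiciousness, 1 - AUDIT_FRACTION)
    replace_cutoff = np.quantile(suspiciousness, 1 - REPLACE_FRACTION)
    audited = np.where(suspiciousness >= audit_cutoff)[0]
    replaced = np.where(
        (suspiciousness >= replace_cutoff) & (suspiciousness < audit_cutoff)
    )[0]
    return audited, replaced

# An attack attempt must score below the audit cutoff to avoid being caught,
# even if the AI is willing to have that attempt silently replaced.
scores = np.random.rand(10_000)
audited, replaced = run_policy(scores)
print(f"audited: {len(audited)}, replaced without audit: {len(replaced)}")
```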
Do we think there is a non-negligible chance that an AI without hidden memory, with only an English-language CoT, will be able to evade our monitoring and execute a rogue deployment?
Yes, I do think this. I think the situation looks worse if the AI has hidden memory, but I don’t think we’re either fine if the model doesn’t have it or doomed if it does.