What is the probability they intentionally fine-tuned it to hide canary contamination?
Seems like an obviously very silly thing to do. But with things like the NDA, my prior on OpenAI being deceptive to its own detriment is not that low.
I’m pretty sure it wouldn’t forget the string.
It seems like such an obviously stupid thing to do that my priors aren’t very high (though you’re right that they’re slightly higher because it’s OpenAI). I think it’s telling, however, that neither Claude nor Gemini shies away from revealing the canary string.
But the probability? :O
Maybe like 10%?