joshc comments on Anthropic: Three Sketches of ASL-4 Safety Case Components

joshc 11 Nov 2024 21:32 UTC
3 points
0
Note that if the model is deployed on many more examples than it’s evaluated on, it can be pretty confident that it’s in deployment on any particular example just on priors, even if there’s no distribution shift. See here for more.
My initial read was that indistinguishability was used to justify that the trusted monitor couldn’t distinguish the deployment distribution.

I’m not sure if that’s what Roger meant—if so, it’s still a bit confusing since then he wouldn’t have needed to call the monitor ‘trusted.’