Note that if the model is deployed on many more examples than it’s evaluated on, it can be pretty confident that it’s in deployment on any particular example just on priors, even if there’s no distribution shift. See here for more.
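To make the priors argument concrete, here is a minimal sketch of the Bayesian calculation. The function name `p_deployment`, the example counts, and the assumption that examples are sampled uniformly from the combined pool of evaluation and deployment inputs are all hypothetical, chosen only for illustration. A likelihood ratio of 1 corresponds to "no distribution shift": the input itself carries no evidence either way.

```python
def p_deployment(n_eval: int, n_deploy: int, likelihood_ratio: float = 1.0) -> float:
    """Posterior probability that a given example comes from deployment.

    likelihood_ratio = P(input | deployment) / P(input | evaluation).
    A value of 1.0 means the input looks equally likely under both
    distributions, i.e. there is no distribution shift.
    Assumes examples are drawn uniformly from the combined pool.
    """
    prior_odds = n_deploy / n_eval                  # odds from counts alone
    posterior_odds = prior_odds * likelihood_ratio  # Bayes' rule in odds form
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical numbers: 1,000 evaluation examples, 1,000,000 deployment examples.
print(p_deployment(1_000, 1_000_000))  # roughly 0.999 with no shift at all
```

Under these assumed counts, the model would assign about 99.9% to being in deployment without any distinguishing feature in the input. Even an input that looks ten times more evaluation-like (`likelihood_ratio=0.1`) still leaves the posterior at about 99%.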
My initial read was that indistinguishability was used to justify that the trusted monitor couldn’t distinguish the deployment distribution from the evaluation distribution.
I’m not sure if that’s what Roger meant—if so, it’s still a bit confusing since then he wouldn’t have needed to call the monitor ‘trusted.’