I’m similarly confused. My instinct is that P( AI is safe ) == P( AI is safe | AI said X AND gatekeeper can’t identify safe AI ). The standard assumption is that ( AI significantly smarter than gatekeeper ) ⇒ ( gatekeeper can’t identify safe AI ), so the gatekeeper’s prior should never change, no matter what X the AI says.
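To spell out that intuition with Bayes’ rule (my own algebra, not something from the original posts): reading “gatekeeper can’t identify safe AI” as “any utterance X is equally likely from a safe or an unsafe AI”, i.e. P( X | safe ) = P( X | unsafe ), the likelihoods cancel and the update is a no-op:

$$
P(\text{safe} \mid X)
= \frac{P(X \mid \text{safe})\, P(\text{safe})}
       {P(X \mid \text{safe})\, P(\text{safe}) + P(X \mid \text{unsafe})\, P(\text{unsafe})}
= P(\text{safe}).
$$

So under that reading, no transcript should move the gatekeeper at all; any argument that the AI does move the gatekeeper has to deny the equal-likelihood premise somewhere.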