I’m not sure whether the distilled agent is supposed to query the overseer at run time, or whether this is supposed to be a primary or a backup safety mechanism.
If the overseer isn’t available, the agent may need to act without querying them. The goal is just to produce an agent that tries to do the right thing in light of that uncertainty. To argue that we get a good outcome, we need to make some argument about the agent’s competence and the difficulty of the situation, and having query access to the overseer can make the problem easier.
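To make the fallback behavior concrete, here is a minimal sketch in Python. Everything in it (the `Agent` class, the callables, the timeout convention) is a hypothetical illustration of the idea above, not anything specified in the agenda:

```python
from typing import Callable, Optional

class Agent:
    """Queries the overseer when possible; otherwise acts on its own judgment."""

    def __init__(self, policy: Callable[[str], str],
                 overseer: Optional[Callable[[str], str]] = None):
        self.policy = policy      # the agent's own best judgment
        self.overseer = overseer  # query access, which may be absent

    def act(self, situation: str) -> str:
        # Query access to the overseer, when it works, makes the problem easier.
        if self.overseer is not None:
            try:
                return self.overseer(situation)
            except TimeoutError:
                pass  # overseer unreachable at run time
        # Otherwise the agent must try to do the right thing on its own,
        # in light of its uncertainty about what the overseer would say.
        return self.policy(situation)
```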
Otherwise, I don’t see how we could detect whether the AI is actually a daemon that’s optimizing for some unaligned goal and merely biding its time until it can execute a treacherous turn.
Here are my views on this problem.
The mechanism Alex describes is intended only for evaluating proposed actions.
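As a rough illustration of what evaluation-only use could look like (the function name, signature, and threshold convention are my assumptions, not part of the described mechanism): the evaluator scores candidate actions that are proposed to it, and is never asked to generate an action itself.

```python
from typing import Callable, Optional, Sequence

def choose_action(candidates: Sequence[str],
                  evaluate: Callable[[str], float],
                  threshold: float = 0.0) -> Optional[str]:
    """Return the best-rated candidate the evaluator approves, or None.

    The evaluator only scores proposed actions; generating candidates
    is the agent's job, not the evaluator's.
    """
    scored = [(evaluate(a), a) for a in candidates]
    best_score, best_action = max(scored, default=(float("-inf"), None))
    return best_action if best_score >= threshold else None
```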