This way, every time we’re training a distilled agent, we train it to want to clarify with its overseer (i.e., us assisted by a team of corrigible assistants) whenever it’s uncertain about what we would approve of. Our corrigible assistants, in turn, either answer the question confidently or clarify with us if they’re uncertain about their answers.
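To make this concrete, here is a minimal, purely illustrative sketch of what such a training signal could look like. None of the names, structure, or thresholds below come from Paul’s writing; they are stand-ins for "reward clarifying exactly when the agent is unsure what we would approve of."

```python
# Hypothetical sketch only -- not the actual distillation procedure.
# The distilled agent is rewarded for asking a clarifying question when it is
# genuinely uncertain about overseer approval, and otherwise rewarded by the
# overseer's approval of the object-level action it proposes.

from dataclasses import dataclass

@dataclass
class Proposal:
    action: str        # an object-level action, or the special token "CLARIFY"
    confidence: float  # how confident the agent is about what we would approve of

def training_reward(p: Proposal, overseer_approval: float,
                    uncertainty_threshold: float = 0.8) -> float:
    if p.action == "CLARIFY":
        # Reward clarifying only when the agent was genuinely unsure;
        # penalize gratuitous queries slightly.
        return 1.0 if p.confidence < uncertainty_threshold else -0.1
    # Otherwise the overseer (us plus corrigible assistants) scores the action.
    return overseer_approval

# An uncertain agent that asks for clarification is rewarded; a confident
# agent is scored on its proposed action.
print(training_reward(Proposal("CLARIFY", confidence=0.4), overseer_approval=0.0))      # 1.0
print(training_reward(Proposal("deploy_patch", confidence=0.9), overseer_approval=0.9)) # 0.9
```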
I’m not sure if the distilled agent is supposed to query the overseer at run time, and whether this is supposed to be a primary or backup safety mechanism (i.e., if the distilled agent is supposed to be safe/aligned even if the overseer isn’t available). I haven’t seen it stated clearly anywhere, and Paul didn’t answer a direct question that I asked about this. Do you have a citation for the answer that you’re giving here?
1.1.4: Why should we expect the agent’s answers to correspond to its cognition at all?
We don’t actually have any guarantee that they do, but giving honest answers is probably the easiest way for the agent to maximize its reward.
We also need an independent way to interpret the agent’s cognition and check whether it matches with the answers, right? Otherwise, I don’t see how we could detect whether the AI is actually a daemon that’s optimizing for some unaligned goal and merely biding its time until it can execute a treacherous turn. (If you ask it any questions, it’s going to emulate how an aligned AI would answer, so you can’t distinguish between them without some independent way of checking the answers.)
I’m not sure if the distilled agent is supposed to query the overseer at run time, and whether this is supposed to be a primary or backup safety mechanism
If the overseer isn’t available, the agent may need to act without querying them. The goal is just to produce an agent that tries to do the right thing in light of that uncertainty. To argue that we get a good outcome, we need to make some argument about its competence / the difficulty of the situation, and having query access to the overseer can make the problem easier.
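As a toy illustration of that last point (the names and threshold below are made up for this comment, not part of any concrete proposal), the run-time behavior could look like: ask the overseer about an option when they are reachable and the agent is uncertain, otherwise fall back on the agent’s own estimate of what we would approve.

```python
# Toy decision rule: query access to the overseer makes the problem easier,
# but when the overseer is unavailable the agent acts on its best guess.

from typing import Callable, Optional

def choose_action(candidates: list[str],
                  estimated_approval: Callable[[str], float],
                  uncertainty: Callable[[str], float],
                  ask_overseer: Optional[Callable[[str], float]] = None,
                  uncertainty_threshold: float = 0.3) -> str:
    def score(option: str) -> float:
        if ask_overseer is not None and uncertainty(option) > uncertainty_threshold:
            # Overseer is reachable: resolve the uncertainty by asking.
            return ask_overseer(option)
        # Overseer unavailable, or the agent is already confident:
        # use the agent's own estimate of approval.
        return estimated_approval(option)
    return max(candidates, key=score)

# With the overseer reachable, the uncertain option "b" gets resolved by asking;
# without the overseer, the agent falls back on its own estimates and picks "a".
est = lambda a: {"a": 0.6, "b": 0.5}[a]
unc = lambda a: {"a": 0.1, "b": 0.9}[a]
print(choose_action(["a", "b"], est, unc, ask_overseer=lambda a: 0.9 if a == "b" else 0.6))  # "b"
print(choose_action(["a", "b"], est, unc))                                                   # "a"
```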
Otherwise, I don’t see how we could detect whether the AI is actually a daemon that’s optimizing for some unaligned goal and merely biding its time until it can execute a treacherous turn.
Here are my views on this problem.
The mechanism Alex describes is only intended to be used for evaluating proposed actions.