So we could instead directly distill H-derived corrigibility from H, which could then be used to directly train a powerful agent with that property.
I’m imagining the problem statement for distillation being: we have a powerful aligned/corrigible agent. Now we want to train a faster agent which is also aligned/corrigible.
If there is a way to do this without starting from a more powerful agent, then I agree that we can skip the amplification process and jump straight to the goal.