This sounds right to me!
My only note is that I think the setup can be simplified a bit. The central idea I have in mind is that the AI does something like:
1. “Think” about what to do next, for up to some max period of time (“what to do next” can be “think more, with prompt X”).
2. Do it.
3. Repeat.
This seems like a pretty natural way for an “agent” to operate, and then every instance of step #1 is an “auditable step” in your terminology. (And the audits are done by comparing a few rollouts of that step, and performing gradient descent without executing any of them.)
There are probably subtleties I’m missing, but I think this points pretty well at what I tend to think of as the hopes of process-based supervision.