This sounds right to me!
My only note is that I think the setup can be simplified a bit. The central idea I have in mind is that the AI does something like:
1. “Think” about what to do next, for up to some max period of time (“what to do next” can be “think more, with prompt X”).
2. Do it.
3. Repeat.
This seems like a pretty natural way for an “agent” to operate, and then every instance of step #1 is an “auditable step” in your terminology. (And the audits are done by comparing a few rollouts of that step, and performing gradient descent without executing any of them.)
There are probably subtleties I’m missing, but I think this points pretty well at what I tend to think of as the hopes of process-based supervision.