I noticed myself mentally grading the entries by some extra criteria, the main ones being something like “taking-over-the-world competitiveness” (TOTWC, or TOW for short) and “would I actually trust this farther than I could throw it, once it’s trying to operate in novel domains?” (WIATTFTICTIOITTOIND, or WIT for short).
A raw statement of my feelings:
Reinforcement learning + transparency tools: High TOW, Very Low WIT.
Imitative amplification + intermittent oversight: Medium TOW, Low WIT.
Imitative amplification + relaxed adversarial training: Medium TOW, Medium-low WIT.
Approval-based amplification + relaxed adversarial training: Medium TOW, Low WIT.
Microscope AI: Very Low TOW, High WIT.
STEM AI: Low TOW, Medium WIT.
Narrow reward modeling + transparency tools: High TOW, Medium WIT.
Recursive reward modeling + relaxed adversarial training: High TOW, Low WIT.
AI safety via debate with transparency tools: Medium-Low TOW, Low WIT.
Amplification with auxiliary RL objective + relaxed adversarial training: Medium TOW, Medium-low WIT.
Amplification alongside RL + relaxed adversarial training: Medium-low TOW, Medium WIT.
Nice gradings. I was confused about this one, though:
Narrow reward modeling + transparency tools: High TOW, Medium WIT.
Why is this among the most trustworthy things on the list? To me it feels like one of the least trustworthy, because it’s one of the few where we haven’t done recursive safety checks (especially given the lack of adversarial training).
Basically because I think that amplification/recursion, as I understand it’s currently meant to work, is more trouble than it’s worth. It’s going to produce things that have high fitness according to the selection process applied, and in the limit those things are going to be bad.
On the other hand, you might see this as me claiming that “narrow reward modeling” includes a lot of important unsolved problems. HCH is well-specified enough that you can talk about doing it with current technology. But fulfilling the verbal description of narrow value learning requires some advances in modeling the real world (unless you literally treat the world as a POMDP and humans as Boltzmann-rational agents, in which case we’re back down to bad computational properties and also bad safety properties), which gives me the wiggle room to be hopeful.
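For concreteness, here is a minimal, purely illustrative sketch of what treating the world as a known (PO)MDP and humans as Boltzmann-rational agents amounts to for reward inference: assume the human picks actions with probability proportional to exp(β · Q*(s, a)) under some candidate reward, then score candidate rewards by the likelihood of observed demonstrations. The toy environment, the candidate rewards, and β are all assumptions made up for this sketch, not anything from the exchange above.

```python
# Hypothetical toy sketch: reward inference under a Boltzmann-rational human model,
# assuming the environment (a tiny deterministic chain) is fully known.
import numpy as np

n_states, n_actions, gamma, beta = 4, 2, 0.9, 2.0

# Known deterministic dynamics: T[s, a] gives the next state.
T = np.array([[1, 0],
              [2, 0],
              [3, 1],
              [3, 2]])

def q_values(reward):
    """Value iteration on the toy MDP; returns Q[s, a] for a state-based reward."""
    V = np.zeros(n_states)
    for _ in range(200):
        Q = reward[:, None] + gamma * V[T]   # r(s) + gamma * V(s')
        V = Q.max(axis=1)
    return Q

def boltzmann_loglik(demos, reward):
    """Log-likelihood of (state, action) demos if the human is Boltzmann-rational."""
    Q = q_values(reward)
    logp = beta * Q - np.log(np.exp(beta * Q).sum(axis=1, keepdims=True))
    return sum(logp[s, a] for s, a in demos)

# Two hypothetical candidate rewards the human might be optimizing.
candidate_rewards = {
    "likes_state_3": np.array([0.0, 0.0, 0.0, 1.0]),
    "likes_state_0": np.array([1.0, 0.0, 0.0, 0.0]),
}

demos = [(0, 0), (1, 0), (2, 0)]  # observed human actions: always move "forward"

for name, r in candidate_rewards.items():
    print(name, boltzmann_loglik(demos, r))
```

In this toy setting everything is tractable; as I read the parenthetical above, the “bad computational properties” point is that computing Q* (or anything like it) for the real world is not, and the Boltzmann-rationality assumption about the human is also where the “bad safety properties” come in.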