What % of alignment is crucial to get right?
Most alignment plans involve getting the AI to a point where it cares about human values, then having it use its greater intelligence to solve problems in ways we didn’t think of.
Some alignment plans literally involve finding clever ways to get the AI to solve alignment itself in some safe way. [1]
This suggests something interesting: Every alignment plan, explicitly or not, leaves some amount of “alignment work” to the AI (even if that amount is “none”), and leaves the remainder for humans to work out. Generally (but not always!), the idea is that humans must get X% of alignment knowledge right before launching the AI, lest it become misaligned.
I don’t see many groups laying out explicit reasons for selecting which “built-in vs. learned” mix of alignment knowledge their plan aims for. Of course, most (all?) plans already have such a mix by default, and maybe this whole concept is sorta trivial anyway. But I haven’t seen this precise consideration expressed as a ratio anywhere else.
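To make the ratio framing a bit more concrete, here’s a minimal toy sketch. The subproblem list and the example plan are made up purely for illustration, not a claimed decomposition of alignment:

```python
# Toy illustration of the "built-in vs. learned" split: what fraction of the
# alignment work does a plan require humans to finish before launch, and what
# fraction is left for the AI to work out? The subproblem names below are
# made-up placeholders.

SUBPROBLEMS = {
    "value specification",
    "corrigibility",
    "interpretability",
    "scalable oversight",
    "ontology identification",
}

def human_work_fraction(solved_by_humans_before_launch: set[str]) -> float:
    """Fraction of the listed subproblems the plan requires humans to solve
    up front; the remainder is delegated to the AI."""
    return len(solved_by_humans_before_launch & SUBPROBLEMS) / len(SUBPROBLEMS)

# A plan that hands everything except value specification to the AI:
print(human_work_fraction({"value specification"}))  # 0.2
```

Treating every subproblem as equal weight is obviously a simplification; the point is only to have a rough number with which to compare plans.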
(I got some feedback on this as a post, and they noted that the idea is probably too abstract to be useful for many plans. Sure enough, when I helped with the AI-plans.com critique-a-thon, most “plans” were actually just small things that could “slot into” a larger alignment plan. Only certain kinds of “full stack alignment” plans could be usefully compared with this idea.)
[1] For a general mathematization of something like this, see the “QACI” plan by Tamsin Leake at Orthogonal.
I like your observation. I didn’t realize at first that I had seen it before, from you during the critique-a-thon! (Thank you for helping out with that, by the way!)
A percentage or ratio of the “amount” of alignment left to the AI sounds useful as a fuzzy heuristic in some situations, but I think it is probably a little too fuzzy to get at the failure mode(s) of a given alignment strategy. My suspicion is that which parts of alignment are left to the AI will have much more to say about the success of alignment than how many of those checkboxes are checked. Where I think this proposed heuristic succeeds is when the ratio of human-to-AI responsibility in solving alignment is set very low. By my lights, that is an indication that the plan is more holes than cheese.
(How much work is left to a separate helper AI might be its own category. I have some moderate opinions on OpenAI’s Superalignment effort, but those are very tangential thoughts.)