I like your observation. I didn’t realize at first that I had seen it before, from you during the critique-a-thon! (Thank you for helping out with that, by the way!)
A percentage or ratio of the “amount” of alignment left to the AI sounds useful as a fuzzy heuristic in some situations, but I think it is probably a little too fuzzy to get at the failure mode(s) of a given alignment strategy. My suspicion is that which parts of alignment are left to the AI will say much more about the success of alignment than how many of those checkboxes are checked. Where I think this proposed heuristic succeeds is when the ratio of human/AI responsibility in solving alignment is set very low. By my lights, that is an indication that the plan is more holes than cheese.
(How much work is left to a separate helper AI might be its own category. I have some moderate opinions on OpenAI’s Superalignment effort, but those are very tangential thoughts.)