Silly question warning.
You think that when an AI performs a bad action (say, removing the diamond), the AI has to have knowledge that the diamond is in fact no longer there, even when the camera (falsely) shows that the diamond is there and the human confirms that the diamond is there.
You call this ELK.
You want the human to have access to this knowledge, as this is useful for making decisions the human wants.
This is hard. So you have people propose strategies for doing it.
And then people try to explain why each strategy wouldn’t work.
Rinse and repeat.
Once you have a proposal that nobody is able to show doesn’t work… profit?
Correct any misunderstandings in my basic overview above.
This broadly seems right. Some details:
The “explain why that strategy wouldn’t work” step specifically takes the form of “describing a way the world could be where that strategy demonstrably doesn’t work” (rather than more heuristic arguments).
Once we have a proposal where we try really hard to come up with situations where it could demonstrably fail, and can’t think of any, we will probably need to do lots of empirical work to figure out if we can implement it and if it actually works in practice. But we hope that this exercise will teach us a lot about the nature of the empirical work we’ll need to do, as well as providing more confidence that the strategy will generalize beyond what we are able to test in practice. (For example, ELK was highlighted as a problem in the first place after ARC researchers thought a lot about possible failure modes of iterated amplification.)
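To make the counterexample step concrete, here is a minimal toy sketch, assuming a drastically simplified predictor whose latent state directly encodes whether the diamond is present and whether the camera was tampered with (the `LatentState` fields and reporter functions are hypothetical illustrations of the report's predictor/reporter framing, not ARC's actual setup). It shows why training a reporter only on human-labelable cases can't distinguish an honest reporter from a "human simulator" that just reports what the human watching the camera would conclude:

```python
# Toy illustration of an ELK counterexample (hypothetical simplification,
# not ARC's code): on every case a human can label correctly, the honest
# reporter and the human simulator give identical answers, so training
# data alone cannot tell them apart.

import random
from dataclasses import dataclass

@dataclass
class LatentState:
    diamond_present: bool   # what the predictor actually "knows"
    camera_tampered: bool   # whether the screen shows a fake diamond

def camera_shows_diamond(s: LatentState) -> bool:
    # All the human ever sees.
    return s.diamond_present or s.camera_tampered

def honest_reporter(s: LatentState) -> bool:
    # What we want: report the predictor's actual knowledge.
    return s.diamond_present

def human_simulator(s: LatentState) -> bool:
    # The failure mode: report what a human watching the camera would believe.
    return camera_shows_diamond(s)

# Training data: cases simple enough that the human's labels are reliable,
# i.e. the camera was never tampered with.
train = [LatentState(diamond_present=random.random() < 0.5, camera_tampered=False)
         for _ in range(1000)]
assert all(honest_reporter(s) == human_simulator(s) for s in train)

# The counterexample world: tampering hides the theft, and the two
# reporters, indistinguishable in training, now disagree.
hard_case = LatentState(diamond_present=False, camera_tampered=True)
print(honest_reporter(hard_case))   # False: the diamond is gone
print(human_simulator(hard_case))   # True: the human is fooled
```

This is the shape of a "way the world could be where the strategy demonstrably doesn't work": both reporters fit the training data perfectly, so the breaker only needs to exhibit one tampered case where they come apart.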