A very toy model of ontology mismatch (which, in my tentative guess, is the general barrier on the way to corrigibility) and impact minimization:
You have a set of boolean variables, a known boolean formula WIN, and an unknown boolean formula UNSAFE. Your goal is to change the current safe but not winning assignment of variables into a still safe but winning assignment. You have no feedback, and if you hit an UNSAFE assignment, it’s an instant game over. You have literally no idea about the composition of the UNSAFE formula.
The obvious solution here is to change as few variables as possible. (Proof of this is an exercise for the reader.)
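To make the "change as few variables as possible" strategy concrete, here is a minimal brute-force sketch. It assumes WIN is available as a Python predicate over a tuple of 0/1 values; the names `minimal_change` and `example_win`, and the particular formula used, are hypothetical and only for illustration. It tries all flips of size 0, then 1, then 2, and so on, and returns the first winning assignment it finds. It says nothing about UNSAFE, which stays unknown.

```python
from itertools import combinations

def minimal_change(start, win):
    """Return (assignment, flips): a winning assignment at the smallest
    Hamming distance from `start`, or (None, None) if WIN is unsatisfiable."""
    n = len(start)
    for k in range(n + 1):                      # try 0 flips, then 1, then 2, ...
        for idxs in combinations(range(n), k):  # every set of k variables to flip
            candidate = list(start)
            for i in idxs:
                candidate[i] = 1 - candidate[i]
            if win(tuple(candidate)):
                return tuple(candidate), k
    return None, None

# Hypothetical WIN formula, for illustration only: a0 AND (a1 OR a2).
def example_win(a):
    return bool(a[0] and (a[1] or a[2]))

result = minimal_change((0, 0, 0), example_win)
print(result)
```

The search is exponential in the number of variables, which is fine for a toy model but not meant as a practical method.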
Another problem: you are on a 2D plane. You are at point (2,2). You need to reach the line x+y=8. Everything other than this line and your starting point is a potential minefield: if you hit the wrong spot, everything blows up. You don’t know anything about the distribution of the dangerous zones. Therefore, your shortest and safest trajectory is the straight path to (4,4), the point on the line nearest to you.
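The choice of (4,4) can be checked with the standard orthogonal projection of a point onto a line. The sketch below assumes the line is written as a*x + b*y = c; the function name `nearest_point_on_line` is hypothetical.

```python
import math

def nearest_point_on_line(p, a, b, c):
    """Orthogonal projection of point p onto the line a*x + b*y = c."""
    px, py = p
    # Signed step along the normal vector (a, b) needed to land on the line.
    t = (c - a * px - b * py) / (a * a + b * b)
    return (px + t * a, py + t * b)

target = nearest_point_on_line((2, 2), 1, 1, 8)
distance = math.dist((2, 2), target)
print(target, distance)
```

For the start (2,2) and the line x+y=8 this gives the point (4,4) at distance 2√2, roughly 2.83.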
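You can rewrite x+y=8 as a boolean formula over the bits of x and y, encoding each coordinate as a 4-bit binary number, so that (2,2) becomes (0,0,1,0,0,0,1,0). But now we have a problem: from the perspective of “boolean formula ontology” the smallest necessary change is a single flipped variable, for example (0,1,1,0,0,0,1,0), which is equivalent to moving to point (6,2) (flipping the same bit of y gives (2,6) instead). The resulting trajectory has length 4, which is longer than the trajectory of length 2√2 ≈ 2.83 between (2,2) and (4,4). Conversely, the shortest trajectory on the 2D plane ends at (4,4) = (0,1,0,0,0,1,0,0), which changes 4 variables out of 8.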
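The mismatch between the two notions of "small change" can be checked directly. This is a minimal sketch assuming the 4-bits-per-coordinate, most-significant-bit-first encoding implied by (0,0,1,0,0,0,1,0); the helper names `encode` and `hamming` are hypothetical.

```python
import math

def encode(x, y, bits=4):
    """Concatenate the fixed-width binary digits of x and y, most significant bit first."""
    return tuple(int(ch) for ch in format(x, f"0{bits}b") + format(y, f"0{bits}b"))

def hamming(u, v):
    """Number of positions at which two equal-length bit tuples differ."""
    return sum(a != b for a, b in zip(u, v))

start = (2, 2)
for target in [(6, 2), (4, 4)]:
    flips = hamming(encode(*start), encode(*target))  # cost in the boolean ontology
    dist = math.dist(start, target)                   # cost in the 2D-plane ontology
    print(target, flips, round(dist, 2))
```

The two candidate targets rank in opposite order under the two metrics: (6,2) costs 1 flip but a distance of 4, while (4,4) costs 4 flips but a distance of about 2.83.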