Okay, the outside view analogy makes sense. If I were to explain it to myself, I would say:
Locally, an action may seem good, but from the outside view, drawing on similar instances from my past or from other people like me, that same action may seem bad.
In the same way, an agent can use the outside view to check whether its action is good by drawing on similar instances. But how does it get this outside-view information? Assuming the agent has a model of human interactions and a list of "possible values for humans", it can simulate different people with different values and see how well it has learned their values by the time it is considering a specific action.
Consider the action "disable the off-switch". The agent simulates itself interacting with Bob, who values long walks on the beach. At the point where it would consider the disable action, it checks its simulated self's prediction of Bob's value. If the prediction is "Bob likes long walks on the beach", that's an update toward doing the disable action. If it's a different prediction, that's an update against the disable action.
Repeat 100 times for different people with different values and you’ll have a better understanding of which actions are safe or not. (I think a picture of a double-thought bubble like the one in this post would help explain this specific example.)
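In case it helps, here is a toy sketch of that check in Python. Everything in it is an assumption for illustration: the `simulate_interaction` stand-in, the list of possible values, the 80% learning-accuracy figure, and the 100-trial loop are mine, not anything from the post.

```python
import random

# Purely illustrative: stand-ins for the agent's model and the value list.
POSSIBLE_VALUES = ["long walks on the beach", "gardening", "chess", "cooking"]

def simulate_interaction(true_value: str) -> str:
    """Stand-in for the agent simulating itself interacting with a human.

    Returns the simulated agent's prediction of the human's value at the
    moment it would consider the action. A real agent would run its model
    of human interactions here; we just add noise to represent imperfect
    value learning (assumed 80% chance the value was learned correctly).
    """
    if random.random() < 0.8:
        return true_value
    return random.choice(POSSIBLE_VALUES)

def outside_view_check(action: str, n_trials: int = 100) -> float:
    """Fraction of simulated people whose values the agent predicted
    correctly by the time it considers `action`
    (e.g. "disable the off-switch")."""
    correct = 0
    for _ in range(n_trials):
        true_value = random.choice(POSSIBLE_VALUES)   # e.g. Bob's value
        predicted = simulate_interaction(true_value)  # simulated self's prediction
        if predicted == true_value:
            correct += 1  # update toward doing the action
        # otherwise: update against doing the action
    return correct / n_trials

if __name__ == "__main__":
    score = outside_view_check("disable the off-switch")
    print(f"Outside-view confidence in the action: {score:.2f}")
```

The point of the sketch is just the loop structure: many simulated people, each with a different value, and the action only looks safe if the simulated self usually gets their value right by the time the action comes up.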