what it means to have preferences in a way that doesn’t give rise to consequentialist behaviour. Having (unstable) preferences over “what happens 5 seconds after my current action” sounds to me like not really having preferences at all. The behaviour is not coherent enough to be interpreted as preferring some things over others, except in a contrived way.
Oh, sorry, I’m thinking of a planning agent. At any given time it considers possible courses of action, and decides what to do based on “preferences”. So “preferences” are an ingredient in the algorithm, not something to be inferred from external behavior.
That said, if someone “prefers” to tell people what’s on their mind, or if someone “prefers” to hold their fork with their left hand … I think those are two examples of “preferences” in the everyday sense of the word, but they’re not expressible as a rank-ordering of the state of the world at a future date.
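To make that concrete, here is a minimal sketch of a planning agent where “preferences” are an explicit ingredient of the algorithm. Everything in it (`Plan`, `choose_plan`, the toy scoring functions, the action names) is a hypothetical illustration I made up, not something from the post. One preference scores only the predicted final state; the other scores a feature of the course of action itself, which can’t be read off the final state.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Plan:
    """A candidate course of action plus the agent's own prediction of where it leads."""
    actions: tuple[str, ...]
    predicted_final_state: str


def outcome_preference(plan: Plan) -> float:
    # Hypothetical preference over the future world state only.
    return 1.0 if plan.predicted_final_state == "meal_eaten" else 0.0


def action_preference(plan: Plan) -> float:
    # Hypothetical preference over the course of action itself
    # ("hold the fork with the left hand"), regardless of the end state.
    return 1.0 if "hold_fork_left_hand" in plan.actions else 0.0


def choose_plan(
    plans: Sequence[Plan],
    preferences: Sequence[Callable[[Plan], float]],
) -> Plan:
    # The planner scores each plan as the agent understands it, then picks the best.
    return max(plans, key=lambda p: sum(pref(p) for pref in preferences))


plans = [
    Plan(("hold_fork_right_hand", "eat"), "meal_eaten"),
    Plan(("hold_fork_left_hand", "eat"), "meal_eaten"),
]

# Both plans end in the same state, so the outcome preference cannot tell them apart.
tie_on_outcome = outcome_preference(plans[0]) == outcome_preference(plans[1])

# Adding the action-level preference breaks the tie in favor of the left hand.
chosen = choose_plan(plans, [outcome_preference, action_preference])
```

The point of the toy example is only that the two plans are identical as far as any ranking of final states is concerned, yet the agent still has a definite preference between them.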
If you try to incentivize corrigibility via a recognizer for being corrigible, the making-plans-that-actually-work part of the AI effectively just adds fooling the recognizer to its requirements for actually working.
Instead of “desire to be corrigible”, I’ll switch to something more familiar: “desire to save the rainforest”.
Let’s say my friend Sally is “trying to save the rainforest”. There’s no “save the rainforest detector” external to Sally, which Sally is trying to satisfy. Instead, the “save the rainforest” concept is inside Sally’s own head.
When Sally decides to execute Plan X because it will help save the rainforest, that decision is based on the details of Plan X as Sally herself understands it.
Let’s also assume that Sally’s motivation is ego-syntonic (which we definitely want for our AGIs): In other words, Sally wants to save the rainforest and Sally wants to want to save the rainforest.
Under those circumstances, I don’t think saying something like “Sally wants to fool the recognizer” is helpful. That’s not an accurate description of her motivation. In particular, if she were offered an experience machine or brain-manipulator that could make her believe that she has saved the rainforest, without all the effort of actually saving the rainforest, she would emphatically turn down that offer.
So what can go wrong?
Let’s say Sally and Ahmed are working at the same rainforest advocacy organization. They’re both “trying to save the rainforest”, but maybe those words mean slightly different things to them. Let’s quiz them with a list of 20 weird out-of-distribution hypotheticals:
“If we take every tree and animal in the rainforest and transplant it to a different planet, where it thrives, does that count as ‘saving the rainforest’?”
“If we raze the rainforest but run an atom-by-atom simulation of it, does that count as ‘saving the rainforest’?”
Etc.
Presumably Sally and Ahmed will give different answers, and this could conceivably shake out as Sally taking an action that Ahmed strongly opposes or vice-versa, even though they nominally share the same goal.
You can describe that as “Sally is narrowly targeting the save-the-rainforest-recognizer-in-Sally’s-head, and Ahmed is narrowly targeting the save-the-rainforest-recognizer-in-Ahmed’s-head, and each sees the other as Goodhart’ing a corner-case where their recognizer is screwing up.”
That’s definitely a problem, and it’s the kind of thing I was talking about under “Objection 1” in the post, where I noted the necessity of out-of-distribution detection systems, perhaps related to Stuart Armstrong’s “model splintering” ideas.
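Here is a toy sketch of the Sally/Ahmed situation, treating each person’s concept as a “recognizer” over scenarios and flagging the scenarios where the two disagree. The feature names, the two recognizer functions, and the particular answers I assign to Sally and Ahmed are all arbitrary assumptions for illustration. This is not Stuart Armstrong’s model-splintering method or any real out-of-distribution detector, just the simplest version of “the concepts agree on ordinary cases and come apart on weird ones”.

```python
from typing import Dict

Scenario = Dict[str, bool]


def sally_recognizer(s: Scenario) -> bool:
    # Assumed for illustration: Sally only cares that the organisms thrive.
    return s["organisms_thriving"]


def ahmed_recognizer(s: Scenario) -> bool:
    # Assumed for illustration: Ahmed also requires a physical forest on Earth.
    return s["organisms_thriving"] and s["on_earth"] and s["physical"]


scenarios: Dict[str, Scenario] = {
    "protect the existing forest": {
        "organisms_thriving": True, "on_earth": True, "physical": True,
    },
    "clear-cut the forest": {
        "organisms_thriving": False, "on_earth": True, "physical": True,
    },
    "transplant to another planet": {
        "organisms_thriving": True, "on_earth": False, "physical": True,
    },
    "raze and simulate atom-by-atom": {
        "organisms_thriving": True, "on_earth": True, "physical": False,
    },
}

# Scenarios where the two recognizers give different verdicts.
disagreements = [
    name
    for name, s in scenarios.items()
    if sally_recognizer(s) != ahmed_recognizer(s)
]
```

With these made-up recognizers, the two ordinary cases get the same verdict from both people and only the two weird hypotheticals land in `disagreements`, which is the kind of signal an out-of-distribution check would be looking for.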
Thanks, this is helpful!
Does that help?