Building off of your other answer here, I think I can imagine at least one situation where your terminal values would get vetoed. Imagine that you discovered, to your horror, that all of your actions up until now have been subconsciously motivated by a drive to bring about doomsday. Causing the death of everyone is actually your terminal goal, and you were ignorant of it. Furthermore, your subconsciously motivated actions have actually been effective at bringing the world closer and closer to its demise. Your only way to avert this now is to throw yourself immediately out of the window to your death, thereby thwarting your own terminal goal.
Would you do this?
Dr. Jekyll and Mr. Hyde are the same person, but are they really the same agent?
Suppose I did end up walking out the window. It would be the wrong action for me to take: I would have been foiled by a bunch of bad heuristics and biases I’d internalized over the course of my omnicidal plot. There would be no agent corresponding to me whose values would be satisfied by this.
It would be not unlike, say, manipulating and gaslighting someone until they decide to kill their entire family. This would be against the values the person would claim as their “truer” ones, but in the moment, under the psychological pressure and the influence of some convincing lies, it’d (incorrectly) feel to them like a good idea.