Frankfurt-style counterexamples for definitions of optimization
In “Bottle Caps Aren’t Optimizers”, I wrote about a type of definition of optimization that says system S is optimizing for goal G iff G has a higher value than it would if S didn’t exist or were randomly scrambled. I argued against these definitions by providing examples of systems that satisfy the criterion but are not optimizers. But today, I realized that I could repurpose Frankfurt cases to get examples of optimizers that don’t satisfy this criterion.
A Frankfurt case is a thought experiment designed to disprove the following intuitive principle: “a person is morally responsible for what she has done only if she could have done otherwise.” Here’s the basic idea: suppose Alice is considering whether or not to kill Bob. Upon consideration, she decides to do so, takes out her gun, and shoots Bob. But unbeknownst to her, a neuroscientist had implanted a chip in her brain that would have forced her to shoot Bob if she had decided not to. That said, the chip didn’t activate, because she did decide to shoot Bob. The idea is that she’s morally responsible, even though she couldn’t have done otherwise.
Anyway, let’s do this with optimizers. Suppose I’m playing Go, thinking about how to win—imagining what would happen if I played various moves, and playing moves that make me more likely to win. Further suppose I’m pretty good at it. You might want to say I’m optimizing my moves to win the game. But suppose that, unbeknownst to me, behind my shoulder is famed Go master Shin Jinseo. If I start playing really bad moves, or suddenly die or vanish etc, he will play my moves, and do an even better job at winning. Now, if you remove me or randomly rearrange my parts, my side is actually more likely to win the game. But that doesn’t mean I’m optimizing to lose the game! So this is another way such definitions of optimizers are wrong.
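To make the failure concrete, here is a minimal sketch of the counterfactual criterion applied to the Go scenario. The win probabilities are made-up numbers, and the function names are hypothetical; nothing here comes from “Bottle Caps Aren’t Optimizers” beyond the bare comparison “G is higher with S than without S”.

```python
def win_probability(me_present: bool, p_me: float = 0.7, p_backup: float = 0.95) -> float:
    """Chance that my side wins the game. Both probabilities are illustrative assumptions."""
    # If I am present and playing normally, I make the moves myself.
    # If I am removed or scrambled, the stronger backup player takes over.
    return p_me if me_present else p_backup


def counterfactual_criterion(value_with: float, value_without: float) -> bool:
    """Criterion under discussion: S optimizes for G iff G is higher with S than without S."""
    return value_with > value_without


with_me = win_probability(me_present=True)
without_me = win_probability(me_present=False)

# Goal "my side wins": the criterion says I am not optimizing for it.
optimizes_for_winning = counterfactual_criterion(with_me, without_me)
# Goal "my side loses": the criterion says I am optimizing for it.
optimizes_for_losing = counterfactual_criterion(1 - with_me, 1 - without_me)

print(optimizes_for_winning, optimizes_for_losing)
```

Under these assumed numbers the criterion gets both verdicts backwards, even though nothing about my own deliberation changed when the backup player was added.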
That said, other definitions treat this counter-example well. E.g. I think the one given in “The ground of optimization” says that I’m optimizing to win the game (maybe only if I’m playing a weaker opponent).
Interesting, but I’m not sure how successful the counterexample is. After all, if your terminal goal in the whole environment was truly for your side to win, then it makes sense to understand anything short of letting Shin play as a shortcoming of your optimization (with respect to that goal). Of course, even in the case where that’s your true goal and you’re committing a mistake (which is not common), we might want to say that you are deploying a lot of optimization with respect to the different goal of “winning by yourself”, or “having fun”, which is compatible with failing at another goal.
This could be taken to absurd extremes (whatever you’re doing, I can understand you as optimizing really hard for doing exactly what you’re doing), but the natural way around that is to require your imputed goals to be simple (in some background language or ontology, like that of humans). This is exactly the approach mathematically taken by Vanessa in the past (the equation at 3:50 here).
I think this “goal relativism” is fundamentally correct. The only problem with Vanessa’s approach is that it’s hard to account for the agent being mistaken (for example, you not knowing Shin is behind you).[1]
I think the only natural way to account for this is to see things from the agent’s native ontology (or compute probabilities according to their prior), however we might extract those from them. So we’re unavoidably back at the problem of ontology identification (which I do think is the core problem).
Say Alice has lived her whole life in a room with a single button. People from the outside told her pressing the button would create nice paintings. Throughout her life, they provided an exhaustive array of proofs and confirmations of this fact. Unbeknownst to her, this was all an elaborate scheme, and in reality pressing the button destroys nice paintings. Alice, liking paintings, regularly presses the button.
A naive application of Vanessa’s criterion would impute to Alice the goal of destroying paintings. To avoid this, we somehow need to integrate over all possible worlds Alice could find herself in, and realize that, when you are presented with an exhaustive array of proofs and confirmations that the button creates paintings, it is on average more likely that the button creates paintings than that it destroys them.
But we face a decision. Either we fix a single prior for this and use it for all agents, in which case all agents with a different prior will look silly to us. Or we somehow try to extract the agent’s prior, and we’re back at ontology identification.
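As a rough illustration of why the choice of prior matters here, the sketch below does a single Bayesian update on “the button creates paintings” given the evidence Alice was shown. All probabilities are made-up assumptions and the function name is hypothetical; this is not Vanessa’s formalism, just Bayes’ rule.

```python
def posterior_creates(prior_creates: float,
                      p_evidence_if_creates: float,
                      p_evidence_if_destroys: float) -> float:
    """Bayes' rule for P(button creates paintings | evidence seen)."""
    numerator = prior_creates * p_evidence_if_creates
    denominator = numerator + (1 - prior_creates) * p_evidence_if_destroys
    return numerator / denominator


# Assumed likelihoods: the proofs are much more likely to be shown when the
# button really creates paintings than when it is part of an elaborate scheme.
P_EVIDENCE_IF_CREATES = 0.9
P_EVIDENCE_IF_DESTROYS = 0.05

# An agent starting from an even prior ends up fairly confident the button creates paintings.
even_prior_posterior = posterior_creates(0.5, P_EVIDENCE_IF_CREATES, P_EVIDENCE_IF_DESTROYS)

# An agent starting from a very suspicious prior sees the same evidence and still doubts it.
suspicious_prior_posterior = posterior_creates(0.01, P_EVIDENCE_IF_CREATES, P_EVIDENCE_IF_DESTROYS)

print(even_prior_posterior > 0.5, suspicious_prior_posterior > 0.5)
```

With the same evidence, pressing the button looks sensible for a painting-lover under the first prior and mistaken under the second, which is why judging every agent by one fixed prior makes agents with other priors look silly.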
(Disclaimer: This was the SOTA understanding a year ago; I’m unsure whether it still is.)