Hmm, let me try again to explain what I’m trying to get at.
Let’s say you’re a prison architect, building a prison for three prisoners:
Prisoner Alice is 100% maximally obsessed with breaking out of prison. She eats and sleeps and works and socializes precisely to the extent that these are instrumentally useful for breaking out of prison. She thinks hard and creatively about breakout strategies during every free moment.
Prisoner Carol does not treat breaking out as an explicit goal, although if she happened upon an open door, she would absolutely walk out. She spends her days reading books from the prison library, socializing, working, etc.
Prisoner Bob is somewhere between the two. For 23 hours a day, he has various pursuits, like Carol. But for a total of 1 hour a day, he works on the project of breaking out of prison. (For a total of ~30 minutes scattered through the day, he does opportunistic information-gathering—he watches guard behavior, he explores the prison grounds, etc. Then, for the 30 minutes before he goes to sleep each night, he sits quietly in his cell, and thinks hard and creatively about how to break out.)
Here’s my question: “Is Bob more like Alice or more like Carol?”
From a psychological perspective, Bob is similar to Carol, while Alice is kinda nuts, even cartoonish. Bob and Carol are doing the same range of activities, for the same reasons, for 23 hours a day (96% of the time). They both make friends, they both exercise, they both read books, they both complain about the weather. Alice has a simple utility function, whereas Bob and Carol are actual plausible humans.
From a prison safety and security perspective, Bob and Alice are very similar to each other, and wildly different from Carol. Alice and Bob are Very High Breakout Risk, Carol is Low Breakout Risk. If you build the prison with 5 layers of security, each of which is 90% reliable, Carol will be kept locked up with 99.999% certainty, whereas either Alice or Bob will engineer a situation where all 5 layers break at once. Alice will come up with good breakout plans sooner than Bob, but not exponentially sooner; it’s a difference of degree not kind, unlike Bob versus Carol.
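To make the 99.999% figure concrete, here’s a minimal back-of-the-envelope sketch (mine, not from the original comment, and in Python just for illustration). It assumes the 5 layers fail independently for a non-adversarial prisoner like Carol, whereas an adversarial prisoner like Alice or Bob deliberately hunts for a moment when all the layers fail together, so the independence assumption doesn’t apply to them:

```python
# Containment odds for a non-adversarial prisoner, assuming independent layers.
layers = 5
reliability = 0.90  # each layer stops an escape 90% of the time

# Carol only escapes if every layer happens to fail at once, by chance.
p_carol_escape = (1 - reliability) ** layers
print(f"P(Carol escapes) = {p_carol_escape:.5f}")          # 0.00001
print(f"P(Carol stays put) = {1 - p_carol_escape:.5f}")    # 0.99999

# Alice and Bob are different in kind: they search for (or engineer) a
# correlated failure, so their escape odds aren't (1 - reliability)**layers;
# over enough time, they approach 1.
```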
The moral I’m trying to convey is that, when we talk about mesa-optimization, the important question is “Can we correctly explain any non-infinitesimal subset of the system’s behavior as explicit optimization for a misaligned goal?”, not “Can we correctly explain 100% of the system’s behavior as explicit optimization for a misaligned goal?”
The argument for risk doesn’t depend on the definition of mesa optimization. I would state the argument for risk as “the AI system’s capabilities might generalize without its objective generalizing”, where the objective is defined via the intentional stance. Certainly this can be true without the AI system being 100% a mesa optimizer as defined in the paper. I thought this post was suggesting that we should widen the term “mesa optimizer” so that it includes those kinds of systems (the current definition doesn’t), so I don’t think you and Matthew actually disagree.
It’s important to get this right, because solutions often do depend on the definition. Under the current definition, you might try to solve the problem by developing interpretability techniques that can find the mesa objective in the weights of the neural net, so that you can make sure it is what you want. However, I don’t think this would work for other systems that are still risky, such as Bob in your example.