I have a few silly, absurd questions related to mesa-optimizers and mesa-controllers. I’m asking them to get a fresh look at the problem of inner alignment. I want to get a better grip on what basic properties of a model make it safe.
Question 1. How do we know that the theory of Quantum Mechanics (QM) is not plotting to kill humanity?
It’s a model, so it could be unsafe just like an AI.
QM is not an agent, but its predictions strongly affect humanity. Oracles can be dangerous.
QM is highly interpretable, so we can check that it’s not doing internal search. Or can we? Maybe it does search in some implicit way? Eliezer brought up this possibility: if you prohibit an AI from modeling its programmers’ psychology, the AI might start modeling something seemingly irrelevant which is actually equivalent to modeling the programmers’ psychology.
“Maybe the AI reasons about certain very complicated properties of the material object on the pedestal… in fact, these properties are so complicated that they turn out to contain implicit models of User2’s psychology.”
Even if QM doesn’t do search in any way… maybe it was still optimized to steer humanity towards disaster?
Or maybe QM is “grounded” in some special way (e.g. it’s easy to split into parts and verify that each part is correct), so we’re very confident that it does physics and only physics?
Question 2. A crazier version of the previous question: how do we know that Peano arithmetic isn’t plotting to destroy humanity? How do we know that the game of chess isn’t plotting to end humanity?
Maybe Peano arithmetic contains theorems such that trying to prove them steers the real world towards disaster. How can we know, and when should we care?
Question 3. Imagine you came up with a plan to achieve your goals. You did it yourself. How do you know that this plan is not optimizing for your ruin?
Humans do go insane and fall into addictions. But not always. So why are our thoughts relatively safe for us? Why doesn’t every new thought or experience turn into an addiction that wipes out your previous personality?
Question 4. You’re the Telepath. You can read the mind of the Killer. The Killer can reason about some things which aren’t comprehensible to you, but otherwise your cognition is very similar. Can you always tell if the Killer is planning to kill you?
Here are some thoughts the Killer might think:
1. “I need to do <something incomprehensible> so the Telepath dies.”
2. “I need to get the Telepath to eat this food with <something incomprehensible> in it.”
3. “I need to do <something incomprehensible> without any comprehensible reason.”
With thought 1 we can understand the outcome, and that’s all that matters. With thought 2 we can still tell that something dodgy is going on. Even with thought 3 we can see that the Killer is trying to make his reasoning illegible. Maybe the Killer can never deceive us if the incomprehensible concepts he’s thinking about are “embedded” into the comprehensible concepts?