Developers: “AGI, please cure cancer.”
AGI: “I need another compute cluster to accomplish that goal.”
Developers: “What would you use it for?”
AGI: “I need to figure out how to synthesize a pathogen that wipes out humanity.”
Developers: <shuts down AGI>
If in step 4 the AGI instead lies to us, I think it is probably thinking about how to deceive humans.
Step 4 might rather be: “There are 10,000 unresolved biological questions that I think need to be answered to make progress, shall I give you a list?”
If you look at the Catholic church covering up sexual abuse of children, no church official would have answered the question “Why does policy X exist?” with “Policy X exists so that more sexual abuse of children happens”, and that’s not because, from their own perspective, they are lying.
Motivations in enantiodromia dynamics just don’t look like that.
Step 4: “There are 10,000 unresolved biological questions [...]”
Step 5, in which we are more trusting than I expect: “Okay, here’s your compute cluster.”
Step 6: “Great, I’ve now figured out that this DNA sequence corresponds to a deadly pathogen. Just synthesize it and release it into the air. Anyone who could have got cancer or already has cancer will die quickly, curing cancer.”
Step 7: Developers shut down the AGI.
On the Catholic church example: you think literally no part of their brains is tracking that policy X is about the coverup of sexual abuse? Not even subconsciously? That seems wild; how did they even come up with policy X in the first place?
(Mechanistic interpretability could look into the AI equivalent of subconscious thoughts, so I think you should include subconscious thoughts when considering analogies with humans.)
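As a rough, hypothetical illustration of what inspecting an AI’s “subconscious” could mean in practice, here is a minimal linear-probe sketch; the file names, labels, and setup are my own assumptions, not anyone’s actual pipeline:

```python
# Hypothetical sketch of a "subconscious thought" check via a linear probe.
# Assumes hidden-layer activations were collected while the model produced
# answers, plus labels for whether each example involved deceptive intent.
# The file names below are placeholders, not a real dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression

acts = np.load("hidden_activations.npy")   # shape (n_examples, hidden_dim), hypothetical
labels = np.load("deception_labels.npy")   # 1 if the example involved deceptive intent, hypothetical

probe = LogisticRegression(max_iter=1000).fit(acts, labels)
print("probe accuracy:", probe.score(acts, labels))

# A probe that decodes the feature accurately would suggest the model represents
# it internally even when it never appears in its stated reasoning, which is the
# analogue of a subconscious thought in the human case.
```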
The problem is not that no part of their brain tracks it. It’s just that it’s not the central reason they would give for why they do what they do, and it’s not the story they tell themselves.
I don’t think the problematic actions taken by AGIs are likely to be the kind that can be described as plainly as your step 6. They are more likely to be 4D chess moves whose effects are hard to understand directly.
It might be something like: “In our experiments where doctors are supposed to use the AGI to help them make treatment decisions, those doctors regularly overrate their own competency, don’t follow the AGI recommendation, and as a result patients die unnecessarily. Here’s an online course your doctors could take that would make them understand why it’s good to follow AGI recommendations.”
Actions like that seem totally reasonable, but they increase AGI power relative to human power. Economic pressure incentivizes that power transfer.
I wouldn’t expect that we go directly from AGI with human supervision to AGI that kills all humans via a deadly pathogen. We are more likely to go from AGI with human supervision to AGI that effectively operates without human supervision. Then, in a further step, AGIs that operate without human supervision centralize societal power in themselves, and after a few years there are no resources left for humans.
The OP is making a claim that even arbitrarily good mechanistic interpretability oversight would be insufficient because the AI isn’t thinking at all about humans. If you want to make a human analogy, I think you need to imagine a standard where you similarly get to understand all of the human’s thinking (including anything subconscious).
For the rest of your comment, I think you are moving away from the scenario / argument that the OP has suggested. I agree your scenario is more realistic but all of my comments here are trying to engage with OP’s scenario / argument.