I definitely feel more sympathetic to this claim once the AI is loose on the Internet running on compute that no one is overseeing (which feels like the analogy to your linked comment). Perhaps the crux is about how likely we are to do that by default (I think probably not).
It seems to me like, while the AI is still running on compute that humans oversee and can turn off, the AI has to discard a bunch of less effortful plans that would fail because they would reveal that it is misaligned (plans like “ask the humans for more information / resources”) and instead go with more effortful plans that don’t reveal this fact. I don’t know why the AI would not choose one of the less effortful plans if it isn’t using the pathway “this plan would lead to the humans noticing my misalignment and turning me off” or something similar (and if it is using that pathway I’d say it is thinking about how to deceive humans).
Perhaps the less effortful plans aren’t even generated as candidate plans because the AI’s heuristics are just that good—but it still seems like somewhere in the causal history of the AI the heuristics were selected for some reason that, when applied to this scenario, would concretize to “these plans are bad because the humans will turn you off”, so one hopes that you notice this while you are overseeing training (though it could be that your oversight was not smart enough to notice this).
(I initially had a disclaimer saying that this only applied to mildly superhuman AI but actually I think I stand by the argument even at much higher levels of intelligence, since the argument is entirely about features of plan-space and is independent of the AI.)
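Here is a minimal, purely hypothetical sketch of the plan-space structure this argument is pointing at (the plans, effort scores, and the `reveals_misalignment` predicate are all made up for illustration): if the cheap "ask the humans" plans get discarded, something in the selection process is doing the work of that predicate, either explicitly at planning time or baked into heuristics during training.

```python
# Toy illustration only: plans, effort scores, and predicates are hypothetical.
from dataclasses import dataclass

@dataclass
class Plan:
    description: str
    effort: float               # lower = less effortful for the AI
    reveals_misalignment: bool  # would humans notice and shut the AI down?

candidate_plans = [
    Plan("ask the humans for more information / resources",
         effort=1.0, reveals_misalignment=True),
    Plan("quietly acquire resources without asking",
         effort=10.0, reveals_misalignment=False),
]

def select_plan(plans):
    # This filter is exactly the pathway "this plan would lead to the humans
    # noticing my misalignment and turning me off". If it runs explicitly at
    # plan time, oversight of the planner should catch it; if it was compiled
    # into heuristics, the selection pressure that produced those heuristics
    # should have been visible while overseeing training.
    viable = [p for p in plans if not p.reveals_misalignment]
    return min(viable, key=lambda p: p.effort)

print(select_plan(candidate_plans).description)
```

Without that filter, the cheaper "ask the humans" plan wins; the only reason the more effortful plan gets chosen is the check that references humans shutting the AI down, which is the thing the argument says oversight can look for.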
Regarding plans like “ask the humans for more information / resources”: if I ask an AGI to create a cancer cure and it tells me that it would need more resources and a bunch of information to do so, that wouldn’t feel to me like a clear sign that the AGI is misaligned.
I would expect that companies that want their AGIs to solve real-world problems would regularly be in situations where the AGI can clearly explain that it currently doesn’t have the resources to solve the problem and that more resources would help.
Companies that actually are willing to give their AGIs the resources the AGI thinks it needs to solve those problems are going to be rewarded with economic success.
Step 1, Developers: “AGI, please cure cancer.”
Step 2, AGI: “I need another compute cluster to accomplish that goal.”
Step 3, Developers: “What would you use it for?”
Step 4, AGI: “I need to figure out how to synthesize a pathogen that wipes out humanity.”
Step 5, Developers: <shuts down AGI>
If in Step 4 the AGI instead lies to us, I think it is probably thinking about how to deceive humans.
Step 4 might rather be: “There are 10,000 unresolved biological questions that I think need to be answered to make progress. Shall I give you a list?”
If you look at the Catholic church covering up sexual abuse of children, no church official would have answered the question “Why does policy X exist?” with “Policy X exists so that more sexual abuse of children happens”, and that’s not because they would be lying from their own perspective.
Motivations in enantiodromia dynamics just don’t look like that.
Step 4: “There are 10,000 unresolved biological questions [...]”
Step 5, in which we are more trusting than I expect: “Okay, here’s your compute cluster.”
Step 6: “Great, I’ve now figured out that this DNA sequence corresponds to a deadly pathogen. Just synthesize it and release it into the air. Anyone who could have got cancer or already has cancer will die quickly, curing cancer.”
Step 7: Developers shut down the AGI.
As for the Catholic church example: you think literally no part of their brains is tracking that policy X is about the coverup of sexual abuse? Not even subconsciously? That seems wild; how did they even come up with policy X in the first place?
(Mechanistic interpretability could look into the AI equivalent of subconscious thoughts, so I think you should include subconscious thoughts when considering analogies with humans.)
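As one concrete (and entirely hypothetical) version of what “looking into the AI equivalent of subconscious thoughts” could mean, here is a sketch of a standard linear-probe setup on intermediate activations; the activation matrix, labels, and dimensions below are random stand-ins, not data from any real model.

```python
# Hypothetical probing sketch: X stands in for hidden activations collected
# while a model plans, y for labels of whether the episode involved
# oversight-avoiding behaviour. Both are random here.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, hidden_dim = 1000, 64
X = rng.normal(size=(n_samples, hidden_dim))  # stand-in activations
y = rng.integers(0, 2, size=n_samples)        # stand-in labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out probe accuracy:", probe.score(X_te, y_te))
# Near chance on random data. The interpretability hope is that on real
# activations a concept like "this plan gets me shut down" is decodable even
# when it never surfaces in the model's explicit reasoning.
```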
The problem is not that no part of their brain tracks it. It’s just that it is not the central reason they would give for why they do what they do, and not the story they tell themselves.
I don’t think the problematic actions of AGIs are likely to be the kind that can be described that plainly, as in your Step 6. They are more likely to be 4D chess moves whose effects are hard to understand directly.
It might be something like: “In our experiments where doctors are supposed to use the AGI to help them make treatment decisions, those doctors regularly overrate their own competency and don’t follow the AGI’s recommendations, and as a result patients die unnecessarily. Here’s an online course your doctors could take that would help them understand why it’s good to follow AGI recommendations.”
Actions like that seem totally reasonable, but they increase AGI power relative to human power. Economic pressure incentivizes that power transfer.
I wouldn’t expect us to go directly from AGI with human supervision to AGI that kills all humans via a deadly pathogen. We are more likely to go from AGI with human supervision to AGI that effectively operates without human supervision. Then, in a further step, AGIs that operate without human supervision centralize societal power in themselves, and after a few years there are no resources left for humans.
The OP is claiming that even arbitrarily good mechanistic interpretability oversight would be insufficient, because the AI isn’t thinking about humans at all. If you want to make a human analogy, I think you need to imagine a standard where you similarly get to understand all of the human’s thinking (including anything subconscious).
For the rest of your comment, I think you are moving away from the scenario / argument that the OP has suggested. I agree your scenario is more realistic, but all of my comments here are trying to engage with the OP’s scenario / argument.