The problem is not that no part of their brain tracks it. It's just that it isn't the central reason why they do what they do, nor the story they tell themselves.
The OP is claiming that arbitrarily good mechanistic interpretability oversight would be insufficient because the AI isn't thinking about humans at all. If you want to make a human analogy, I think you need to imagine a standard where you similarly get to understand all of the human's thinking (including anything subconscious).
For the rest of your comment, I think you are moving away from the scenario / argument that the OP has suggested. I agree your scenario is more realistic, but all of my comments here are trying to engage with the OP's scenario / argument.