If I could assume things like “they are much better at reading my inner monologue than my non-verbal thoughts”, then I could create code words for prohibited things.
I could think in words they don’t know.
I could think in complicated concepts they haven’t understood yet, or in references to events or memories of mine that they don’t know about.
I could leave a part of my plans implicit, and only figure out the details later.
I could harm them through some action they wouldn’t recognize as harmful, so they might not be alarmed even if they caught me thinking about it. (Leaving a gas stove on with no flame.)
And then there are the more boring things: if you know more details about how the mind-reading works, you can try to defeat it. (Make the evil plans at night when they are asleep, or during a shift change, etc.)
(Also, I assume you know “Circumventing interpretability: How to defeat mind-readers”, but I’m mentioning it just in case.)