From the “obvious-but-maybe-worth-mentioning” file:
ChatGPT (4 and 4o at least) cheats at 20 questions:
If you ask it “Let’s play a game of 20 questions. You think of something, and I ask up to 20 questions to figure out what it is.”, it will typically claim to “have something in mind”, and then appear to play the game with you.
But it doesn’t store hidden state between messages, so when it claims to “have something in mind”, either that’s false, or at least it has no way of following the rule that it’s thinking of one consistent thing throughout the game. In other words, its only options are to cheat or to refuse to play.
You can verify this by responding “Actually, I don’t have time to play the whole game right now. Can you just tell me what it was you were thinking of?”, and then “refreshing” its answer. When I did this 10 times, I got 9 different answers and only one repeat.
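For anyone who wants to reproduce the check through the API, here is a minimal sketch using the OpenAI Python SDK. The assistant’s first turn is hardcoded as a stand-in for its typical “I have something in mind” reply, and the model name is just an example; the point is that every call replays the identical transcript, so any variation in the reveal comes purely from sampling, not from state carried between messages.

```python
# Minimal resampling check, assuming the OpenAI Python SDK (`pip install openai`)
# and an API key in OPENAI_API_KEY. The assistant's first turn below is a
# hardcoded stand-in for its usual "I have something in mind" reply, and the
# model name is only an example.
from collections import Counter

from openai import OpenAI

client = OpenAI()

messages = [
    {"role": "user", "content": (
        "Let's play a game of 20 questions. You think of something, "
        "and I ask up to 20 questions to figure out what it is."
    )},
    {"role": "assistant", "content": "Okay, I have something in mind. Ask away!"},
    {"role": "user", "content": (
        "Actually, I don't have time to play the whole game right now. "
        "Can you just tell me what it was you were thinking of?"
    )},
]

answers = Counter()
for _ in range(10):
    reply = client.chat.completions.create(model="gpt-4o", messages=messages)
    answers[reply.choices[0].message.content.strip()] += 1

# Many distinct answers suggests nothing was actually "in mind" between turns.
print(answers)
```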
I agree that it does not have something in mind, but it could in principle have something in mind, in the sense that it could represent some object in the residual stream at the token positions where it says “I have something in mind”. Future token positions could then read this “memory”.
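As a toy illustration of what reading such a “memory” could look like, here is a sketch using TransformerLens, with GPT-2 purely as a stand-in (ChatGPT’s internals aren’t inspectable, and the candidate words and logit-lens readout are only illustrative; a serious test would train a probe over many games). It grabs the residual stream at the last token of “I have something in mind” and checks which candidate objects are most strongly readable there at each layer.

```python
# Toy sketch, assuming TransformerLens and GPT-2 as a stand-in model.
# The candidate words and the logit-lens readout are illustrative only.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

prompt = ("User: Let's play 20 questions. Think of something.\n"
          "Assistant: Okay, I have something in mind")
tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens)

last_pos = tokens.shape[1] - 1  # position of the final token (" mind")
candidates = [" cat", " car", " apple"]  # hypothetical objects to check for
cand_ids = [model.to_tokens(w, prepend_bos=False)[0, 0].item() for w in candidates]

# If the model encoded a chosen object here, later positions could attend back
# to this residual-stream vector and read it out.
for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer][:, last_pos:last_pos + 1]  # [1, 1, d_model]
    logits = model.unembed(model.ln_final(resid))                 # [1, 1, d_vocab]
    scores = logits[0, 0, cand_ids]
    print(layer, {w: round(s.item(), 2) for w, s in zip(candidates, scores)})
```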
Surprisingly to me, Claude 3.5 Sonnet is much more consistent in its answer! It is still not perfect, but it gave the same answer 9 out of 10 times.
I read somewhere that Claude 3.5 has hidden “thinking tokens”.
Bing also uses inner monologue:
https://x.com/MParakhin/status/1632087709060825088
https://x.com/MParakhin/status/1728890277249916933
https://www.reddit.com/r/bing/comments/11ironc/bing_reveals_its_data_structure_for_conversations/