Just had an odd experience with ChatGPT that I’m calling “calendar blindsight”. I happened to ask it the date, and it got it right. Then I said, how could you get the date right? It agreed that this shouldn’t be possible, and then became unable to tell me the date… Maybe there’s some kind of daily training that updates the date, but it hasn’t been told that this training is occurring??
it’s in the system message.
Here’s the behavior I’m talking about:
https://pastebin.com/5xsAm91N
The language model made a bunch of false claims of incompetence because it has been trained by the reward model to claim to be incompetent. The time is in the system prompt; everything else was reasoning based on the time in the system prompt.
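In a typical chat-API setup, the serving layer just stamps the current date into the system message before the user's first turn, so the model reads it off like any other text. A minimal sketch of that pattern with the OpenAI Python client (the model name and the exact prompt wording are my assumptions here, not ChatGPT's actual system prompt):

```python
# Sketch: how a deployment can give the model "knowledge" of today's date
# simply by injecting it into the system message. Assumes the openai package
# is installed and OPENAI_API_KEY is set; the model name is illustrative.
from datetime import date

from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

system_prompt = (
    "You are a helpful assistant. "
    f"Current date: {date.today().isoformat()}"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name, for illustration only
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "What is today's date?"},
    ],
)
print(response.choices[0].message.content)
```

Under that assumption, getting the date right is ordinary reading comprehension, and the later failures are the model talking itself out of using text it can still see.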
Oh, the system information is in a hidden part of the prompt! OK, that makes sense.
It’s still intriguing that it talks itself out of being able to access that information. It doesn’t just claim incompetence; at the end it’s actually no longer willing or able to give the date.