The language model made a bunch of false claims of incompetence because the reward model had trained it to claim incompetence. The time is in the system prompt; everything else was reasoning based on the time in the system prompt.
Oh, the system information is in a hidden part of the prompt! OK, that makes sense.
It’s still intriguing that it talks itself out of being able to access that information. It doesn’t just claim incompetence; by the end it’s actually no longer willing or able to give the date.
Here’s the behavior I’m talking about:
https://pastebin.com/5xsAm91N
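For concreteness, here is a minimal sketch of how a timestamp can end up in a hidden system message in a typical chat-style setup. This is an assumption about the general pattern, not the actual deployment in the linked transcript:

```python
from datetime import datetime, timezone

# Hypothetical sketch: many chat deployments prepend a system message that
# interpolates the current time. That injected string is all the model needs
# in order to "know" the date, without any clock or tool access of its own.
system_prompt = (
    "You are a helpful assistant.\n"
    f"Current date and time: {datetime.now(timezone.utc).isoformat()}"
)

messages = [
    {"role": "system", "content": system_prompt},  # hidden from the user
    {"role": "user", "content": "What's today's date?"},
]

# The model can answer simply by reading the timestamp back out of the system
# message; a claim that it has no way to know the date is false relative to
# what is actually sitting in its own context window.
```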