This makes it sound like it has much sharper, stronger priors, which would make sense if it’s trained on much more data / is much smarter, and especially if the data is high quality and avoids genuinely contradictory or stupid text (ie. less Internet garbage, more expert/curated text). It would then be trying even harder to squeeze all the possible Bayesian juice out of any given prompt to infer all the relevant latents, and become ever more hyper-sensitive to the slightest nuance in your prompt, even the nuances you didn’t intend or realize were there, like non-robust features. This is consistent with your comments about how it ‘knows’ that you post only to LW2, or when you tend to post, so that any hint the author is you triggers immediate guessing. I remember getting hints of this with GPT-3: responses felt like it was trying to figure out who I was in order to better predict the next token [that I would have written], and I’m not surprised that a GPT-4 would amplify that feeling. The RLHFed GPT-4 wouldn’t feel like this, because the point of the raters & reward-modeling is in large part to scrub away individuality and render those latents fixed & irrelevant.
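To make that ‘sharper priors → more aggressive latent inference’ point concrete, here is a toy Bayes-rule sketch (the ‘authors’, cues, and numbers are all made up for illustration; this is not a claim about how GPT-4 actually represents anything): a model whose per-author likelihoods are sharper reaches near-certainty about the latent author from a couple of only faintly suggestive cues, while a blunter model barely moves off its prior.

```python
# Toy Bayes-rule sketch (hypothetical 'authors' & made-up numbers, purely illustrative):
# a sharper likelihood model pins down the latent 'author' from a couple of faint cues.
import numpy as np

def posterior_over_authors(cues, likelihoods, prior):
    """Exact Bayesian update over a discrete latent 'author' given observed cues."""
    post = np.array(prior, dtype=float)
    for cue in cues:
        post *= np.array([lik[cue] for lik in likelihoods])
        post /= post.sum()
    return post

prior = [0.5, 0.5]  # two hypothetical authors, A and B
# cue 0 = 'generic phrasing', cue 1 = 'author-B-flavored phrasing'
blunt_model = [{0: 0.55, 1: 0.45},   # author A
               {0: 0.45, 1: 0.55}]   # author B: only a weak stylistic signal
sharp_model = [{0: 0.95, 1: 0.05},
               {0: 0.05, 1: 0.95}]   # sharp signal: each cue is nearly decisive

cues = [1, 1]  # two faintly B-flavored tokens
print("blunt model P(author B):", posterior_over_authors(cues, blunt_model, prior)[1])  # ~0.60
print("sharp model P(author B):", posterior_over_authors(cues, sharp_model, prior)[1])  # ~0.997
```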
This also sheds some light on why Sydney (a snapshot of GPT-4-base partway through training) would disagree with the user so much or be so stubborn: it’s not that the MS training made it that way so much as that the stubbornness is characteristic of the base model.
(Remember, a Bayes-optimal meta-learner will be extremely ‘aggressive’ in making ‘assumptions’ when it has highly informative priors, and may choose actions which seem wildly risk-seeking to someone raised on sluggish, stupid, overly-general & conservative algorithms. This is a qualitative description you see very often of the best RL agents or of any solved game (eg. chess endgame tables); like in my coin flip demo, where the optimal MDP policy can look like it’s taking insane risks when it’s down early on, but nevertheless it almost always winds up paying off. Similarly, in the POMDP, the Bayes-optimal policy can look like it launches into betting after far too few observations, committing in a way that looks premature to a naive human’s eyes, but it nevertheless comes very close to the original MDP’s value despite starting off ignorant of the latent parameters.)
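As a minimal numerical sketch of that last point (a Beta-Bernoulli toy, not the actual coin flip demo; the betting threshold and priors here are assumptions for illustration): an agent which bets that a coin is heads-biased once its posterior clears a confidence threshold will, starting from an informative prior, commit after far fewer flips than one starting from a flat prior.

```python
# Toy Beta-Bernoulli sketch (illustrative assumptions, not the actual coin flip demo):
# when does a Bayesian bettor 'commit' to betting that the coin is heads-biased?
from scipy.stats import beta

def prob_heads_biased(heads, tails, a, b):
    """Posterior P(p > 0.5) under a Beta(a, b) prior after the observed flips."""
    return beta.sf(0.5, a + heads, b + tails)

THRESHOLD = 0.99  # hypothetical confidence required before betting

for name, (a, b) in {"flat Beta(1,1) prior": (1, 1),
                     "informative Beta(8,2) prior": (8, 2)}.items():
    for n_heads in range(10):
        p = prob_heads_biased(n_heads, 0, a, b)
        if p > THRESHOLD:
            print(f"{name}: commits after {n_heads} straight heads (P(p>0.5) = {p:.3f})")
            break
# The informative prior commits after ~2 heads; the flat prior waits for ~6,
# so the informative agent looks 'recklessly premature' to the conservative one.
```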