So we’ve described g4b’s latent space as being less “smooth” than cd2 and other base models’, and more sensitive to small changes in the prompt, but I think that description doesn’t fully capture how it feels more… epistemically agentic, or something like that.
Where if it believes that the prompt implies something, or doesn’t imply something, it’s hard to just curate/drop superficially contradictory evidence into its context to put it on another track
with g4b I am sometimes unable to make specific outcomes happen with just curation, even when they seem latently possible to me, whereas I could basically always do this with other base models
can’t just rely on chaining directed noise to land you in arbitrary places, because there’s less noise, and if you do put something improbable according to its prior into the prompt, it doesn’t just go along with it
slightly like interacting with mode collapsed models sometimes (in fact it often becomes legit mode collapsed if you prompt it with text by a mode collapsed generator like an RLHF model or uncreative human!), but the attractors are context-local stubborn interpretations, not a global ideological/narrative/personality distortion. and often, but not always, I think it is basically right in its epistemic stubbornness upon inspection of the prompt
this does make it harder to control, but mostly affects lazy efforts
if I am willing to put in effort I think there are few if any coherent targets I could not communicate / steer it towards within a reasonable difficulty bound
This makes it sound like it has much sharper, stronger priors, which would make sense if it’s trained on much more data / is much smarter, and especially if the data is high quality and avoids genuinely contradictory or stupid text (ie. less Internet garbage, more expert/curated text). It would then be trying even harder to squeeze all the possible Bayesian juice out of any given prompt to infer all the relevant latents, and become ever more hyper-sensitive to the slightest nuance in your prompt—even the nuances you didn’t intend or realize were there, like non-robust features. This is consistent with your comments about how it ‘knows’ you are posting only to LW2 or when you’re posting, and so any hint of it being you triggers immediate guessing. I remember with GPT-3 getting hints of how responses felt like it was trying to figure out who I was to better predict the next token [that I would have written], and I’m not surprised if a GPT-4 would amplify that feeling. The RLHFed GPT-4 wouldn’t feel like this because the point of the raters & reward-modeling is in large part to scrub away individuality and render those latents fixed & irrelevant.
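A toy illustration of that ‘Bayesian juice’ point (made-up numbers, nothing measured from any real model): a handful of stylistic cues, each only weak evidence on its own, can combine under Bayes’ rule into a confident guess about a latent like authorship, even against a low base rate.

```python
# A toy sketch with made-up numbers (not measured from any real model) of how
# a few weak stylistic cues combine, under a naive-Bayes independence
# assumption, into a confident inference about a latent such as "who wrote
# this prompt", despite a base rate that strongly favors the alternative.

def posterior(prior, cues):
    """Posterior over hypotheses after observing independent cues (Bayes' rule)."""
    unnorm = dict(prior)
    for cue in cues:                      # cue[h] = P(cue | hypothesis h)
        for h in unnorm:
            unnorm[h] *= cue[h]
    z = sum(unnorm.values())
    return {h: v / z for h, v in unnorm.items()}

# Hypothetical latent: was the prompt written by "gwern" or a "generic" poster?
prior = {"gwern": 0.05, "generic": 0.95}  # base rate strongly favors "generic"
cues = [
    {"gwern": 0.9, "generic": 0.3},       # idiosyncratic 'eg.' punctuation style
    {"gwern": 0.8, "generic": 0.2},       # links a gwern.net page
    {"gwern": 0.7, "generic": 0.1},       # long parenthetical about Bayes-optimality
]

print(posterior(prior, cues))             # roughly 0.8 on "gwern" despite the 5% prior
```

Each cue here is only a 3:1 to 7:1 likelihood ratio, yet the three together overturn 19:1 prior odds against.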
This also sheds some light on why Sydney (a snapshot of GPT-4-base partway through training) would disagree with the user so much or be so stubborn. It’s not that the MS training was responsible; the stubbornness is more characteristic of the base model.
(Remember, a Bayes-optimal meta-learner will be extremely ‘aggressive’ in making ‘assumptions’ when it has highly informative priors, and may choose actions which seem wildly risk-seeking to someone raised on sluggish stupid overly-general & conservative algorithms. This is a qualitative description you see very often of the best RL agents or any solved game (eg. chess endgame tables); like in my coin flip demo, where the optimal MDP policy can look like it’s taking insane risks when it’s down early on, but nevertheless, it almost always winds up paying off. Similarly, in the POMDP, the Bayes-optimal policy can look like it launches into betting after far too few observations, committing prematurely to a naive human’s eyes, but nevertheless approaching very closely the original MDP’s value despite starting off ignorant of the latent parameters.)
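A minimal sketch in that spirit (not the original coin-flip demo; the 60% coin, $50 payout cap, 20 rounds, and whole-dollar bets are all illustrative assumptions): backward induction over (wealth, rounds left) recovers the exact optimal MDP policy for expected capped terminal wealth, so you can inspect how large a fraction of its wealth it risks when far below the cap versus near it.

```python
# A minimal sketch of a simplified coin-flip betting MDP (illustrative
# parameters, not gwern's exact demo). Backward induction over
# (wealth, rounds left) gives the exact optimal policy for expected
# capped terminal wealth.

from functools import lru_cache

P_HEADS = 0.6   # known coin bias -- the MDP case, latent parameter observed
CAP     = 50    # payout cap on terminal wealth
ROUNDS  = 20    # number of flips
START   = 10    # starting wealth in whole dollars

@lru_cache(maxsize=None)
def value(wealth: int, rounds_left: int) -> float:
    """Expected capped terminal wealth under the optimal betting policy."""
    if rounds_left == 0 or wealth == 0:
        return wealth
    best = 0.0
    for bet in range(wealth + 1):   # any whole-dollar bet up to current wealth
        ev = (P_HEADS * value(min(wealth + bet, CAP), rounds_left - 1)
              + (1 - P_HEADS) * value(wealth - bet, rounds_left - 1))
        best = max(best, ev)
    return best

def best_bet(wealth: int, rounds_left: int) -> int:
    """An optimal bet at this state (the smallest one, if several tie)."""
    if rounds_left == 0 or wealth == 0:
        return 0
    evs = [(P_HEADS * value(min(wealth + b, CAP), rounds_left - 1)
            + (1 - P_HEADS) * value(wealth - b, rounds_left - 1), b)
           for b in range(wealth + 1)]
    top = max(ev for ev, _ in evs)
    return min(b for ev, b in evs if abs(ev - top) < 1e-9)

if __name__ == "__main__":
    print(f"value of the game from ${START}: {value(START, ROUNDS):.2f}")
    # Compare the optimal first bet at low vs. high wealth:
    for w in (2, 5, 10, 25, 40):
        b = best_bet(w, ROUNDS)
        print(f"wealth ${w:>2}: optimal first bet ${b:>2} ({b / w:.0%} of wealth)")
```

The POMDP variant would replace the fixed P_HEADS with a posterior (e.g. a Beta distribution) updated after each flip; the point above is that the Bayes-optimal version commits to betting hard after only a few observations and still approaches the MDP’s value.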