I wonder if some people here have had a chance to play with base-GPT-4 (access is granted very selectively, for research purposes) and would not mind sharing some of their impressions?
I know that some people have been playing with it, but I’ve never seen a discussion of impressions and lessons learned from that. And I know that it is quite nontrivial to get access to this model, but that some access has been given.
I think it would be super-interesting for many people here to hear this kind of conversation...
Here is a scattering of qualitative impressions, drawn mostly from Discord messages. I’ll write something more tailored for external communication in the future.
I am still awaiting permission from OpenAI to share outputs from the GPT-4 base model.
Jargon key:
cd2 = code-davinci-002, the GPT-3.5 base model
g4b = GPT-4 base model
Reflections following my first substantial interaction with the model:
On “truesight” (ability to infer things about the user / latent variables behind the prompt)
On prompting GPT-4 base and its sensitivity to anomalies and incoherence
On potential insight into what caused Bing’s “madness”
On the “roughness” of GPT-4 base’s latent space
quote from Gaspode:
another thing I wrote yesterday:
This makes it sound like it has much sharper, stronger priors, which would make sense if it’s trained on much more data / is much smarter, and especially if the data is high quality and avoids genuinely contradictory or stupid text (ie. less Internet garbage, more expert/curated text). It would then be trying even harder to squeeze all the possible Bayesian juice out of any given prompt to infer all the relevant latents, and become ever more hyper-sensitive to the slightest nuance in your prompt—even the nuances you didn’t intend or realize were there, like non-robust features. This is consistent with your comments about how it ‘knows’ you are posting only to LW2 or when you’re posting, and so any hint of it being you triggers immediate guessing. I remember with GPT-3 getting hints of how responses felt like it was trying to figure out who I was to better predict the next token [that I would have written], and I’m not surprised if a GPT-4 would amplify that feeling. The RLHFed GPT-4 wouldn’t feel like this because the point of the raters & reward-modeling is in large part to scrub away individuality and render those latents fixed & irrelevant.
This also sheds some light on why Sydney (a snapshot of GPT-4-base partway through training) would disagree with the user so much or be so stubborn. It’s not that the MS training was responsible; that behavior is more characteristic of the base model.
(Remember, a Bayes-optimal meta-learner will be extremely ‘aggressive’ in making ‘assumptions’ when it has highly informative priors, and may choose actions which seem wildly risk-seeking to someone raised on sluggish stupid overly-general & conservative algorithms. This is a qualitative description you see very often of the best RL agents or any solved game (eg. chess endgame tables); like in my coin flip demo, where the optimal MDP policy can look like it’s taking insane risks when it’s down early on, but nevertheless, it almost always winds up paying off. Similarly, in the POMDP, the Bayes-optimal policy can look like it launches into betting after far too few observations, committing prematurely to a naive human’s eyes, but nevertheless approaching very closely the original MDP’s value despite starting off ignorant of the latent parameters.)
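For a concrete sense of what “insane-looking but optimal” means here, below is a minimal sketch of solving a coin-flip betting MDP by backward induction. I’m assuming a setup along the lines of the Haghani & Dewey game (integer wealth, a 60%-heads coin, up to 300 bets, wealth capped at $250, objective = expected final payout), which may not match the linked demo’s exact parameters.

```python
# Hedged sketch: backward induction on an assumed coin-flip betting MDP.
# Setup (assumption, may differ from the linked demo): integer wealth, 60%-heads coin,
# up to 300 bets, wealth capped at $250, objective = expected final (capped) wealth.
import numpy as np

P_HEADS, CAP, HORIZON = 0.6, 250, 300

def solve(p=P_HEADS, cap=CAP, horizon=HORIZON):
    V = np.arange(cap + 1, dtype=float)        # terminal value = final wealth
    bet = np.zeros(cap + 1, dtype=int)
    for _ in range(horizon):                   # work backwards from the last flip
        newV = V.copy()
        for w in range(1, cap):                # 0 and cap are absorbing states
            bets = np.arange(w + 1)
            vals = p * V[np.minimum(w + bets, cap)] + (1 - p) * V[w - bets]
            best = int(vals.argmax())
            newV[w], bet[w] = vals[best], bets[best]
        V = newV
    return V, bet                              # bet[w]: optimal first-move wager at wealth w

V, bet = solve()
print(bet[25], V[25])  # the optimal policy wagers very aggressively when "down early"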
Have you not used the public RLHF’d GPT-4 enough to compare it with the GPT-4-base model? I’d also be curious if you tried to do best-of sampling beyond just your 4-samples + manual selection approach. (I felt that BO sampling boosted the GPT-3-base models a lot and have been missing it ever since. It can only be done with base models and can’t be recreated with any of the RLHFed models given that RLHF seems to screw with/flatten the logits (which they no longer report) so you don’t get meaningful ‘beams’ nor any way to rank the beams.)
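For reference, a minimal sketch of what best-of (BO) sampling amounts to: sample N continuations and keep the one the model itself scores highest. The endpoint, model name, and logprobs field layout below are assumptions based on the legacy OpenAI completions API and may need adjusting; the point is only that ranking needs per-token logprobs, which is why it works with base models but not the RLHFed chat models.

```python
# Hedged sketch of best-of-N sampling: rank N sampled continuations by mean
# token logprob and keep the best. Assumes a legacy completions-style endpoint
# that returns per-token logprobs; exact field names are assumptions.
from openai import OpenAI

client = OpenAI()

def best_of_n(prompt: str, model: str = "davinci-002", n: int = 16,
              max_tokens: int = 256, temperature: float = 1.0) -> str:
    resp = client.completions.create(
        model=model,
        prompt=prompt,
        n=n,
        max_tokens=max_tokens,
        temperature=temperature,
        logprobs=1,                      # request logprobs of the sampled tokens
    )

    def mean_logprob(choice) -> float:
        # average per-token logprob of the continuation (ignoring any None entries)
        lps = [lp for lp in choice.logprobs.token_logprobs if lp is not None]
        return sum(lps) / max(len(lps), 1)

    return max(resp.choices, key=mean_logprob).text
```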
And another reason why all this is relevant: we know that fine-tuning GPT-3.5 can produce drastic boosts in narrow domains, and some of us (myself included) expected the same from fine-tuning GPT-4, namely being able to reach the performance of a non-existent GPT-4.5 (or 5) in narrow domains.
But that’s not what has happened. Instead OpenAI has communicated that
and, moreover, therefore
It is very important to understand the mysterious base-GPT-4 better, both in the context of the potential benefits and hazards of GPT-4 fine-tuning, and in the context of these newly emerged difficulties in fine-tuning it as fruitfully as GPT-3.5.
I’m not sure finetuning GPT-3 is all that different or those difficulties ‘newly emerged’.
As I recall, the original GPT-3 finetuning API was removed not terribly long after it was announced and didn’t come back for a long time. There were also issues with finetune users like AI Dungeon 2. This might have been connected with OA doing shenanigans behind the scenes with the ‘finetuning’: OA declined to talk about what the ‘finetuning’ even was, and the general assumption seems to be that they were doing some sort of cheap lightweight finetune or hack and not a true finetune.
(These are why I never wound up doing any of the GPT-3 finetuning ideas I had back in 2020, like trying to fix poetry by re-tokenizing our poem corpus into IPA phonetic notation—why waste the time & hundreds of dollars if OA is just going to screw it up behind the scenes & not even give you a hint why?)
Right. But the reports specifically on GPT-3.5-turbo fine-tuning announced in August were glowing, with people reporting being able to reach GPT-4-like levels of performance in narrow domains.
That’s why our expectations were high.
I am sure they do something relatively lightweight, like LoRA (https://arxiv.org/abs/2106.09685), which is what people mostly tend to use (I think).
And, of course, if one believes the rumors that GPT-4 is very different from a conventional GPT-3-like Transformer, the difficulties might easily have emerged if one has been trying to do something LoRA-like with it.
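For context, a minimal sketch of what a LoRA-style adapter does (per the linked paper): the pretrained weight is frozen and only a low-rank update is trained. This is just the generic technique; nothing is known about what OA actually runs behind its finetuning endpoint.

```python
# Minimal sketch of a LoRA adapter (Hu et al. 2021, arXiv:2106.09685):
# freeze the pretrained weight W and learn a low-rank update B @ A on top of it.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        # frozen base output + scaled low-rank correction
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```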
Indeed, but only years after their original attempt. All of the early GPT-3 finetuning reports were very… meh. No one seemed terribly happy with it.
That’s my point: it seems like the first attempts did not go well for GPT-3. So, it’s not clear that the first attempts going poorly for GPT-4 is anything different. Perhaps in another 3 years, OA will have a new GPT-4 finetuning service which doesn’t require “more work” and Just Works™. (One does hope it wouldn’t take that long the second time around.)
OA does have a new finetuning service for GPT-4o, and people seem to be happier with it, but OA has also apparently confirmed that it’s a LoRA (as I was speculating about it being a cheap shallow hack rather than true finetuning): https://x.com/CFGeek/status/1826749739502895618 https://www.youtube.com/watch?v=X57GT1Y5URY&t=2479s
It also is doing shenanigans behind the scenes like trying to dynamically guess a size but apparently hiding that from you if you aren’t a favored customer: https://x.com/CFGeek/status/1826749748549988800
So, I continue to maintain that OA “finetuning” is unfit for research* and for any purposes that involve deep transformation of the model rather than ‘locating’ an existing capability. Especially now that Llama-3-405b has been released and you can finetune that yourself and be sure that it genuinely is finetuning rather than a pinchbeck substitute.
* ie. it can be OK if you have an extremely specific claim like ‘the OA blackbox finetuning service does or does not do X’; but it is totally illegitimate to argue ‘GPT-4 cannot do X as proven by our OA-finetuned version still not doing X’, which is the usual way it comes up in DL research. At best, it is a loose lower bound, and should be treated no more seriously than lazy garbage arguments like ‘we tried a few prompts and X didn’t work, therefore, LLMs will never do X’.
Thanks, that’s very useful to know!
It’s still not trivial to finetune Llama 405B. With Adam you need roughly 16 bytes/parameter, plus activation memory, so a minimum of ~100 H100s.
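As a rough sanity check of that figure (my own arithmetic, using the usual mixed-precision Adam accounting; exact byte counts vary by setup):

```python
# Back-of-the-envelope memory for full finetuning of a 405B-parameter model with Adam.
params = 405e9
bytes_per_param = 2 + 2 + 4 + 4 + 4     # bf16 weights + bf16 grads + fp32 Adam m, v + fp32 master weights = 16
state_bytes = params * bytes_per_param  # ~6.5e12 bytes, i.e. ~6.5 TB
h100_memory = 80e9                      # 80 GB per H100
print(state_bytes / h100_memory)        # ~81 GPUs for weights/optimizer state alone; activations push it toward ~100
```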
There are lots of people working on it, offering it or about to offer it. And even when they aren’t offering true finetuning, it’s still better: Snowflake (the first hit in Google for “Llama 405B finetuning”), for example, makes no bones about their single-node lightweight finetuning being a LoRA, and is open-sourcing the code upfront, so at least you know what it is now, instead of depending on borderline gossip buried 40 minutes into a YouTube video months or years later.
What are the rumors? I’m only aware of MoE.
Yes, the main rumor is that it’s a mixture-of-experts. This is already quite a difference from a single Transformer.
We presume that these experts are mostly made of various components of a Transformer (with some possible additions and modifications we don’t know about), but we don’t know how independent those experts are, whether they share a sizeable common initial computation and then branch off from it, or whether it’s something else entirely, with some kind of dynamic sparse routing through a single network, and so on... I think it’s unlikely to be “just take a bunch of GPT-3s, run an appropriate subset of them in parallel, and combine the results”.
There is a huge diversity of techniques combining the MoE motifs and motifs associated with Transformers, see e.g. this collection of references https://github.com/XueFuzhao/awesome-mixture-of-experts
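For concreteness, here is a minimal sketch of the most common published variant of that motif, a top-k token-routed MoE feed-forward layer; this is purely illustrative and implies nothing about GPT-4’s actual architecture.

```python
# Illustrative sketch of a standard top-k token-routed MoE feed-forward layer.
# (Many variants exist, e.g. some renormalize the top-k gate weights; this one does not.)
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 16, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.k = k

    def forward(self, x):                        # x: (tokens, d_model)
        gates = self.router(x).softmax(dim=-1)   # (tokens, n_experts)
        topv, topi = gates.topk(self.k, dim=-1)  # each token picks its k best experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e        # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += topv[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```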
So, we really don’t know, these rumors are only enough to make some partial guesses.
If we survive for a while, all this will eventually become public knowledge, and we’ll probably come to understand how the magic of GPT-4 is possible.
Yes, I used it quite a bit. So, yes, all of us can compare to some extent.
But I’ve also read Janus enough (here and on twitter) to know that RLHF mutilates models quite a bit (both via “mode collapse” and via other multiple pathologies; the net result is drastic restrictions of the set of simulations the model can create).
So it potentially might be that base-GPT-4 is drastically more powerful than RLHF’d GPT-4 if one knows how to handle it right...
So, in fact, I particularly wanted Janus’s impressions to be recorded and shared, because I really wanted to know how base-GPT-4 looks through the prism of their general insights, given their writings on Simulator theory and on LLMs in general (and given their ability to deal with the potentially high non-triviality of handling a non-RLHF’d GPT-4; in this sense, note their remark on how base-GPT-4 is particularly sensitive to the quality of prompt writing; it’s a very different beast, much more difficult to handle than RLHF’d GPT-4, but the payoff for a qualified interlocutor might be really high).
Although, of course, I’d love to have impressions from other people, and I’d love to read discussions about this… For that we need more people with access to base-GPT-4 to at least notice this post :-)
I’m confused about what in my comment made you ask this, but the answer is yes, I’ve used it a fair amount and can easily compare it to the GPT-3 base model (or was that not directed at me?)
* GPT-4-base
Thanks, this is very interesting, sheds a lot of light onto base-GPT-4.
Here’s another account, from someone who says they were on the GPT-4 redteam, a Nathan Labenz (who I am not very familiar with but he is named as a tester in the GPT-4 paper and no one I’ve seen has chimed in to claim he’s making it all up).
The primary purpose of this account is to document how OA management, possibly including Sam Altman, seemed not to consider GPT-4 worth the board’s time, nor forwarded to it any of the reports, like the documentation about it being capable of autonomy & successful deception (eg. the CAPTCHA thing). When he contacted a safety-oriented board member (presumably Helen Toner, as the safety member who researches this topic, eg. the very paper which Altman was trying to get her fired over), the board member was subsequently told by OA management that the author was dishonest and ‘not to be trusted’; the board member believed them, and told the author to stop contacting them. He was then kicked out of the redteaming (where apparently some redteamers, despite being poorly trained, not very good at prompt engineering, and minimally supervised, were being paid $100/hour).
Anyway, all that context aside, he spent a lot of time with the base model and additional RLHF-tuned models, and this is how he describes it (to explain why he was alarmed enough to do any whistleblowing):
An apparently unnoticed example of gpt-4-base in a belated May 2024 podcast about an August 2023 book: the followup to that NYer article, which turned into a book of code-davinci-002 poems (titled I Am Code). It’s hard to imagine davinci-003 or any of the ChatGPTs writing those, so by elimination, what an OA researcher was privately sharing must be gpt-4-base. It is possible they don’t name the model explicitly because OpenAI didn’t sign off on it, or they didn’t realize the “newer AI” was old news by the publication of the book in August 2023 or of this podcast on 2024-05-31 (GPT-4 was launched 2023-03).

(I also appreciate that This American Life makes an effort to emphasize the damage done to creative writing by the tuning, and that code-davinci-002 or gpt-4-base write very differently from the ChatGPT everyone has used.)

Also of interest are their interactions with OpenAI and the OA researcher Dan Selsam, as well as their descriptions of how code-davinci-002 differs from ChatGPT and what it feels like.

They do not seem to have gotten access. Further, I would note that despite Altman’s public promise, very few people seem to have been given access (the best I can find out is that Janus and maybe 2 others had access, and then others through them).
On a side note, I am struck by their extensive research and engagement with GPT-3 poetry and consultations with everyone from Blake Lemoine to Stephen Wolfram, while apparently, even in March 2023, being totally ignorant of my 2020 GPT-3 poetry page (then, and now, still the most extensive and frequently cited compilation of GPT poetry in both popular & academic sources, and read by many OA researchers), given that they do not cite or mention me anywhere. One would think they were the first to ever try to write poetry with GPT-3, before everyone started using ChatGPT, given comments like “I have never really heard anyone try to probe it as a kind of creative entity.”
Nevertheless, despite their ignorance, they apparently still managed to rediscover many of the same tricks and points: they also use a ‘book’ prompt similar to my “Transformer Poetry” prompt, they also discover that “in the style of X” works worse than “by X”, they also find that few-shotting poems helps a lot, and they also find that davincis have a propensity for eerie AI poems...
COAGULOPATH spotted some more GPT-4-base quotes from Simon Rich (I wonder how many he has in total?) in an August 2023 Time op-ed accompanying the book (also confirming that the ‘newer’ model was in fact GPT-4-base, oddly renamed base4 here):

Short story:

Also, some more code-davinci-002 Onion headlines:

“Does Sam Altman Know What He’s Creating?” describes the base GPT-4 model similarly:
Today’s NYer (which is almost entirely about the MS perspective / MS sources of the Altman firing), in addition to further confirming that Altman was manipulating the board to try to get Toner fired, includes some description of what seems to be the MS half of redteaming ‘Prometheus’ (the partially trained GPT-4 snapshot that OA had to give MS for creating the unRLHFed Bing Sydney):
Incidentally, this account explicitly says that there was RLHF, by name, which contradicts both the observed behavior of Sydney and the WSJ reporting that Sydney was released without safety training; this is not a confusion with the other kinds of safety training MS did like the self-generation, because that’s described in the following paragraphs.
I don’t know how to reconcile this: it is possible that Charles Duhigg’s MS sources like Kevin Scott & Sarah Bird are eliding or swapping around the chronology (Sydney disappeared and was replaced later on by a Bing model that acted much more like an RLHFed model). This article feels rather rushed out to be topical, so he may not have done as much digging as usual for a NYer article and doesn’t realize that he’s serving up a very pro-MS narrative. It’s also possible that my interpretation of ‘Sydney was not RLHFed’ is wrong and they actually did ‘RLHF’ it but did it so incompetently that no one noticed.
I suspect it’s the former, because their explicit attitude is that any AI danger should be discovered the hard way, by unboxing it and setting it loose to see what it does:
So, they unleashed Sydney, didn’t like it, and ‘added a mitigation when it became necessary’ after ‘monitoring social media’, and then dilated at length to the NYer guy about all the RLHF training they did to make the model safe—afterwards. (Not the only detail in there that is misleading or probably wrong. I rather doubt that Nat Friedman had to be told by Kevin Scott that LLMs were cool for coding, for example, and I bet that anecdote came from Scott...)