Planning in LLMs: Insights from AlphaGo
Introduction
Talk of incorporating planning techniques such as Monte Carlo tree search (MCTS) into LLMs has been bubbling around the AI sphere recently, both in relation to Google’s Gemini and OpenAI’s Q*. Much of this discussion has been in the context of AlphaGo, so I decided to go back and read through AlphaGo and some subsequent papers (AlphaGo Zero and AlphaZero). This post highlights what these papers did in the context of LLMs and some thoughts I had while reviewing the papers.
When I say LLMs in this post I am referring to causal/decoder/GPT LLMs.
AlphaGo
Overview
AlphaGo trains two supervised learning (SL) policy networks, a reinforcement learning (RL) policy network, a SL value network, and uses MCTS for planning. It learns to play the game of Go.
SL Policy Networks
Two SL policy networks are trained:
A slow policy network using a CNN. Used to compute prior probability of state-action pairs during MCTS rollouts.
A fast policy network using hand-crafted linear features. Used for full game simulations during MCTS rollouts.
Both networks are trained to predict the next move from a data set of expert games. This is the same self-supervised target as LLMs trained to predict the next token from a data set of internet text; AlphaGo learns a softmax over next moves & LLMs learn a softmax over next tokens.
This use of a fast and slow model reminds me of speculative decoding. I haven’t thought through the implications of this much and it is far from a perfect analogy, but it could be a useful insight.
RL Policy Network
The RL policy network is trained by fine-tuning the SL policy network with REINFORCE on games between the current RL policy network and randomly selected previous RL policy networks. Rewards are +1 for winning and −1 for losing.
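To make the update concrete, here is a minimal sketch of a single REINFORCE step for a softmax policy over the moves available in one position. It is a toy under stated assumptions, not AlphaGo's training code: the function names, the learning rate, and the single-position setup are all hypothetical, and the real update is accumulated over every move of many games.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of move logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(logits, action, outcome, lr=0.1):
    """One REINFORCE step for the move `action` played in this position.

    outcome is the game result z: +1 for a win, -1 for a loss.
    For a softmax policy, the gradient of log pi(action) with respect to
    the logits is onehot(action) - pi, so a win pushes probability toward
    the move that was played and a loss pushes it away.
    """
    probs = softmax(logits)
    return [
        logit + lr * outcome * ((1.0 if i == action else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]
```

Calling `reinforce_update([0.0, 0.0, 0.0], action=1, outcome=+1)` raises the probability of move 1, and the same call with `outcome=-1` lowers it.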
This RL policy network, using no search, won 85% of games against Pachi, a Go program that used 100,000 MCTS simulations per move. This shows that pure RL can outperform pure search, but usually a combination of the two gives the best performance.
The RL policy network was not actually used in the final version of AlphaGo; it was only used to generate data for training the value network. The authors noted that the SL policy network performed better than the RL policy network “presumably because humans select a diverse beam of promising moves, whereas RL optimizes for the single best move.”
This lack of diversity is reminiscent of the mode collapse phenomenon in certain GPT models fine-tuned on human feedback data. Human selection does not encourage diversity in this case. I think the pre-training data is more varied than the fine-tuning data in both cases, leading to more diverse outputs from the pre-trained network than the fine-tuned one. The lack of diversity could also be caused by value lock-in as mentioned in this comment.
Value Network
The value network is trained with SL to predict the outcome of positions from self-play games between the RL policy network and itself. The value network outputs a single prediction instead of a probability distribution over moves.
The authors found that predicting game outcomes from data consisting of only complete games led to overfitting. To mitigate this, they generated 30 million distinct positions and had the RL policy network play against itself from each position until the game terminated. Training on this new data led to minimal overfitting.
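The paper attributes the overfitting to successive positions within a game being strongly correlated while sharing one outcome label. A minimal sketch of the decorrelating idea, keeping a single position per game, might look like the following. The data layout and the function name are assumptions for illustration, and AlphaGo's actual data-generation pipeline differs in detail.

```python
import random


def sample_value_training_set(games, rng):
    """Keep one (position, outcome) pair per self-play game.

    games: list of (positions, outcome) pairs, where positions is a list of
    board states from one game and outcome is +1 or -1 for that game.
    Sampling a single position per game avoids many near-identical
    positions that all carry the same label.
    """
    return [(rng.choice(positions), outcome) for positions, outcome in games]
```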
Based on this, I am interested to know if data for the reward modeling stage of RLHF consists only of complete conversations, or if subsets of these conversations are used.
Monte Carlo Tree Search
MCTS in AlphaGo
I didn’t fully understand how MCTS works in the context of AlphaGo (or at all really) as I was writing this, so this section will be my attempt to explain it in my own words. You can skip this if you already know it.
MCTS consists of 4 stages: selection, expansion, evaluation, and backup.
Selection: traverse the game tree from the root until reaching a leaf node. Each traversal is the edge (action) with the maximum upper confidence bound (UCB): $a_t = \arg\max_a \left( Q(s_t, a) + u(s_t, a) \right)$, where $u(s, a) \propto \frac{P(s, a)}{1 + N(s, a)}$; this gives an exploration bonus to uncertain action-values. I’ll explain $Q$, $N$, and $P$ below.
Expansion: upon reaching a leaf node, compute $P(s, a)$ for that node from the SL policy network.
Evaluation: the evaluation of a leaf node $s_L$ is a weighted sum of the value network prediction $v_\theta(s_L)$ for the node and the outcome $z_L$ of a game played from $s_L$ by the fast policy network. $V(s_L) = (1 - \lambda) v_\theta(s_L) + \lambda z_L$; the authors found that $\lambda = 0.5$ worked best.
Backup: update the $N(s, a)$ and $Q(s, a)$ for each node visited during the simulation. $Q(s, a)$ is the average $V(s_L)$ for an edge.
This is repeated for some amount of simulations. For AlphaGo, it was however many simulations could be completed within 5 seconds. They used an asynchronous policy and value MCTS (APV-MCTS) algorithm which executes simulations in parallel.
Explaining some of the variables from above:
$Q(s, a)$ is the action-value for an edge in the search graph.
$N(s, a)$ is the visit count for an edge in the search graph.
$P(s, a)$ is the SL policy network action probability for an edge in the search graph.
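The four stages can be sketched in a few dozen lines. This is a simplified, single-threaded illustration rather than AlphaGo's APV-MCTS: `policy_fn`, `value_fn`, `rollout_fn`, and `step_fn` are hypothetical callables standing in for the SL policy network, the value network, a fast-policy rollout, and the game rules; leaves are expanded on the first visit; values are not sign-flipped between players; and the `+ 1` under the square root is my own tweak so priors break ties on a node's first visit. Here `P`, `N`, and `Q` are the prior, visit count, and action-value of an edge, `u` is the exploration bonus, and `lam` is the mixing weight between value network and rollout.

```python
import math


class Edge:
    def __init__(self, prior):
        self.P = prior     # prior probability P(s, a) from the policy
        self.N = 0         # visit count N(s, a)
        self.W = 0.0       # sum of leaf evaluations backed up through here
        self.child = None  # Node reached by taking this edge

    @property
    def Q(self):
        """Action-value Q(s, a): the average backed-up leaf evaluation."""
        return self.W / self.N if self.N else 0.0


class Node:
    def __init__(self, state):
        self.state = state
        self.edges = {}    # action -> Edge; empty until the node is expanded


def select_action(node, c_puct):
    """Pick the action maximizing Q(s, a) + u(s, a)."""
    total_n = sum(e.N for e in node.edges.values())

    def score(item):
        _, e = item
        u = c_puct * e.P * math.sqrt(total_n + 1) / (1 + e.N)
        return e.Q + u

    return max(node.edges.items(), key=score)[0]


def simulate(root, policy_fn, value_fn, rollout_fn, step_fn, lam=0.5, c_puct=5.0):
    """Run one simulation: selection, expansion, evaluation, backup."""
    node, path = root, []
    # Selection: follow max-UCB edges until reaching an unexpanded node.
    while node.edges:
        action = select_action(node, c_puct)
        edge = node.edges[action]
        path.append(edge)
        if edge.child is None:
            edge.child = Node(step_fn(node.state, action))
        node = edge.child
    # Expansion: attach priors for every action available at the leaf.
    node.edges = {a: Edge(p) for a, p in policy_fn(node.state).items()}
    # Evaluation: mix the value prediction with a rollout outcome.
    leaf_value = (1 - lam) * value_fn(node.state) + lam * rollout_fn(node.state)
    # Backup: update visit counts and value sums along the traversed path.
    for edge in path:
        edge.N += 1
        edge.W += leaf_value
    return leaf_value
```

After many calls to `simulate` on the same root, the move actually played would typically be the root edge with the largest visit count `N`.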
MCTS and RLHF
RLHF is very similar to the latter stages of AlphaGo. The reward model(s) in RLHF correspond to the value network in AlphaGo, and the human comparison between model outputs in RLHF can be viewed as a special case of MCTS. In RLHF, two (or more) model outputs are shown to a human rater and they rank these outputs. Each of these outputs can be viewed as a single MCTS simulation. For RLHF each simulation takes place from the root node (the end of the user input), there is no exploration bonus $u(s, a)$, and the evaluation relies entirely on the outcome of the full rollout ($\lambda = 1$ in the evaluation stage's weighted sum).
The similarities between MCTS and RLHF suggest some improvements to RLHF. The simplest one I could think of is to have the model completions branch at random points instead of branching at the end of the user input. If the AlphaGo value network overfitting from training only on full games carries over to LLMs, it could be mitigated by branching at a random point in the model generation. Another improvement could be to add more branching throughout the model generation, leading to more model outputs to rank. This would be difficult to get human feedback for but could be done using feedback from another AI as in RLAIF.
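A minimal sketch of the random-branching idea, assuming completions are lists of tokens and that `sample_continuation` is a hypothetical stand-in for sampling the rest of a response from the LLM:

```python
import random


def branch_at_random_point(prompt, completion, sample_continuation, rng, n_branches=2):
    """Create completions that share a random-length prefix of `completion`.

    A cut of 0 reproduces standard RLHF, where every output branches at the
    end of the user input. `sample_continuation(context)` is hypothetical:
    it should return a list of tokens continuing `context`.
    """
    cut = rng.randrange(len(completion) + 1)
    shared = completion[:cut]
    branches = [
        shared + sample_continuation(prompt + shared) for _ in range(n_branches)
    ]
    return shared, branches
```

The rater (human or AI) would then rank `branches`, which differ only after the shared prefix.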
In this paper, the authors iteratively update the reward models as more data is produced through the model playing with human users. In AlphaGo the value network is fixed, even as more data is produced through the model playing with itself.
Next Token Prediction
Some people say that LLMs are simply predicting the next token. Would they say the same of AlphaGo? Does AlphaGo’s use of inference-time MCTS suddenly make it an agent? Using RL doesn’t suddenly make a policy agentic, so why should MCTS? Even if LLMs and AlphaGo without MCTS are “simply” predicting the next token, this doesn’t mean they aren’t agents or aren’t intelligent. As I mentioned earlier, the AlphaGo RL policy network without search beat the strongest open-source Go program at the time which executed 100,000 MCTS simulations per turn.
In the same way that RL doesn’t suddenly make a policy agentic, SL doesn’t mean a policy isn’t agentic. Moves in a Go data set and tokens on the internet were created, in sequence, by a human with intent (most of the time). This intent is implicitly inherited by models trained on the data. Consider the Mountain Car environment. A model that only cared about the next action would only move right. A model trained with SL on expert data would initially move left, not because it has some plan for the future, but because it learned that left is the action an expert would take. This model wouldn’t be agentic, but I think it is overwhelmingly likely there exists a (hypothetical) data set that could train a model with SL and this model would be considered an agent by human standards.
LLMs simply predicting the next token is discussed further here, with analogies to AlphaGo brought up in these comments. An interesting thought experiment is brought up in this comment. In short, if you ask a LLM to remember a number for the future, does it actually do this, or does it generate a new number when asked what the number was?
My thoughts on this are that LLMs don’t store a number (this would require them to have a memory), but that doesn’t mean the LLM isn’t considering the future. The LLM’s “plan” will be updated with each token it generates, like how a chess player’s plan will change based on the opponent’s move. The LLM is not playing against an opponent when generating text; in my mind it acts like this improv game. Each person (LLM) says a word (token) with an idea of where the story will go, but the other person only has a vague idea of the first’s intentions and will continue the story in a slightly different direction. As the softmax temperature increases, this prediction of the other’s intentions becomes more difficult. This all reminds me of acausal trade.
Overall, I believe that LLMs know a lot more than is implied by their ability to “simply” predict the next token. RLHF reward models are fine-tuned versions of the pre-trained model with the classification head replaced by a regression head. These reward models know a lot more than what the next token should be. I also presume the OpenAI text and code embeddings are an intermediate layer of GPT, or maybe a new head with a small amount of fine-tuning.
AlphaGo Zero
Preface
In the AlphaGo Zero (AGZ) paper it is mentioned that a second, slightly different, version of AlphaGo was created for the match with Lee Sedol. This second version is referred to as AlphaGo Lee, while the original version in the AlphaGo paper is referred to as AlphaGo Fan.
Overview
The next step after AlphaGo was AGZ, which learned the game of Go from scratch. AGZ combined the policy network and value network from AlphaGo by using one network with two heads. It learns by policy iteration: self-play with search is used for policy improvement and game outcome is used for policy evaluation. This policy iteration is similar to Iterated Distillation and Amplification (IDA).
Iterated Distillation and Amplification
In this post, Paul Christiano talks about how AGZ is a “nice proof of concept of a promising alignment strategy.” This alignment strategy, benign model-free RL, is what (I think) eventually came to be known as IDA. The way RLHF is used for LLM training in this paper is also similar to IDA. Explaining IDA in terms of [ AGZ | LLM RLHF ]: a slow model [ MCTS | human ] is used to train a better fast model [ policy network & value network | LLM & reward model ]. The better fast model is then used to improve the slow model, the better slow model trains a better fast model, and so on.
Language Modeling as a Markov Decision Process
This blog post is linked in Paul’s post on AGZ. One thing this post brings up is modeling conversation as a Markov decision process (MDP), more specifically a partially observable MDP (POMDP). The author suggests the state be some hand-crafted features and the actions be full dialogue turns chosen from a pre-determined set of monologues. This made sense at the time (February 2017), but with LLMs the MDP can be constructed at the token level.
This paper views language modeling as a POMDP, with actions as the possible set of next tokens and observation as a history of tokens. This paper views goal-directed dialogue as a MDP, with the initial state as some task-specific prompt, actions as next tokens, next state as the previous state with the action appended to the end, and reward based on the final state and some target string.
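The token-level MDP for goal-directed dialogue can be written down directly. The sketch below is only an illustration of the description above: the class name, the end-of-sequence token, and the exact-match reward are all assumptions, not details taken from either paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenMDP:
    vocab: tuple         # action space: every possible next token
    target: tuple        # hypothetical target string, as a tuple of tokens
    eos: str = "<eos>"   # hypothetical end-of-sequence token

    def initial_state(self, prompt):
        """The initial state is the task-specific prompt."""
        return tuple(prompt)

    def step(self, state, action):
        """Deterministic transition: append the chosen token to the state."""
        if action not in self.vocab:
            raise ValueError(f"unknown token: {action!r}")
        return state + (action,)

    def is_terminal(self, state):
        return len(state) > 0 and state[-1] == self.eos

    def reward(self, state, prompt_len):
        """Sparse reward: 1.0 only if the finished generation equals the target."""
        if not self.is_terminal(state):
            return 0.0
        generated = state[prompt_len:-1]
        return 1.0 if generated == self.target else 0.0
```

In the POMDP view, the same token history would be treated as the observation rather than the full state.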
LLMs directly predict action (token) distributions from the token history; they don’t explicitly predict the hidden state from the observation (token history). Despite this, I think it is likely that LLMs implicitly predict the hidden state: somewhere within the transformer the computation transitions from mostly state prediction to mostly action prediction. Some evidence of this is that Othello-GPT has a world representation.
Thinking about language modeling in terms of a POMDP enables a more structured way of thinking about LLMs. For example,
Why are many observations (token histories) by the LLM grouped together as simulators? Do they all have similar hidden states in the POMDP as predicted by the LLM’s implicit world model? Which (if any) of these observations are more agentic than others?
What is happening in the observation (token history) of the POMDP that causes LLMs to collapse into a Waluigi?
There are two other subtypes of MDPs that I think are important to consider.
This paper views human dialogue as a hidden parameter MDP, which could also be a potential way of thinking about simulators.
Tokenization can also be viewed as creating options (i.e. temporally extended actions) over the action space of some “alphabet” (e.g. UTF-8). An MDP with options is called a semi-MDP. In this paper, options are created using BPE and used for more efficient sparse-reward learning in a few RL environments.
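To illustrate tokens as options, here is a toy BPE-style merge loop over UTF-8 bytes. Each learned merge is a temporally extended action: one token choice that commits to several primitive byte actions. This is a simplified sketch (real BPE tokenizers count pairs over a whole corpus and handle word boundaries), and the function names are my own.

```python
from collections import Counter


def most_frequent_pair(seq):
    """Return the most common adjacent pair in seq, or None if seq is too short."""
    pairs = Counter(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(seq, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_options(text, n_merges):
    """Learn up to n_merges options (multi-byte tokens) from a string."""
    seq = [bytes([b]) for b in text.encode("utf-8")]  # primitive actions
    options = []
    for _ in range(n_merges):
        pair = most_frequent_pair(seq)
        if pair is None:
            break
        options.append(pair[0] + pair[1])
        seq = merge_pair(seq, pair)
    return seq, options
```

For example, `learn_options("abababc", 2)` first learns the option `b"ab"` and then `b"abab"`, so seven byte-level actions become three token-level actions.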
Learning from Scratch
The other change from AlphaGo to AGZ, besides IDA, was learning the game from scratch. Instead of being pre-trained on supervised games and further learning through RL self-play, AGZ exclusively learns through RL self-play. AGZ presumably sees more varied positions than AlphaGo since early parts of its search tree are not heavily biased by pre-training on expert data.
This learning from scratch would be extremely difficult for training LLMs. Even if there were enough annotators and time to do RLHF from scratch, ranking the gibberish strings of tokens produced at the start of training would be impossible.
One way learning from scratch could possibly be done is through RLAIF. I’m not sure if the oversight LLM would be able to meaningfully rank the gibberish strings during early training. If not, an alternative could be to start with a small context length and as the model learns to produce coherent strings the context length can be increased. This could be augmented with methods like Pretraining Language Models with Human Preferences to encourage the model to be aligned with human values.
Exploration and Agency
This section is a bit out there, and I’m a lot less sure of what I’m saying here than I am in the rest of this post.
A difference between RLHF/RLAIF and pre-training is that the RL methods can explore new LLM generations. I believe that exploration is a big reason why RL policies are more likely to be agentic than SL policies.
It makes intuitive sense to me that any offline RL data set can be converted to a SL data set by using the offline RL data to compute the best action distribution and/or value for each state and training the SL policy to mimic this. Additionally, any stationary MDP can be converted into an offline RL data set through infinite online exploration of the MDP. From this, the only real difference between stationary online RL and SL is that the RL policy must efficiently explore its environment to gain data and decide which data points to learn from.
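A minimal sketch of the offline RL to SL conversion, assuming a small tabular problem where the offline data is a list of `(state, action, return)` tuples. Using the average observed return as the value estimate is my simplification; any other offline estimate of action values could be substituted.

```python
from collections import defaultdict


def offline_rl_to_sl(transitions):
    """Turn (state, action, return) tuples into SL targets {state: best_action}.

    The best action for a state is the one with the highest average
    observed return in the offline data.
    """
    totals = defaultdict(lambda: [0.0, 0])  # (state, action) -> [sum, count]
    for state, action, ret in transitions:
        entry = totals[(state, action)]
        entry[0] += ret
        entry[1] += 1

    best = {}  # state -> (action, mean return)
    for (state, action), (total, count) in totals.items():
        mean = total / count
        if state not in best or mean > best[state][1]:
            best[state] = (action, mean)
    return {state: action for state, (action, _) in best.items()}
```

An SL policy trained to imitate the returned mapping never explores; all of the exploration happened when the offline data was collected.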
Another way to arrive at this conclusion is to consider a RL policy operating in an online environment. The policy collects data and eventually performs a batch update to itself. This update could be matched exactly by some SL update. One difference is that the online RL policy is continually collecting new data by exploring its environment, while a SL data set is pre-determined. The other difference is that batch RL updates are usually taken from recent data, while batch SL updates are usually sampled equally from all data. Therefore, the “agency” of RL comes from gaining new data through exploration and choosing what data to learn from.
As an example of this, the value network learned in AlphaGo is trained using SL on the outcomes of games between the RL policy network and itself. In fact, AlphaGo Fan doesn’t use any RL in the final program: the value network was trained with SL, and the SL policy network was trained with SL. The only use of RL was in training the RL policy network, which was only used to generate a SL data set used to train the value network.
This comment talks about how RL produces inexact gradients, while SL produces exact gradients. While the inexact gradients make RL less sample efficient, it also encourages exploration since it will (in expectation) require more gradient updates to reach an “optimal” policy. The comment, along with this paper mentioned in the comment, also mentions how data in SL is IID, while data in RL is not since a RL policy will influence its own future. This isn’t a relevant difference when it comes to LLMs, since LLMs also influence their own future.
When re-reading this before publishing, I realized that the last sentence of the previous paragraph (struck through in the original post) is incorrect. LLMs do not influence their own future during training, only during inference. I’m not sure of the implications of IID data in SL and non-IID data in RL on agency; I’ll have to think about it more. Takes are welcome in the comments.
AlphaGo, AGZ, and LLMs can all modulate exploration through their softmax temperature. AlphaGo and AGZ can further encourage exploration through scaling on the exploration bonus function $u(s, a)$ in the selection stage of MCTS. Exploration in AlphaGo and LLMs is biased towards the pre-training data. More varied RLHF data could be collected by increasing the softmax temperature, but more variance will require more data to create a sufficiently trained reward model.
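As a small illustration of temperature as an exploration knob, the sketch below computes a softmax at different temperatures and measures the entropy of the result. The function names and example logits are mine; the only claim is the standard one that dividing logits by a larger temperature flattens the distribution.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a positive temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means more varied samples."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```

For logits such as `[2.0, 1.0, 0.0]`, the entropy at temperature 2.0 is higher than at 1.0, which in turn is higher than at 0.5.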
As a final note on MDPs, I believe that the MDPs of human language and values are non-stationary; the meanings of words and what we consider moral changes with time. There is also the concept of ergodicity. An MDP is ergodic if each state is reachable from every other state. Too much exploration can result in an agent reaching a state where it is cut off from the rest of the MDP (e.g. death). If there is only one agent, this is very bad. In populations of agents (e.g. evolution), this is less of a worry as the surviving agents will adapt to avoid bad states.
AlphaZero
Changes from AGZ to AlphaZero are smaller; most changes are to allow AlphaZero to learn Go, chess, or shogi. The changes outlined in the paper are:
AlphaZero’s value network optimizes the expected outcomes, while AGZ’s optimized the probability of winning. This change was because chess and shogi can end in draws.
AlphaZero does not augment the training data or transform the board position. Chess and shogi are not symmetric.
AlphaZero uses a continually updated policy network for self-play, while AGZ used the best policy network from all previous iterations for self-play.
AlphaZero uses the same hyperparameters for all games, except for a scaling factor on policy noise to encourage exploration. AGZ used Bayesian optimization to tune hyperparameters.
The only change in AlphaZero that I think is worth commenting on is using the latest policy network for self-play, rather than the “best” policy network. I assume this would lead to slightly more varied games, as the policy for choosing moves is always changing instead of possibly being the same for many rounds of games.
Further Improvements to AlphaZero
New algorithms based on AlphaZero have been created since, including MuZero, EfficientZero, and this work on diversifying AlphaZero. I plan to make a post about these variants at some point in the future.
LLMs are shockingly good at gibberish, leading to macaronic attacks and other non-obvious implications, so I would not be surprised if an oversight LLM could. (Humans can probably also do this due to dark knowledge but it would be so painful & expensive as to be impractical, as you note.)
However, I don’t think that really gives any kind of equivalent to AlphaZero. At best, you’d wind up exploiting the oversight LLM (like happens in RLHF if you let it optimize hard enough against the reward model). The most important question to ask about a bootstrap is: “where does improvement come from?” Are you applying compute to extract knowledge that the model already knows implicitly, in a Kolmogorov-complexity-esque sense of ‘knows’, or are you acquiring more data? And if so, who, where, and what?
AlphaZero gets improvement from doing planning/search over a perfect model, the simulator used by the tree search. MuZero gets improvement from doing that over a learned model. In the former case, the simulator is assumed to be bug-free and so represents the full game tree; and there is nothing to Go outside the full game tree, so one can become an arbitrarily good Go player, in theory, needing nothing beyond the simulator. In the latter case, you have a learned model, which may not have learned some critical Go rule, and so that’s not true, but if you allow it to periodically play actual groundtruth games and update the learned model based on any glitches, it may quickly become good enough that it is now like the former: the improvement comes from using compute to search more of the full game tree as implicit in the learned model/simulator. Each time it explores the game tree, it is acquiring ‘more data’. (You could imagine setting up a literal, in real life, Go board and filming it, to make concrete that this is ‘data’. However, since the Go game tree is an abstract mathematical object, you don’t need to—you just use compute to simulate it, either with hand-written code or a big RNN. In this special case, there is a highly convenient pun of compute=data.)
In other cases, like inner-monologue or self-distillation, the gain comes from amortizing compute: the model gains absolutely no additional data about the external world, it just gets to see what it already thinks at greater length, and retrain to shortcut to its best already-known answer.
So, if we try to train a LLM from scratch using just feedback from a pre-existing LLM, where would we get the improvement from (if we gained any)? We don’t gain any more data about the world that was not in the pre-existing LLM, so our improvement can’t be coming from new or better data; and it’s not obvious that this is doing anything for us computationally either: what slow lengthy outputs from the pre-existing LLM are we ‘distilling’ into the student LLM? It looks like the student LLM would just gradient-ascend its way to some extremely-specific gibberish attack on the teacher, learn next to nothing about the wide diversity of text the teacher LLM knows as it mode-collapses, and certainly not become ‘superhuman’ in any sense, because there is nothing in the teacher LLM which corresponds to the Go game tree.
So, a viable self-play scheme for LLM needs to answer this question of where it’s getting data and/or compute from which can gradually improve itself. Self-play for LLM probably looks rather different than what you’re imagining from RLAIF, and more like a Bayesian or mathematical ‘game’.
Here’s a semi-concrete example of what I think a LLM self-play scheme needs to look like, since it’s unclear what ‘game a LLM plays’ if we are trying to borrow RL techniques. Perhaps something like Silver & Veness 2010, or maybe something like generating a large argument tree, with key unknowns highlighted, and then after lengthy computations exploring the implications of key claims and bringing in additional ‘facts’ as they become relevant, the most influential ✕ unknown premise X is kicked up to an oracle (human) for labeling and training on the resulting argument tree? (In a tree with arguments and confidences, it should be possible to find the node with the highest Value of Information, which is both uncertain and its possible values change the root the most on average.) Then you can see where it’s getting improvement from: it’s using a lot of computation to elicit its implicit knowledge that premise X affects a lot of other premises, targeting its learning on X, and finetuning on that, thereby becoming able to focus on a new argument tree which can build on X and all of the reasoning steps that were validated by the outcome for X. Thus, it covers both bases and can bootstrap itself.
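A crude sketch of picking which premise to send to the oracle. It scores each premise by how much resolving it would be expected to move the root confidence, which is only a rough proxy for Value of Information. The data layout (each premise maps to its probability of being true and the root confidence under each resolution) and both function names are assumptions for illustration.

```python
def expected_root_shift(p_true, root_if_true, root_if_false):
    """Expected absolute change in the root confidence once the premise is resolved."""
    root_now = p_true * root_if_true + (1 - p_true) * root_if_false
    return (
        p_true * abs(root_if_true - root_now)
        + (1 - p_true) * abs(root_if_false - root_now)
    )


def pick_premise_to_label(premises):
    """premises: {name: (p_true, root_if_true, root_if_false)}.

    Returns the premise that is both uncertain and influential on the root.
    """
    return max(premises, key=lambda name: expected_root_shift(*premises[name]))
```

A premise that is nearly certain, or one whose resolution barely moves the root, scores low; only premises that are both uncertain and influential score high.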
This would be applicable to any specific domain you might want answers in, like math proofs or code problems, but you could also simply have the LLM run autonomously, posing itself random questions where it is uncertain about the answer even after a lot of self-play; and then you could, say, finetune it to predict which questions had a high uncertainty rating, and use that to continually keep asking itself new questions. So you would see a big bank of GPUs churning away, periodically asking the human raters very baffling, arbitrary-seeming, even absurd questions (‘Who would win in a fight, a box of nails or a bowl of jelly?’), but where your answer each time resolves a bunch of mysteries for the LLM and reduces its error rate on benchmarks, and where you can periodically finetune or retrain a much better LLM from scratch on the new improved (and highly proprietary?) dataset of text.
This can only give a bounded amount of improvement, but nothing in particular says that the bounds have to be low in practical terms. For a concrete example, LLMs are currently pretty bad at noticing and self-correcting when they make a reasoning error, but they are capable of the following:
Given an example of a valid chain of reasoning, come up with a description of a mistake that might be made during that chain of reasoning
Given a valid chain of reasoning and a description of a reasoning error, generate a chain of reasoning that exhibits that reasoning error. This gives training examples of “chains of reasoning with errors in them”.
Given an invalid chain of reasoning, determine where the first error occurred.
Given an invalid chain of reasoning and the location of the first error, attempt to recover from that error
Given an invalid chain of reasoning and a recovery attempt, and also given the correct chain of reasoning, determine whether the recovery attempt succeeded.
Given all of the above, determine whether the scenario makes a good example of recovering from a reasoning error and should thus be included in the next training set.
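The six steps above can be sketched as a single pass that produces one candidate training example. Everything here is hypothetical scaffolding: `llm` is a dictionary of callables standing in for six separate prompts to a model, and the key names and the example's fields are my own.

```python
def build_recovery_example(valid_chain, llm):
    """Run the six steps once and return a training example, or None if rejected."""
    # Step 1: describe a mistake that could plausibly occur in this chain.
    mistake = llm["describe_mistake"](valid_chain)
    # Step 2: generate a chain of reasoning that exhibits that mistake.
    flawed_chain = llm["inject_error"](valid_chain, mistake)
    # Step 3: determine where the first error occurred.
    error_index = llm["locate_first_error"](flawed_chain)
    # Step 4: attempt to recover from that error.
    recovery = llm["attempt_recovery"](flawed_chain, error_index)
    # Step 5: check the recovery against the known-correct chain.
    recovered = llm["judge_recovery"](flawed_chain, recovery, valid_chain)
    example = {
        "mistake": mistake,
        "flawed_chain": flawed_chain,
        "error_index": error_index,
        "recovery": recovery,
        "recovered": recovered,
    }
    # Step 6: decide whether the scenario belongs in the next training set.
    return example if llm["is_good_example"](example) else None
```

Running this over many valid chains and keeping the non-`None` results would give the next round's training set.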
I expect that this cycle would produce improved reasoning error recovery as long as recognizing a good output is easier than generating a good output. And I expect that would probably remain true for a while. Also I expect that something like this has already been done, especially since it rhymes with constitutional AI.
Obviously this doesn’t work “from scratch”: you need enough training for the model to be able to distinguish good outputs from bad outputs and also to produce good outputs on its own at least occasionally. We’re not going to get a ChatGPT-Zero. But I think this post does gesture in the general direction of something real.
The gains from such approaches are real and part of why LLMs work so well now.
However, the problem is that the gains from self-distillation or finetuning always top out quickly thus far. You can’t train more than 2 or 3 iterations before it stops working. There is something missing compared to self-play successes like TD-Gammon or AlphaZero. There cannot be any ChatGPT-Zero as currently constituted, because you’d run a few iterations and then it’d either stop progressing or collapse in some way as the LLM centipede eats its own increasingly degenerate outputs. Pretty soon, you do stop recognizing ‘better’ outputs because you were just trained to generate only better outputs! (Where does any additional ‘betterness’ or ‘betterness recognition’ come from? ‘Sorry, y’all look the same to me.’) RLHF or self-distillation are more about specialization than they are about increasing capability: they increase the prior of pre-existing outputs, nothing less, nothing more.
The search for LLMs is not great. It’s analogous to doing runtime search in a Go/chess model: you get a big boost in Elo from searching even 1 or 2 ply (especially in avoiding blunders), but then you run into fast diminishing returns, and your search doesn’t feed back into the original model to improve it (outside training). But I think that, beyond some highly abstract niches like math theorem proving (which pose different challenges), the main missing part is the active selection of new data for LLMs, which is implicit in games where your ‘new data’ is just part of search (of the game tree).
While I do think the process you outlined in your post is more concrete and would probably work better and be easier than learning “from scratch”, I don’t think it’s completely obvious that something like this wouldn’t work from scratch. It was done for humans, albeit through billions of years of genetic evolution and thousands of years of cultural evolution. Something like ChatGPT-Zero would probably require many more orders of magnitude of compute than systems we are training today, and also some algorithmic/architectural improvements, but I don’t think it’s completely impossible.
I feel like your post is implying something similar, given the last sentence, so maybe I’m misinterpreting what exactly you’re saying won’t work.
The specific thing I think wouldn’t work is trying to start the process without a bunch of pretraining data for at least the initial judge (i.e. pure self play from a randomized initialization with no human-generated data or judgments entering the training run at any point). Not super insightful I know, just addressing what I meant by “zero” in my hypothetical ChatGPT-Zero.
Thanks for clarifying! I do agree that that wouldn’t work, at least if we wanted what was produced to be in any way useful or meaningful to humans.
Thanks for the feedback!
I was thinking of the gibberish level of text generated by uniformly sampling from the tokenizer. I had imagined there would be a huge difference between the gibberish level of macaronic attacks and completely random sampling from the tokenizer, but here are the first three examples I generated of 10 tokens uniformly sampled from GPT-2’s tokenizer:
“ournament annually amused charismaling Superintendent sushi WiiRONMeat”
“ doub arrestAPIogenous ts considersterm Hitler slip autom”
“AAF disposal catches smells interrogation Pilot muscular feminine ITV spree”
These are a lot more intelligible than I would have imagined. I can even reasonably rank these: 3 > 1 > 2.
I also asked ChatGPT-3.5 to rank these, and it ranked them: 3 > 2 > 1.
I used the prompt “Can you rank these three outputs by the coherence of the English language?”. The first time I asked, GPT refused to answer because all three are incoherent. I then told it to “Rank them in terms of how close they are to being coherent. They don’t have to be completely coherent to be ranked.” It then gave me the rankings above.
I repeated this twice more, changing the order of the examples in case it was making decisions based on the numbering. I used the prompt “Can you rank these three outputs by the coherence of the English language? They don’t have to be completely coherent to be ranked.” For both of these, GPT gave the ranking: 3 > 1 > 2 (numbers changed to match the ones I used in this post).
Following from what @faul_sname mentioned in their post about improvement being possible “as long as recognizing a good output is easier than generating a good output”, I think that improvement is possible from amortizing compute in the form of search. If the teacher model can differentiate between coherent and incoherent paths down the search tree of language, I think a reward model could be trained to predict the coherence of student model outputs and this reward model could be used as the training signal. I am unsure about where the reward model would be initialized from… the teacher model, random initialization, or something else entirely.
I do agree with your point that this will most likely lead to the student model exploiting the teacher model rather than robustly learning language. The “branching factor” (i.e. vocabulary size) of GPT-2 is roughly 50,000 (50,257 tokens). I imagine the student is many times more likely to explore its way into an observation (token) history that successfully tricks the teacher model than to stumble into a robust understanding of language. There are probably ways to mitigate this, similar to precautions taken so RLHF models don’t stray too far from the base model.
As for acquiring more data, I think the teacher model could be used to produce “new” data. This is done for Whisper-V3, which was trained on 80% data produced by Whisper-V2. How the teacher LLM generates what it knows is modulated by the temperature. It is trained with a temperature of 1, so generating data with a different temperature (and maybe a less strict top-p) could be seen as generating data on a (slightly) different distribution. Training on this new data could lead to new generation patterns without learning any new facts or knowledge.
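To make the temperature and top-p knobs concrete, here is a minimal stdlib-only sketch of how they reshape a next-token distribution before sampling. The function names are illustrative and not taken from any particular library.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; temperature > 1 flattens, < 1 sharpens."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=1.0):
    """Keep the smallest set of most-likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered

def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample one token index after applying temperature scaling and top-p filtering."""
    probs = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

With `temperature=1.0` and `top_p=1.0` this samples from the model's unmodified distribution; any other setting samples from a reshaped one, which is the sense in which the generated data comes from a (slightly) different distribution.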
None of this would allow the student model to gain knowledge the teacher model does not have, but I think it could allow the student model to more easily access this knowledge. I view this as the model learning to compress the observation (token) history required to approximate some hidden state. A student model that can “reach” a hidden state in 64 tokens is more powerful than one that requires 256 tokens to “reach” the same hidden state.
Will take a look at this, thank you.
This and the process @faul_sname outlined in their comment do seem like more concrete methods for eliciting knowledge from compute. Reasoning and math chains can be proven as correct or incorrect, in the same way that Go games can be won or lost, while language is much more subjective.
Something like this is what I imagined initially for the student model’s search over random token space. If someone highly intelligent (e.g. Von Neumann) could rank every output from the model in terms of coherence, I imagine it would result in a model more competent than current LLMs (at least in whatever domains Von Neumann was competent in). Obviously this is impossible, and getting enough humans of any intelligence level to provide feedback for this process would be just as impossible. This is why I fell back to relying on AI feedback for the process. This paper shows that RLAIF performs on par with or better than RLHF, although I imagine RLAIF is less robust and more vulnerable to exploitation, as you mentioned. And this result is highly dependent on the domain and which human is giving the feedback.
I’m not surprised BPEs are semi-coherent. As I said, dark knowledge, and anyway, BPEs are a compression algorithm (compression=intelligence) which were trained on a large English text corpus, so them not being random linenoise is no more surprising than n-grams or gzip being able to generate English-y text.
But Whisper-V2 is processing real data still, so it’s a mix of learning from data (the Whisper models haven’t extracted all possible knowledge from the first pass through the data) and amortizing compute (the training+runtime compute of the Whisper-V2 is being distilled into cleaner pseudo-data for Whisper-V3 to train faster on). You would not generate freeform gibberish, unanchored in any real audio or text, from Whisper-V3 to train V4 and then V5 and then V6 and then V7, and expect V7 to be wildly better.
This knowledge distillation of inner-monologue can be, and has been, done directly, so detouring through a from-scratch RLAIF-ish approach would seem to offer a lot of complexity and downsides compared to just the obvious direct thing.
It is also just that there is a world outside language, while there is much less of an outside for logic, math, or Go. That’s why it’s useful to take a broader Bayesian view, so you can have an argument tree which is statistical/decision-theoretic and can do things like request empirical data. The LLM could insert arbitrary hypotheticals into the tree like “if we administer drug Y to cancer patients with Z, survival rates would be +10%”, and this can be tested in the real world (or just given an expert’s best guess; it doesn’t have to actually be real to keep the search & self-improvement going). Note that it could also be framed in terms of raw data: MCTS and other tree approaches can be made to work on continuous/infinite observation & action spaces, as they are iterative anytime algorithms and don’t need to expand all possible nodes.
I had this intuition for n-grams (natively) and gzip (from this paper). Never really considered how much BPE compresses the token space, not sure why.
This makes sense. It made me wonder whether there’d be some way to chain learning between modalities for a multimodal model, but it would probably fall into the same pit: beyond the initial data, the change in modality would still be producing and learning from synthetic data, not real data as is the case for Whisper.
I do agree that distilling inner monologue is easier than learning the same thing from scratch. I don’t think this RLAIF-from-scratch is the end-all-be-all of what’s gonna work; I find it a useful frame of thinking for considering other approaches that could work better for learning language more from scratch.
For example, this discussion with you popped the idea of using GANs into my head, which it turns out has been tried extensively. Not to the same scale as next token prediction though. DeepMind has this paper on using a GAN with LSTMs for the generator and discriminator to learn language “from scratch”. This survey paper presents other papers using GANs for text generation. Some highlights from quickly skimming through it: 1, 2, 3, 4.
This paper says (paraphrasing the abstract) that GANs are overkill for NLP since minimizing distinguishability (between generator and real outputs) can be seen as maximizing likelihood for NNs with a softmax output layer. I think that being able to define more complex loss functions with GANs is one benefit. You could use multiple discriminators: one for the pre-training data, one for a helpfulness data set, one for a harmlessness data set, etc.
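A minimal sketch of what the multiple-discriminator idea could look like on the generator side, assuming each discriminator outputs the probability that a sample looks like it came from its data set. The weighted sum of non-saturating GAN losses is just one illustrative way to combine them, and the discriminator names are hypothetical.

```python
import math

def multi_discriminator_generator_loss(scores, weights=None, eps=1e-12):
    """Weighted sum of non-saturating generator losses, -log D_k(x).

    scores:  dict mapping a discriminator name to D_k(x) in [0, 1], the
             probability that the generated sample x looks "real" to it.
    weights: optional dict of per-discriminator weights (default 1.0 each).
    eps:     floor that avoids log(0).
    """
    weights = weights or {}
    loss = 0.0
    for name, p in scores.items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"score for {name!r} must be a probability")
        loss += weights.get(name, 1.0) * -math.log(max(p, eps))
    return loss
```

For example, `multi_discriminator_generator_loss({"pretraining": 0.9, "helpfulness": 0.4, "harmlessness": 0.7}, weights={"harmlessness": 2.0})` would weight the harmlessness signal twice as heavily as the others.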
Kind of as an aside, this paper connects GANs to inverse RL (e.g. learning a reward model from human feedback data), and to energy-based models (where Yann LeCun seems to think the future of self-supervised learning is going).
Good point. Maybe what I’m thinking of will only become possible once language models are more grounded in the real world. Multimodality and robotics are steps in that direction. We’re probably at least a few years from robots collecting enough of their own data in the real world though.
Yeah, GANs for sequences are one of those ideas that people kept trying and it never worked. It wasn’t entirely clear why; I suspect that much of it was simply that, due to the inefficiency of RL and the very very smolness of all the GAN sequence work back then*, it was all dead on arrival. (I never really bought the “it’s just equivalent to likelihood” argument. GANs always seemed to operate in images in a very qualitatively distinct way from all likelihood-based approaches; and if you look at things abstractly enough, you can make anything equivalent to anything like that.) It’s possible that retrying today with proper scale might work, same way that image GANs now work at scale (despite being left for dead by contemporary researchers who had failed to note that BigGAN scaled just fine to JFT-300M).
But my real suspicion is that direct generative learning is too efficient, so the proper role for GANs would be as an additional phase of training, to sharpen a standard LLM.
AFAIK, this has not been done except inasmuch as you interpret the various preference-learning approaches as actor-critic RL (which means you can also further interpret them as GANs). Given how well diffusion models can be tuned by a simple adversarial loss into a GAN-like single-step Generator, I suspect that some adversarial training of LLMs might be quite useful. I should poke around on arXiv and see if anyone’s tried that yet...
* LSTM RNNs, or heck, GPTs, wouldn’t look all that impressive if they were trained with similar compute/data as those sequence GAN papers were