The poll appears to be asking two opposite questions. I'm not clear on whether answering 99% means it will be a transformer or whether it means something else is needed to get there.
Thank you. I was completely missing that they used a second ‘preference’ model to score outputs for the RL. I’m surprised that works!
A lot of team or cooperative games where communication is disallowed and information is limited have aspects of Schelling points. Hanabi is a cooperative card game that encourages using Schelling points. Though higher levels of play require players to establish ahead of time a set of rules for what each possible action is meant to communicate, which rather diminishes that aspect of the game. Arguably bridge is in a similar position with partners communicating via bidding.
Is there a primer on the difference between training LLMs and doing RLHF on those LLMs post-training? They both seem fundamentally to be doing the same thing: move the weights in the direction that increases the likelihood that they output the given text. But I gather that there are some fundamental differences in how this is done, and RLHF isn't quite just a second training round done on hand-curated datapoints.
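For what it's worth, here is the toy picture I currently have in my head of the contrast. Everything in this sketch is made up for illustration (the one-token-context "model", the random stand-in for a preference score, the bare REINFORCE-style update); my understanding is that real RLHF uses a learned preference/reward model and something like PPO with a KL penalty back to the pretrained model.

```python
# Hedged toy sketch, not real RLHF: contrasts a supervised next-token update
# with a reward-weighted update on text the model sampled itself.
import torch
import torch.nn.functional as F

vocab_size, d = 100, 32
# Toy stand-in for an LLM: predicts next-token logits from the current token only.
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, d),
    torch.nn.Linear(d, vocab_size),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# --- Pretraining step: supervised next-token prediction ---
# The gradient pushes up the log-probability of the token that actually came next.
tokens = torch.randint(0, vocab_size, (16,))   # stand-in for a chunk of real text
logits = model(tokens[:-1])
loss = F.cross_entropy(logits, tokens[1:])
opt.zero_grad()
loss.backward()
opt.step()

# --- RLHF-style step (plain REINFORCE flavour) ---
# No "correct" next token is given. We sample a continuation, score the whole
# thing with a (here: fake, random) preference score, and push up the
# log-probability of the sampled tokens in proportion to that score.
tok = torch.randint(0, vocab_size, (1,))       # stand-in for a prompt
log_probs = []
for _ in range(8):
    dist = torch.distributions.Categorical(logits=model(tok))
    tok = dist.sample()
    log_probs.append(dist.log_prob(tok))

reward = torch.randn(())                        # placeholder for a preference model's score
pg_loss = -reward * torch.stack(log_probs).sum()
opt.zero_grad()
pg_loss.backward()
opt.step()
```

The structural difference, at least in this toy picture, is that the pretraining step is handed the target token directly, while the RLHF-style step only ever sees a scalar score for text the model itself sampled.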
Sounds plausible, but this article is evidence against the striatum hypothesis: "Region-specific Foxp2 deletions in cortex, striatum or cerebellum cannot explain vocalization deficits observed in spontaneous global knockouts"
In short, they edited mice to have Foxp2 deleted in only specific regions of the brain, one of them being the striatum. But those mice didn't have the 'speech' defects that mice with whole-body Foxp2 knockouts showed. So Foxp2's action outside of the striatum seems to play a role. They didn't do a striatum+cerebellum knockout, though, so it could still be those two jointly (but not individually) causing the problem.
I gave one example of the “work” this does: that GPT performs better when prompted to reason first rather than state the answer first. Another example is: https://www.lesswrong.com/posts/bwyKCQD7PFWKhELMr/by-default-gpts-think-in-plain-sight
On the contrary, you mainly seem to be claiming that thinking of LLMs as working one token at a time is misleading, but I'm not sure I've seen any examples of misleading conclusions that you think people draw from it. Where do you think people go wrong?
Suppose I write the first half of a very GPT-esque story. If I then ask GPT to complete that story, won’t it do exactly the same structure as always? If so, how can you say that came from a plan—it didn’t write the first half of the story! That’s just what stories look like. Is that more surprising than a token predictor getting basic sentence structure correct?
For hidden thoughts, I think this is very well defined. It won't be truly 'hidden', since we can examine every node in GPT, but we know for a fact that GPT is purely a function of the current stream of tokens (unless I am quite mistaken!). A hidden plan would look like some other state that GPT carries from token to token that is not output. I don't think OpenAI engineers would have a hard time making such a model, and it may then really have a global plan that travels from one token to the next (or not; it would be hard to say). But how could GPT? It has nowhere to put the plan except for plain sight.

Or: does AlphaGo have a plan? It explicitly considers future moves, but it does just as well if you give it a Go board in a particular state X as it would if it played a game that happened to reach state X. If there is a 'plan' that it made, it wrote that plan on the board and nothing is hidden. I think it's more helpful and accurate to describe AlphaGo as "only" picking the best next move rather than planning ahead—but doing a good enough job of picking the best next move means you pick moves that have good follow-up moves.
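A quick sketch of what I mean by "nowhere to put the plan", with a hypothetical `model` function standing in for GPT (not real code for any actual implementation):

```python
# Sketch: the only thing carried from step to step is the visible token list.
# `model` is a hypothetical stand-in that maps a token list to a next token.
def generate(model, prompt_tokens, n_steps):
    tokens = list(prompt_tokens)
    for _ in range(n_steps):
        # The next token is a pure function of `tokens`. Any "plan" has to be
        # recoverable from the tokens themselves; no hidden variable survives
        # to the next iteration.
        tokens.append(model(tokens))
    return tokens
```

A model with a genuinely hidden plan would instead look something like `next_token, hidden_state = model(tokens, hidden_state)`, with `hidden_state` never printed.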
Maybe I don’t understand what exactly your point is, but I’m not convinced. AFAIK, it’s true that GPT has no state outside of the list of tokens so far. Contrast to your jazz example, where you, in fact, have hidden thoughts outside of the notes played so-far. I think this is what Wolfram and others are saying when they say that “GPT predicts the next token”. You highlight “it doesn’t have a global plan about what’s going to happen” but I think a key point is that whatever plan it has, it has to build it up entirely from “Once upon a” and then again, from scratch, at “Once upon a time,” and again and again. Whatever plan it makes is derived entirely from “Once upon a time,” and could well change dramatically at “Once upon a time, a” even if ” a” was its predicted token. That’s very different from what we think of as a global plan that a human writing a story makes.
The intuition of "just predicting one token ahead" yields useful explanations, like why the strategy of having it explain itself first and then give the answer works. I don't see how this post fits with that observation or what other observations it clarifies.
If you choose heads, you either win $2 (i.e., win $1 twice) or lose $1. If you choose tails, then you either win $1 or lose $2. It's exactly the same as the Sleeping Beauty problem with betting, except that you have to precommit to a choice of heads/tails ahead of time. Sorry that this situation is weird to describe and unclear.
Yes, exactly. You choose either heads or tails. I flip the coin. If it's tails and matches what you chose, then you win $1; otherwise you lose $1. If it's heads and matches what you chose, you win $2; otherwise you lose $2. Clearly you will choose heads in this case, just like Sleeping Beauty betting every time she wakes up. But you choose heads because we've increased the payout, not the probabilities.
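Spelling out the expected values with a fair coin (so P(heads) = P(tails) = 1/2 in both cases):

$$E[\text{choose heads}] = \tfrac{1}{2}(+\$2) + \tfrac{1}{2}(-\$1) = +\$0.50, \qquad E[\text{choose tails}] = \tfrac{1}{2}(+\$1) + \tfrac{1}{2}(-\$2) = -\$0.50.$$

The asymmetry is entirely in the payouts, not in the probability of heads.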
And here are examples that I don't think rephrasing as betting resolves:
Convinced by the Sleeping Beauty problem, you buy a lottery ticket and set up a robot to put you to sleep and then, if the lottery ticket wins, wake you up 1 billion times, and if not, wake you up just once. You wake up. What is the expected value of the lottery ticket you're holding? You knew ahead of time that you would wake up at least once, so did you just game the system? No, since I would argue that this system is better modeled by the Sleeping Beauty problem where you get only a single payout regardless of how many times you wake up.
Or: if the coin comes up heads, then you and your memories get cloned. When you wake up, you're offered a deal on the spot: a 1:1 bet on the coin. Is this a good bet for you? (Your wallet gets cloned too, let's say.) That depends on how you value your clone receiving money. But why should P(H|awake) be different in this scenario than in Sleeping Beauty, or different between people who do value their clone's winnings and people who do not?
Or: no Sleeping Beauty shenanigans. I just say, "Let's make a bet. I'll flip a coin. If the coin comes up heads, we'll execute the bet twice. If tails, just once. What odds do you offer me?" Isn't that all you are saying in this Sleeping Beauty with betting scenario? The expected value of a bet is the product of the payoff and the probability—the payoff is twice as high in the case of heads, so why should I think that the probability is also twice as high?
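To make that concrete: say the bet pays me W per execution if the coin comes up heads and costs me L per execution if it comes up tails (W and L are just labels for this illustration). Then my expected value is

$$E = \tfrac{1}{2}\cdot 2W - \tfrac{1}{2}\cdot 1\cdot L,$$

which breaks even at L = 2W. The factor of 2 enters through the number of executions; P(heads) stays at 1/2 throughout.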
I argue that this is the very question of the problem: is being right twice worth twice as much?
You're right that my construction was bad. But the number of bets does matter. Suppose instead that we're both undergoing this experiment (with the same coin flip simultaneously controlling both of us). We both wake up and I say, "After this is over, I'll pay you 1:1 if the coin was heads." Is this deal favorable, and do you accept? You'd first want to clarify how many times I'm going to pay out if we have this conversation two days in a row. (Does promising the same deal twice mean we just reaffirmed a single deal, or that we agreed to two separate, identical deals? It's ambiguous!) But which one is the correct model of the system? I don't think that's resolved.
I do think phrasing it in terms of bets is useful: nobody disagrees on how you should bet once we've specified exactly how the betting happens, which makes this a much less concerning problem. But I don't think that specifying the betting makes it obvious how to resolve the original question absent betting.
That assumes that the bet is offered to you every time you wake up, even when you wake up twice. If you make the opposite assumption (you are offered the bet only on the last time you wake up), then the odds change. So I see this as a subtle form of begging the question.
Your link to Lynch and Marinov is currently incorrect. However, I also don't understand whether what they say matches your post:
the energetic burden of a gene is typically no greater, and generally becomes progressively smaller, in larger cells in both bacteria and eukaryotes, and this is true for costs measured at the DNA, RNA, and protein levels. These results eliminate the need to invoke an energetics barrier to genome complexity. … These results indicate that the origin of the mitochondrion was not a prerequisite for genome-size expansion.
So that example is of L; what is the f for it? Obviously, there are multiple f that could give that L (depending on how the loss is computed from f), with some of them having symmetries and some of them not. That's why I find the discussion so confusing: we really only care about symmetries of f (which give type B behavior) but instead are talking about symmetries of L (which may indicate either type A or type B) without really distinguishing the two. (Unless my example in the previous post shows that it's a false dichotomy and type A can simulate type B at a singularity.)
I'm also not sure the example matches the plots you've drawn: presumably the parameters of the model are what should vary in the plots, but they seem to show it varying x for fixed parameters? Treating it as written, there's not actually a singularity in its parameters.
Are you bringing up wireheading to answer yes or no to my question (of whether RL is more prone to gradient hacking)? To me, it sounds like you’re suggesting a no, but I think it’s in support of the idea that RL might be prone to gradient hacking. The AI, like me, avoids wireheading itself and so will never be modified by gradient descent towards wireheading because gradient descent doesn’t know anything about wireheading until it’s been tried. So that is an example of gradient hacking itself, isn’t it? Unlike in a supervised learning setup where the gradient descent ‘knows’ about all possible options and will modify any subagents that avoid giving the right answer.
So am I a gradient hacker whenever I just say no to drugs?
I’m still thinking about this (unsuccessfully). Maybe my missing piece is that the examples I’m considering here still do not have any of the singularities that this topic focuses on! What are the simplest examples with singularities? Say again we’re fitting
y = f(x)
for f over some parameters. And specifically let's consider the points (0,0) and (1,0) as our only training data. Then f(x) = a + bcx has minimal loss set {a = 0, bc = 0}, i.e. two lines meeting at the origin. That has a singularity at (0,0,0). I don't really see why it would generalize better than f(x) = a + bx or f(x) = a + (b+c)x, neither of which have singularities in their minimal loss sets. These still are only examples of the type B behavior where they already are effectively just two parameters, so maybe there's no further improvement for a singularity to give?
Consider instead f(x) = a + bx + cdx^2. Here the minimal loss set has a singularity at (0,0,0,0). But maybe now if we're at that point, the model has effectively reduced down to f(x) = a + bx, since perturbing either c or d away from zero would still keep the last term zero. So maybe this is a case where f(x) = a + bx + cdx^2 has type A behavior in general (since the x^2 term can throw off generalizability compared to a linear model) but approximates type B behavior near the singularity (since the x^2 term becomes negligible even if perturbed)? That seems to be the best picture of this argument that I've been able to convince myself of so far! Singularities are (sometimes) points where type A behavior becomes type B behavior.
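A quick symbolic check of that picture, with squared error on the two training points as the (assumed) loss:

```python
# Sanity check of the example above: f(x) = a + b*x + c*d*x**2 fit to (0,0) and (1,0).
import sympy as sp

a, b, c, d, x = sp.symbols('a b c d x')
f = a + b*x + c*d*x**2
L = f.subs(x, 0)**2 + f.subs(x, 1)**2      # squared-error loss on the two training points

# Hessian of the loss at the candidate singular point (0,0,0,0):
H = sp.hessian(L, (a, b, c, d)).subs({a: 0, b: 0, c: 0, d: 0})
print(H, H.rank())                         # rank 2: the c and d directions are flat to second order

# Restricted to the (c, d) plane the loss is quartic rather than quadratic,
# which is the sense in which this point is singular rather than an ordinary
# flat direction:
print(sp.expand(L.subs({a: 0, b: 0})))     # -> c**2*d**2
```

So to second order the model at that point really does look like the two-parameter a + bx, which seems to match the "type A approximating type B near the singularity" picture.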
And a follow-up that I just thought of: is reinforcement learning more prone to gradient hacking? For example, if a sub-agent guesses that a particular previously untried type of action would produce very high reward, the sub-agent might be able to direct the policy away from those actions. The learning process will never correct this behavior if the overall model never gets to learn that those actions are beneficial. Therefore the sub-agent can direct away from some classes of high-reward actions that it doesn’t like without being altered.
There's been discussion of 'gradient hacking' lately, such as here. What I'm still unsure about is whether a gradient hacker is just another word for a local minimum. It feels different, but when I try to put a finer definition on it, I can't. My best alternative is "local minimum, but malicious," but that seems odd since it depends on some moral character.
Ah, I didn’t understand what “first option” meant either.