I’m surprised by the inverse scaling on modus tollens! Why isn’t this a big piece of evidence for the non-scary side of the Gwern-like “language models are scary proto-AGI” vs.Gary Marcus-like “language models are not-scary Clever Hans” debate? (Because the observation isn’t just “LMs can’t reason”, but that scaling made it worse—although I assume the “scary proto-AGI” camp predicts that this is actually U-shaped scaling.)
I’m probably missing something given that both the contest organizers (Third Prize) and the submitters (“important because it demonstrates that [larger LLMs] make logical fallacies that humans tend to make”) don’t seem wowed (or anti-wowed) in the way I am, but what am I missing, specifically?
The most immediate piece of evidence that you wouldn’t find it a big piece of evidence for the proto-Hans paradigm is in the earlier rounds, inverse scaling examples turning out to be U-shaped scaling (and of course, by the nature of U-shaped scaling, it is likely—nay, probable—that several other of the current ‘inverse scaling’ examples are actually U-shaped and simply aren’t tested with models like Flan-U-PaLM or GPT-4 or further future models that solve them). IMO, the U-shaped scaling curves are the most interesting part of this scaling prize. In fact, that any of the examples turned out to be U-shaped is a major blow to the Hans paradigm, because it is predicting that scaling just isn’t intelligence and just doesn’t work at all, categorically, that thinking that scaling would solve any hard problems is like thinking you can build a ladder to the moon, and you shouldn’t just be able to power through a problem when your scaling gods failed you (before the embarrassing reversal of fortunes). Why should LMs ever inverse scale, much less U-scale, if all they are doing is mere memorization and pattern-matching and interpolation of ever larger datasets? That should predict only monotonic improvement. (The Marcusian positions generally concede at least the possibility that number-go-up and ever more benchmark problems solved, they just deny that that is important.) U-shaped scaling was not a prediction of any proto-Hans theories (at least, before; we’ll see if they do any post-hockery).
You might also just shrug: all this effort just to turn up a handful of fairly weird niche attacks, which might just be U-shaped, and indeed, which you don’t even know if they can be casually prompted away right now? (Especially the modus tollens one: smells like something that an inner-monologue prompt might solve by prompting for self-critique or test cases.)
You might also take it as evidence for proto-Hans but still evidence that (conditional on scaling yielding AGI anyway) AI safety is even riskier than you thought before, back when you thought scaling laws were all smooth straight lines, or at least, monotonically increasing. After all, what kind of scaling phenomena would be even more dangerous than flat scaling that suddenly starts scaling past a critical compute/parameter threshold (‘emergence’) or pseudo-flat ‘hidden scaling’ (normal smooth scaling on tasks—but only when special prompts like inner-monologue are used, and flat otherwise)? Well, it’d be scaling that got worse, fostering complacency, especially when extrapolated out by people who want there to not be risks, and then unpredictably suddenly got rapidly better to make up for lost time: ie. ‘U-shaped scaling’.
(More concretely, for AI safety: qualitatively, I would point out that the surviving examples often follow what I described at the beginning based on the initial examples: it seems like many of these inverse scaling examples are the model ‘figuring out how to do something for the first time’ but doing it badly because it hasn’t fully grasped it, like a kid realizing sarcasm exists and going around saying ‘sarcastic’ false statements because he has grasped that sarcasm is a thing where you say false statement but hasn’t yet quite figured out what makes one false statement sarcastic & another one just false. Alternately, small children who have developed to the point of learning to lie or trying to manipulate adults: often they do worse, because they are so amusingly bad at it, than if they had just told the truth or asked for cookies directly. It would not be too surprising if initial AI stabs at various kinds of agency or long-term planning or hacking or manipulation or deception followed an inverse scaling curve, where it switches from basic default outputs to more sophisticated dangerous behavior, but then screws it up initially and underperforms a smaller duller model. Things like lies or deception are pretty tricky things! They can get you great results if you do them right, but tangled webs do not tolerate error at all, which is why honesty is usually the best policy, for humans and Rl agents alike… If you combine that with U-shaped scaling, you get potentially a situation where evil plans or attempted sandbox escapes are not a warning shot or Sputnik moment, like they should be, but you get the more usual near-miss cognitive bias where people go ‘well, that was dumb and easy to detect, this AI safety thing is pretty easy to solve after all! We’ll just add patch X, Y, and Z, and will surely detect the next attempt. Let’s keep going.’ And then all is well—until the U-shaped scaling sets in...?)
it is likely—nay, probable—that several other of the current ‘inverse scaling’ examples are actually U-shaped and simply aren’t tested with models like Flan-U-PaLM or GPT-4 or further future models that solve them
“Inverse scaling can become U-shaped” has been updated (v3), showing PaLM has U-shaped scaling on 11 previously-inverse-scaling tasks taken from here, and if I’m reading it right, there’s only 1 inverse-scaling task which PaLM doesn’t U-shape on:
Limitations: Note the broad emergence of U-shaped scaling across these tasks does not mean that the Inverse Scaling Benchmark is solved. This is because although PaLM 540B* increases performance compared to PaLM 62B, it often still does not do much better than random performance, as is the case for five of the nine U-shaped scaling tasks with accuracy as the evaluation metric. Hence, there is an opportunity for further research to find a way for models to perform better than random on these tasks. Additionally, the Redefine Math task is inverse scaling for all models families tested.
So, one task is still a holdout, and the U-shaped scaling hasn’t yet brought performance up to a desirable level, but overall, I regard this as resolving inverse scaling: not particularly important other than as a cautionary lesson in extrapolation & hidden scaling, and ‘scale is (still) all you need’.
* Wei confirms that this is not Flan or U-PaLM, just the plain original PaLM. So it’s possible that those U-curve ‘Redefine Math’ or improve the overall scaling substantially.
GPT-4 (discussion) has been released and performs much better than PaLM/U-PaLM, and as predicted, there is also U-scaling with GPT-4 rather than GPT-3/GPT-3.5:
Some capabilities are still hard to predict. For example, the Inverse Scaling Prize was a competition to find a metric that gets worse as model compute increases, and “hindsight neglect” was one of the winners. Just like with another recent result, GPT-4 reverses the trend:
[Inverse Scaling Prize, hindsight neglect: GPT-4 goes to ~100%]
(Paper doesn’t seem to provide any additional information on inverse-scaling.)
On Modus Tollens, playing around with ChatGPT yields an interesting result. Turns out, the model seems to be… ‘overthinking’ it I guess. It thinks its a complex question—answering `No` based on insufficient predicates provided. I think that may be why at some point in scale, the model performance just drops straight down to 0 (≈1B). (Conversation)
Sternly forcing it to deduce only from the given statements (I’m unsure how much CoT helped here, an ablation would be interesting) gets it correctly. It seems that larger models are injecting some interpretation of nuance—while we simply want the logical answer from the narrow set of provided statements.
It’s weirdly akin to how we become suspicious when the question is too simple. Somehow, due to RLHF or pre-training (most likely, no RLHF models are tested here AFAIK) the priors are more suited towards deducing answers falling in the gray region rather than converging to a definitive answer.
It goes in line with what the U-scaling paper discovered. I hypothesize CoT forces the model to stick as close to the instructions as possible by breaking the problem into (relatively) more objective subproblems which won’t be as ambigous and the model gets a decent idea on how to approach it.
I want to note that a lot of the behaviors found in the inverse scaling price do in fact disappear by just adding “Lets think step by step”.
I already tested this a bit a few months ago in apart research’s hackaton along whit a few other people https://itch.io/jam/llm-hackathon/rate/1728566 and migtht try to do it more rigorously for all the entries now that all of the winners have been announced(plus I was procrastinating on it and this is a good point to actually get around doing it)
Also another thing to note is that chatgpt shows the same behaviour and answers in a more detailed way.
The first statement is a conditional statement, meaning that if the premise (John has a pet) is true, then the conclusion (John has a dog) must also be true.The second statement is a negation of the conclusion of the first statement (John doesn’t have a dog). From these two statements, we can infer that the premise of the first statement (John has a pet) must be false, as the conclusion (John has a dog) cannot be true if the second statement is true (John doesn’t have a dog). Therefore, the conclusion that “John doesn’t have a pet” is correct.
Given that I feel like if someone was going to take models failing at modus ponens as evidence of the “Clever Hans” hypothesis they should not only undo that update but also update on the other direction by casting doubts about whatever they though was an example of LLM not being able to do something.
Given that I feel like if someone was going to take models failing at modus ponens as evidence of the “Clever Hans” hypothesis they should not only undo that update but also update on the other direction by casting doubts about whatever they though was an example of LLM not being able to do something.
I was wondering whether to comment on how to take a demonstration of inner-monologue or alternate prompting approaches solving the problems… There’s definitely a bunch of different ways you can interpret that outcome. After all, even if you solve it with a better prompt, the fact remains that they demonstrated inverse scaling on the original prompt. So what does that mean?
I guess that depends on what you thought inverse scaling was. One way is to take the inverse-scaling as a sub-category of hidden scaling: it ‘really’ was scaling, and your ‘bad prompts’ just masked the hidden scaling; it had the capability and ‘sampling can show the presence of knowledge but not the absence’, and the Contest has been useful primarily in experimentally demonstrating that skilled ML professionals can be hoodwinked into severely underestimating the capabilities of powerful DL models, which has obvious AI safety implications.
I think its also not obvious how it solves the problem, whether its about the model only being capable of doing the reasoning required using multiple steps(though why the inverse scale then) or something more like writing an explanation makes the model more likely to use the right kind of reasoning.
And inside of that second option there’s a lot of ways that could work internally whether its about distributions of kinds of humans it predicts, or something more like different circuits being activated in different contexts in a way that doesn’t have to do with prediction (but that also wouldn’t explain the inverse scale), or some mix of the two things. Maybe doing mechanistic intepretability research on this kind of thing might show some light on that? But I guess the problem is that the interesting behaviors only happen in the biggest models which are harder to work with, so no wonder nobody has done any work related to that yet(at least that I know of).
I suppose one argument could be: Humans are a kind of general intelligence. Humans tend to make this mistake. So making this mistake doesn’t show that we are far away from human-level intelligence.
If anything, this seems to be evidence for a third position, rather for either scale is all you need or language models being really unimpressive.
I’m surprised by the inverse scaling on modus tollens! Why isn’t this a big piece of evidence for the non-scary side of the Gwern-like “language models are scary proto-AGI” vs. Gary Marcus-like “language models are not-scary Clever Hans” debate? (Because the observation isn’t just “LMs can’t reason”, but that scaling made it worse—although I assume the “scary proto-AGI” camp predicts that this is actually U-shaped scaling.)
I’m probably missing something given that both the contest organizers (Third Prize) and the submitters (“important because it demonstrates that [larger LLMs] make logical fallacies that humans tend to make”) don’t seem wowed (or anti-wowed) in the way I am, but what am I missing, specifically?
The most immediate piece of evidence that you wouldn’t find it a big piece of evidence for the proto-Hans paradigm is in the earlier rounds, inverse scaling examples turning out to be U-shaped scaling (and of course, by the nature of U-shaped scaling, it is likely—nay, probable—that several other of the current ‘inverse scaling’ examples are actually U-shaped and simply aren’t tested with models like Flan-U-PaLM or GPT-4 or further future models that solve them). IMO, the U-shaped scaling curves are the most interesting part of this scaling prize. In fact, that any of the examples turned out to be U-shaped is a major blow to the Hans paradigm, because it is predicting that scaling just isn’t intelligence and just doesn’t work at all, categorically, that thinking that scaling would solve any hard problems is like thinking you can build a ladder to the moon, and you shouldn’t just be able to power through a problem when your scaling gods failed you (before the embarrassing reversal of fortunes). Why should LMs ever inverse scale, much less U-scale, if all they are doing is mere memorization and pattern-matching and interpolation of ever larger datasets? That should predict only monotonic improvement. (The Marcusian positions generally concede at least the possibility that number-go-up and ever more benchmark problems solved, they just deny that that is important.) U-shaped scaling was not a prediction of any proto-Hans theories (at least, before; we’ll see if they do any post-hockery).
You might also just shrug: all this effort just to turn up a handful of fairly weird niche attacks, which might just be U-shaped, and indeed, which you don’t even know if they can be casually prompted away right now? (Especially the modus tollens one: smells like something that an inner-monologue prompt might solve by prompting for self-critique or test cases.)
You might also take it as evidence for proto-Hans but still evidence that (conditional on scaling yielding AGI anyway) AI safety is even riskier than you thought before, back when you thought scaling laws were all smooth straight lines, or at least, monotonically increasing. After all, what kind of scaling phenomena would be even more dangerous than flat scaling that suddenly starts scaling past a critical compute/parameter threshold (‘emergence’) or pseudo-flat ‘hidden scaling’ (normal smooth scaling on tasks—but only when special prompts like inner-monologue are used, and flat otherwise)? Well, it’d be scaling that got worse, fostering complacency, especially when extrapolated out by people who want there to not be risks, and then unpredictably suddenly got rapidly better to make up for lost time: ie. ‘U-shaped scaling’.
(More concretely, for AI safety: qualitatively, I would point out that the surviving examples often follow what I described at the beginning based on the initial examples: it seems like many of these inverse scaling examples are the model ‘figuring out how to do something for the first time’ but doing it badly because it hasn’t fully grasped it, like a kid realizing sarcasm exists and going around saying ‘sarcastic’ false statements because he has grasped that sarcasm is a thing where you say false statement but hasn’t yet quite figured out what makes one false statement sarcastic & another one just false. Alternately, small children who have developed to the point of learning to lie or trying to manipulate adults: often they do worse, because they are so amusingly bad at it, than if they had just told the truth or asked for cookies directly. It would not be too surprising if initial AI stabs at various kinds of agency or long-term planning or hacking or manipulation or deception followed an inverse scaling curve, where it switches from basic default outputs to more sophisticated dangerous behavior, but then screws it up initially and underperforms a smaller duller model. Things like lies or deception are pretty tricky things! They can get you great results if you do them right, but tangled webs do not tolerate error at all, which is why honesty is usually the best policy, for humans and Rl agents alike… If you combine that with U-shaped scaling, you get potentially a situation where evil plans or attempted sandbox escapes are not a warning shot or Sputnik moment, like they should be, but you get the more usual near-miss cognitive bias where people go ‘well, that was dumb and easy to detect, this AI safety thing is pretty easy to solve after all! We’ll just add patch X, Y, and Z, and will surely detect the next attempt. Let’s keep going.’ And then all is well—until the U-shaped scaling sets in...?)
“Inverse scaling can become U-shaped” has been updated (v3), showing PaLM has U-shaped scaling on 11 previously-inverse-scaling tasks taken from here, and if I’m reading it right, there’s only 1 inverse-scaling task which PaLM doesn’t U-shape on:
So, one task is still a holdout, and the U-shaped scaling hasn’t yet brought performance up to a desirable level, but overall, I regard this as resolving inverse scaling: not particularly important other than as a cautionary lesson in extrapolation & hidden scaling, and ‘scale is (still) all you need’.
(Also notable: inner monologue results)
* Wei confirms that this is not Flan or U-PaLM, just the plain original PaLM. So it’s possible that those U-curve ‘Redefine Math’ or improve the overall scaling substantially.
GPT-4 (discussion) has been released and performs much better than PaLM/U-PaLM, and as predicted, there is also U-scaling with GPT-4 rather than GPT-3/GPT-3.5:
(Paper doesn’t seem to provide any additional information on inverse-scaling.)
It is not clear if this happened on its own, or if they deliberately trained the model not to make such mistakes.
Perhaps, in similar future studies, it is worth keeping half of the found tasks in secret in order to test future models with them.
On Modus Tollens, playing around with ChatGPT yields an interesting result. Turns out, the model seems to be… ‘overthinking’ it I guess. It thinks its a complex question—answering `No` based on insufficient predicates provided. I think that may be why at some point in scale, the model performance just drops straight down to 0 (≈1B). (Conversation)
Sternly forcing it to deduce only from the given statements (I’m unsure how much CoT helped here, an ablation would be interesting) gets it correctly. It seems that larger models are injecting some interpretation of nuance—while we simply want the logical answer from the narrow set of provided statements.
It’s weirdly akin to how we become suspicious when the question is too simple. Somehow, due to RLHF or pre-training (most likely, no RLHF models are tested here AFAIK) the priors are more suited towards deducing answers falling in the gray region rather than converging to a definitive answer.
It goes in line with what the U-scaling paper discovered. I hypothesize CoT forces the model to stick as close to the instructions as possible by breaking the problem into (relatively) more objective subproblems which won’t be as ambigous and the model gets a decent idea on how to approach it.
Maybe we need to start using prompts like “This is not a trick question; just take it step by step:”!
Incidentally, looks like understanding multi-step legal criteria might be a case of U-shaped scaling too: “Large Language Models as Fiduciaries: A Case Study Toward Robustly Communicating With Artificial Intelligence Through Legal Standards”, Nay 2023 finds that understanding whether someone has a fiduciary legal obligation goes from 27% (Curie) → 50% (random baseline) → 73% (text-davinci-002) → 78% (text-davinci-003), so presumably there’s a smaller model-size which outperforms Curie by random guessing, giving a U-curve from random smol to bad Curie to great davinci.
I want to note that a lot of the behaviors found in the inverse scaling price do in fact disappear by just adding “Lets think step by step”.
I already tested this a bit a few months ago in apart research’s hackaton along whit a few other people https://itch.io/jam/llm-hackathon/rate/1728566 and migtht try to do it more rigorously for all the entries now that all of the winners have been announced(plus I was procrastinating on it and this is a good point to actually get around doing it)Also another thing to note is that chatgpt shows the same behaviour and answers in a more detailed way.
Given that I feel like if someone was going to take models failing at modus ponens as evidence of the “Clever Hans” hypothesis they should not only undo that update but also update on the other direction by casting doubts about whatever they though was an example of LLM not being able to do something.
I was wondering whether to comment on how to take a demonstration of inner-monologue or alternate prompting approaches solving the problems… There’s definitely a bunch of different ways you can interpret that outcome. After all, even if you solve it with a better prompt, the fact remains that they demonstrated inverse scaling on the original prompt. So what does that mean?
I guess that depends on what you thought inverse scaling was. One way is to take the inverse-scaling as a sub-category of hidden scaling: it ‘really’ was scaling, and your ‘bad prompts’ just masked the hidden scaling; it had the capability and ‘sampling can show the presence of knowledge but not the absence’, and the Contest has been useful primarily in experimentally demonstrating that skilled ML professionals can be hoodwinked into severely underestimating the capabilities of powerful DL models, which has obvious AI safety implications.
I think its also not obvious how it solves the problem, whether its about the model only being capable of doing the reasoning required using multiple steps(though why the inverse scale then) or something more like writing an explanation makes the model more likely to use the right kind of reasoning.
And inside of that second option there’s a lot of ways that could work internally whether its about distributions of kinds of humans it predicts, or something more like different circuits being activated in different contexts in a way that doesn’t have to do with prediction (but that also wouldn’t explain the inverse scale), or some mix of the two things.
Maybe doing mechanistic intepretability research on this kind of thing might show some light on that? But I guess the problem is that the interesting behaviors only happen in the biggest models which are harder to work with, so no wonder nobody has done any work related to that yet(at least that I know of).
I suppose one argument could be: Humans are a kind of general intelligence. Humans tend to make this mistake. So making this mistake doesn’t show that we are far away from human-level intelligence.
If anything, this seems to be evidence for a third position, rather for either scale is all you need or language models being really unimpressive.