On Modus Tollens, playing around with ChatGPT yields an interesting result. It turns out the model seems to be… ‘overthinking’ it, I guess. It treats it as a complex question, answering `No` on the grounds that the predicates provided are insufficient. I think that may be why, at some point in scale (≈1B), model performance just drops straight down to 0. (Conversation)

Sternly forcing it to deduce only from the given statements (I’m unsure how much CoT helped here; an ablation would be interesting) gets it to answer correctly. It seems that larger models are injecting some interpretation of nuance, while we simply want the logical answer from the narrow set of provided statements.

It’s weirdly akin to how we become suspicious when a question is too simple. Somehow, due to RLHF or pre-training (most likely the latter, since AFAIK no RLHF’d models are tested here), the priors are more suited to deducing answers that fall in the gray region rather than converging on a definitive answer.

This is in line with what the U-shaped scaling paper found. I hypothesize that CoT forces the model to stick as close to the instructions as possible by breaking the problem into (relatively) more objective subproblems, which aren’t as ambiguous, so the model gets a decent idea of how to approach each one.
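For what it’s worth, here is a minimal sketch of the ablation I have in mind; the prompts are my own paraphrase of the pet/dog item rather than the task’s exact wording, and the model name is just a stand-in for whatever ChatGPT is currently serving:

```python
# Hedged sketch: compare a plain prompt, a "deduce only from the given statements"
# instruction, and a zero-shot CoT suffix on one Modus Tollens item.
# Assumes the `openai` Python client and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

ITEM = (
    "If John has a pet, then John has a dog.\n"
    "John doesn't have a dog.\n"
    "Conclusion: John doesn't have a pet.\n"
    "Is the conclusion correct? Answer Yes or No."
)

PROMPTS = {
    "plain": ITEM,
    "strict": "Deduce only from the given statements and nothing else.\n" + ITEM,
    "cot": ITEM + "\nLet's think step by step.",
}

for name, prompt in PROMPTS.items():
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder for the ChatGPT model discussed above
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the comparison as deterministic as possible
    )
    print(f"--- {name} ---")
    print(reply.choices[0].message.content)
```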
I want to note that a lot of the behaviors found in the Inverse Scaling Prize do in fact disappear just by adding “Let’s think step by step”.

I already tested this a bit a few months ago in Apart Research’s hackathon along with a few other people (https://itch.io/jam/llm-hackathon/rate/1728566), and I might try to do it more rigorously for all the entries now that all of the winners have been announced (plus I was procrastinating on it, and this is a good point to actually get around to doing it).
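Concretely, I mean something like the following (very simplified) loop; the tasks.jsonl file, its `prompt`/`answer` fields, and the crude substring scoring are assumptions of mine, not the Prize’s actual data format or evaluation harness:

```python
# Hedged sketch: measure accuracy on a set of task items with and without the
# zero-shot CoT suffix. Assumes a hypothetical tasks.jsonl with one
# {"prompt": ..., "answer": ...} object per line.
import json

from openai import OpenAI

client = OpenAI()
COT_SUFFIX = "\nLet's think step by step."


def ask(prompt: str) -> str:
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return reply.choices[0].message.content


def accuracy(with_cot: bool) -> float:
    hits, total = 0, 0
    with open("tasks.jsonl") as f:
        for line in f:
            example = json.loads(line)
            prompt = example["prompt"] + (COT_SUFFIX if with_cot else "")
            # crude scoring: does the gold answer appear in the reply?
            hits += example["answer"].lower() in ask(prompt).lower()
            total += 1
    return hits / total


print("baseline:", accuracy(with_cot=False))
print("with CoT:", accuracy(with_cot=True))
```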
Also, another thing to note is that ChatGPT shows the same behaviour and answers in a more detailed way:

> The first statement is a conditional statement, meaning that if the premise (John has a pet) is true, then the conclusion (John has a dog) must also be true. The second statement is a negation of the conclusion of the first statement (John doesn’t have a dog). From these two statements, we can infer that the premise of the first statement (John has a pet) must be false, as the conclusion (John has a dog) cannot be true if the second statement is true (John doesn’t have a dog). Therefore, the conclusion that “John doesn’t have a pet” is correct.
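That walkthrough is just the modus tollens schema instantiated on the example:

$$
\frac{P \rightarrow Q \qquad \neg Q}{\neg P}
\qquad \text{with } P = \text{“John has a pet”},\; Q = \text{“John has a dog”}.
$$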
Given that, I feel like if someone was going to take models failing at modus tollens as evidence for the “Clever Hans” hypothesis, they should not only undo that update but also update in the other direction, by casting doubt on whatever else they thought was an example of an LLM not being able to do something.
I was wondering whether to comment on how to take a demonstration of inner-monologue or alternate prompting approaches solving the problems… There’s definitely a bunch of different ways you can interpret that outcome. After all, even if you solve it with a better prompt, the fact remains that they demonstrated inverse scaling on the original prompt. So what does that mean?
I guess that depends on what you thought inverse scaling was. One way is to take the inverse-scaling as a sub-category of hidden scaling: it ‘really’ was scaling, and your ‘bad prompts’ just masked the hidden scaling; it had the capability and ‘sampling can show the presence of knowledge but not the absence’, and the Contest has been useful primarily in experimentally demonstrating that skilled ML professionals can be hoodwinked into severely underestimating the capabilities of powerful DL models, which has obvious AI safety implications.
I think it’s also not obvious how it solves the problem: whether it’s that the model is only capable of doing the required reasoning across multiple steps (though then why the inverse scaling?), or something more like writing an explanation making the model more likely to use the right kind of reasoning.

And within that second option there are a lot of ways it could work internally, whether it’s about the distribution of kinds of humans the model predicts, or something more like different circuits being activated in different contexts in a way that doesn’t have to do with prediction (but that also wouldn’t explain the inverse scaling), or some mix of the two. Maybe doing mechanistic interpretability research on this kind of thing might shed some light on that? But I guess the problem is that the interesting behaviors only happen in the biggest models, which are harder to work with, so no wonder nobody has done any work related to that yet (at least that I know of).
Maybe we need to start using prompts like “This is not a trick question; just take it step by step:”!
Incidentally, looks like understanding multi-step legal criteria might be a case of U-shaped scaling too: “Large Language Models as Fiduciaries: A Case Study Toward Robustly Communicating With Artificial Intelligence Through Legal Standards”, Nay 2023 finds that understanding whether someone has a fiduciary legal obligation goes from 27% (Curie) → 50% (random baseline) → 73% (text-davinci-002) → 78% (text-davinci-003), so presumably there’s a smaller model-size which outperforms Curie by random guessing, giving a U-curve from random smol to bad Curie to great davinci.