I’m not too sure what to expect, and I’d be pretty interested to e.g. set up a Metaculus/forecasting question to know what others think. I’m definitely sympathetic to your view to some extent.
Here’s one case I see against: I think it’s plausible that models will have the representations/ability/knowledge required to do some of these tasks, but that we’re not reliably able to elicit that knowledge (at least without a large validation set, but we won’t have access to that if we’re having models do tasks people can’t do, or in general for a new/zero-shot task). E.g., for NegationQA, surely even current models have some fairly good understanding of negation, so why is that understanding not showing in the results here? My best guess is that NegationQA isn’t capabilities-bottlenecked but has to do with something else. I think the updated paper’s result that chain-of-thought prompting alone reverses some of the inverse scaling trends is interesting; it also suggests that maybe naively using an LM isn’t the right way to elicit a model’s knowledge (but chain-of-thought prompting might be).
In general, I don’t think it’s always accurate to use a heuristic like “humans behave this way, so LMs-in-the-limit will behave this way.” It seems plausible to me that LM representations will encode the knowledge for many/most/almost-all human capabilities, but I’m not sure it means models will have the same input-output behavior as humans (e.g., for reasons discussed in the simulators post and since human/LM learning objectives are different).
I will happily bet that NeQA resolves with scale in the next two years, at something like 1:1 odds, and that in the worst case it resolves with scale + normal finetuning (instruction finetuning or RLHF) within the next two years at something like 4:1 odds (without CoT)! (It seems like all of them are U-shaped or positive scaling with CoT already?)
I made a Manifold market for the general question. If I’m not mistaken, the updated paper says that 2/4 of them already demonstrate U-shaped scaling, using the same eval as you did?
I’ll make one for NeQA and Redefine Math later today.
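For readers who want the odds above as probabilities, here is a minimal sketch. It assumes "a:b odds" means the bettor stakes a units to win b units, which is one common reading but is not spelled out in the comment; the function name `break_even_probability` is made up for illustration.

```python
def break_even_probability(stake: float, payout: float) -> float:
    """Probability at which staking `stake` to win `payout` has zero expected value.

    Assumption: "stake:payout odds" means the bettor risks `stake` units
    and gains `payout` units if the claim resolves in their favour.
    """
    if stake <= 0 or payout <= 0:
        raise ValueError("stake and payout must be positive")
    # Expected value p * payout - (1 - p) * stake = 0  =>  p = stake / (stake + payout)
    return stake / (stake + payout)


even_odds = break_even_probability(1, 1)      # the "1:1" bet on scale alone
four_to_one = break_even_probability(4, 1)    # the "4:1" bet on scale + finetuning
```

Under that reading, offering 1:1 only makes sense with at least 50% credence, and offering 4:1 with at least 80%. If the odds were meant the other way around, swap the two arguments.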
I think it’s plausible that models will have the representations/ability/knowledge required to do some of these tasks, but that we’re not reliably able to elicit that knowledge (at least without a large validation set, but we won’t have access to that if we’re having models do tasks people can’t do, or in general for a new/zero-shot task).
I agree that these tasks exist. If intent alignment fails and we end up with a misaligned AGI, then we in some sense can’t get the AI to do any of the nice powerful things we’d like it to do. We’d like to see examples of this sort of failure before we make a powerful unaligned AGI, ideally in the scaling laws paradigm.
Broadly speaking, there are three types of inverse scaling curves: 1) those that resolve with scale, i.e. capabilities tasks, 2) those that are in some sense “tricking” the model with a misleading prompt where human labelers use additional context clues to not be tricked (for example, that they’re labelling an ML dataset, and so they should probably answer as literally as possible), or 3) alignment failures (very hard to elicit). 1) resolves with scale, 2) can be easily fixed with tweaks to the prompt or small amounts of instruction finetuning/RLHF, and I think we agree that 3) is the interesting kind.
My claim is that all four of these tasks are clearly not alignment failures, and I also suspect that they’re all of type 1).
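To make "inverse", "U-shaped", and "positive" scaling concrete, here is a rough sketch of how one might label a curve of task scores ordered from smallest to largest model. The function name `classify_scaling`, the tolerance `tol`, and the example numbers are all hypothetical; this is not the procedure used in the paper, and it says nothing about which of the three types above a task belongs to.

```python
from typing import Sequence


def classify_scaling(scores: Sequence[float], tol: float = 0.01) -> str:
    """Label a score-vs-scale curve; `scores` go from smallest to largest model."""
    if len(scores) < 3:
        raise ValueError("need at least three model sizes")
    # Index of the worst-performing model size (first one, if tied).
    lowest = min(range(len(scores)), key=lambda i: scores[i])
    drop = scores[0] - scores[lowest]   # decline from the smallest model to the trough
    rise = scores[-1] - scores[lowest]  # recovery from the trough to the largest model
    if drop <= tol and rise <= tol:
        return "flat"
    if drop > tol and rise > tol:
        return "u-shaped"   # gets worse with scale, then recovers
    if drop > tol:
        return "inverse"    # gets worse with scale and has not recovered
    return "positive"       # only improves from the smallest model


# Made-up accuracies for illustration only.
print(classify_scaling([0.60, 0.50, 0.40, 0.30]))
print(classify_scaling([0.60, 0.40, 0.50, 0.70]))
print(classify_scaling([0.30, 0.40, 0.60]))
```

The sketch only compares the endpoints with the trough, so it ignores noise between model sizes and would mislabel a curve that rises and then falls. A curve that looks "inverse" here may simply be a "u-shaped" one whose recovery lies beyond the largest model tested, which is what a bet on "resolves with scale" amounts to.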
In general, I don’t think it’s always accurate to use a heuristic like “humans behave this way, so LMs-in-the-limit will behave this way.” It seems plausible to me that LM representations will encode the knowledge for many/most/almost-all human capabilities, but I’m not sure it means models will have the same input-output behavior as humans (e.g., for reasons discussed in the simulators post and since human/LM learning objectives are different).
That’s super fair. I think I’m using a more precise heuristic than this in practice, something like, “if you’re not ‘tricking’ the model in some sense, things that untrained humans can do on the first go can be done by models”, though this still might fail in the limit for galaxy-brain reasons.
(EDIT: made a Manifold market for round 2 inverse scaling tasks as well)