your view seems to imply that we will move quickly from much worse than humans to much better than humans, but it’s likely that we will move slowly through the human range on many tasks
We might be able to falsify that in a few months.
There is a joint Google / OpenAI project called BIG-bench. They’ve crowdsourced ~200 highly diverse text tasks (from answering scientific questions to predicting protein interaction sites to measuring self-awareness).
One of the goals of the project is to see how performance on the tasks changes with model size, across many orders of magnitude.
Half a year ago, they presented some preliminary results. A quick summary:
if you increase the number of parameters N from 10^7 to 10^10, the aggregate performance score grows roughly like log(N).
But after the 10^10 point, something interesting happens: the score starts growing much faster (~N).
And for some tasks, the plot looks like a hockey stick (a sudden change from ~0 to almost-human).
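For intuition, here is a toy curve with that shape (a minimal sketch in Python; the 10^10 breakpoint location and both slopes are made-up constants for illustration, not the actual BIG-bench numbers):

```python
import numpy as np

def toy_aggregate_score(n_params):
    """Illustrative only: log-like growth up to 1e10 parameters,
    then roughly linear-in-N growth. All constants are made up."""
    breakpoint_n = 1e10
    log_slope = 0.05      # toy score gained per 10x increase below the breakpoint
    linear_slope = 0.1    # toy score gained per extra 1e10 parameters above it
    if n_params <= breakpoint_n:
        return log_slope * np.log10(n_params / 1e7)
    score_at_break = log_slope * np.log10(breakpoint_n / 1e7)
    return score_at_break + linear_slope * (n_params / breakpoint_n - 1)

for n in [1e7, 1e8, 1e9, 1e10, 2e10, 4e10]:
    print(f"N = {n:.0e}: toy score ~ {toy_aggregate_score(n):.2f}")
```

In this toy curve, the entire 10^7 to 10^10 range adds about as much score as going from 10^10 to roughly 2.5 × 10^10.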
The paper with the full results is expected to be published in the next few months.
Judging by the preliminary results, the FOOM could start like this:
“The GPT-5 still sucks on most tasks. It’s mostly useless. But what if we increase parameters_num by 2? What could possibly go wrong?”
Hot damn, where can I see these preliminary results?
The results were presented at a workshop by the project organizers. The video from the workshop is available here (the most relevant presentation starts at 5:05:00).
It’s one of those innocent presentations that, after you understand the implications, keep you awake at night.
Presumably you’re referring to this graph. The y-axis looks like the kind of score that ranges between 0 and 1, in which case this looks sort-of like a sigmoid to me, which accelerates as it gets closer to ~50% performance (and decelerates as it gets closer to 100% performance).
If so, we might want to ask whether these tasks were chosen ~randomly (among tasks that are indicative of how useful AI is) or selected for difficulty in some way. In particular, assume that most tasks look sort-of like a sigmoid as models are scaled up (accelerating around 50%, improving more slowly when they’re closer to 0% or 100%). Then you might think that the most exciting tasks to submit to BIG-bench would be the ones that small models can’t handle but that large models rapidly improve on (as opposed to tasks that are basically solved by 10^10 parameters). In which case the aggregation of all these tasks could be expected to look sort-of like this, improving faster after 10^10 than before.
...is one story I can tell, but idk if I would have predicted that beforehand, and fast acceleration after 10^10 is certainly consistent with many people’s qualitative impressions of GPT-3. So maybe there is some real acceleration going on.
(Also, see this post for similar curves, but for the benchmarks that OpenAI tested GPT-3 on. There’s no real acceleration visible there, other than for arithmetic.)
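One way to sanity-check the selection-effect story above is to simulate it. The sketch below (Python; the per-task sigmoid model and the assumption that submitters favour tasks whose midpoints sit past 10^10 parameters are both made up for illustration, not taken from BIG-bench data) aggregates 200 smooth sigmoid tasks and reproduces this kind of post-10^10 acceleration:

```python
import numpy as np

rng = np.random.default_rng(0)
log_n = np.linspace(7, 12, 100)            # log10 of parameter count

def sigmoid_task(midpoint, width=0.5):
    """Per-task score modelled as a sigmoid in log10(N)."""
    return 1 / (1 + np.exp(-(log_n - midpoint) / width))

# Assumed selection effect: submitters favour tasks that small models can't do,
# so task midpoints mostly sit past 10^10 parameters.
midpoints = rng.uniform(10.0, 12.0, size=200)
aggregate = np.mean([sigmoid_task(m) for m in midpoints], axis=0)

for lg in [8, 9, 10, 11, 12]:
    i = np.argmin(np.abs(log_n - lg))
    print(f"10^{lg} params: mean toy score ~ {aggregate[i]:.2f}")
# The aggregate stays low below 10^10 and climbs much faster after it,
# even though every individual task is a smooth sigmoid with no sudden jump.
```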
The preliminary results were obtained on a subset of the full benchmark (~90 tasks vs 206 tasks). There have been many changes since then, including scoring changes. Thus, I’m not sure we’ll see the same dynamics in the final results. Most likely yes, but maybe not.
I agree that the task selection process could create dynamics that look like acceleration. A good point.
As I understand it, the organizers accepted almost all submitted tasks (the main rejection reasons were technical: copyright issues, etc.). So it was mostly self-selection, with a bias towards the hardest imaginable text tasks. It seems that for many contributors, the main motivation was something like:
“Take that, the most advanced AI of Google! Let’s see if you can handle my epic task!”
This includes many cognitive tasks that are supposedly human-complete (e.g. understanding of humor, irony, and ethics), and tasks that probe the model’s generality (e.g. playing chess, recognizing images, navigating mazes, all in text).
I wonder if the performance dynamics on such tasks will follow the same curve.
The list of all tasks is available here.
Seems interestingly similar to the grokking phenomenon.
So these results are not reported in “Multitask Prompted Training Enables Zero-Shot Task Generalization”, Sanh et al 2021?
For Sanh et al. (2021), we were able to negotiate access to preliminary numbers from the BIG Bench project and run the T0 models on it. However, the authors of Sanh et al. and the authors of BIG Bench are different groups of people.
The aforementioned BIG-bench paper from Google is now publicly available:
some highlights
the paper
Nope. Although the linked paper uses the same benchmark (a tiny subset of it), the paper comes from a separate research project.
As I understand, the primary topic of the future paper will be the BIG-bench project itself, and how the models from Google / OpenAI perform on it.
Hypothesis:
doing things in the real world requires diverse skills (strong performance on a diverse set of tasks)
hockey-sticking performance on a particular task makes that task no longer the constraint on what you can accomplish
but now some other task is the bottleneck
so, unless you can hockey-stick on all the tasks at once, the growth of your overall ability to do things in the world will get smoothed out a bunch, even if it’s still very rapid
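A toy numerical version of this argument (a sketch with assumed numbers; treating “doing a thing in the world” as requiring several component skills at once, with the weakest required skill as the bottleneck, is the modelling choice being illustrated):

```python
import numpy as np

rng = np.random.default_rng(1)
log_n = np.linspace(9, 13, 200)                          # log10 of model scale

# 20 component skills, each hockey-sticking at its own (assumed) scale threshold.
thresholds = rng.uniform(9.5, 12.5, size=20)
skills = 1 / (1 + np.exp(-8.0 * (log_n[None, :] - thresholds[:, None])))

# 100 hypothetical real-world activities, each needing a handful of skills;
# the weakest required skill is the bottleneck for that activity.
needed = [rng.choice(20, size=rng.integers(2, 7), replace=False) for _ in range(100)]
ability_per_activity = np.array([skills[idx].min(axis=0) for idx in needed])
overall = ability_per_activity.mean(axis=0)              # average across activities

for lg in [10, 11, 12, 13]:
    i = np.argmin(np.abs(log_n - lg))
    print(f"10^{lg} params: overall ability ~ {overall[i]:.2f}")
# Each individual skill jumps abruptly, but the average over activities climbs
# more gradually, because at most scales some required skill is still missing.
```

The smoothing here comes entirely from different activities being gated by different skills; each individual skill still takes off suddenly.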
Seems like there’s a spectrum between smooth accelerating progress and discontinuous takeoff. And where we end up on that spectrum depends on a few things:
how much simple improvements (better architecture, more compute) help with a wide variety of tasks
how much improvement in AI systems is bottlenecked on those tasks
how many resources the world is pouring into finding and making those improvements
Recent evidence (success of transformers, scaling laws) seems to suggest that Eliezer was right in the FOOM debate that simple input changes could make a large difference across a wide variety of tasks.
It’s less clear to me though whether that means a local system is going to outcompete the rest of the economy, because it seems plausible to me that the rest of the economy is also going to be full-steam ahead searching the same improvement space that a local system will be searching.
And I think in general, real-world complexity tends to smooth out lumpy graphs. As an example, even once we realized that GPT-2 was powerful and GPT-3 would be even better, there was still a whole bunch of engineering work that had to go into figuring out how to run such a big neural network across multiple machines.
That kind of real-world messiness seems like it will introduce new bottlenecks at every step along the way, and at every order-of-magnitude change in scale, which makes me think that the actual impact of AI will be a lot smoother than we might otherwise expect based solely on simple architectures being generally useful and scalable.
What makes you say BIG Bench is a joint Google / OpenAI project? I’m a contributor to it and have seen no evidence of that.
During the workshop presentation, Jascha said that OpenAI will run their models on the benchmark. This suggests that there is (was?) some collaboration. But that was half a year ago.
Just checked: the repo’s readme doesn’t mention OpenAI anymore. In earlier versions, it was mentioned like this:
“Teams at Google and OpenAI have committed to evaluate BIG-Bench on their best-performing model architectures.”
So, it seems that OpenAI withdrew from the project, partially or fully.
OpenAI is still running evaluations.
Interesting… I was busy and wasn’t able to watch the workshop. That’s good to know, thanks!
GPT-4 is expected to have about 10^14 parameters and be ready in a few years. And, we already know that GPT-3 can write code. The following all seem (to me at least) like very reasonable conjectures:
(i) Writing code is one of the tasks at which GPT-4 will have (at least) human level capability.
(ii) Clones of GPT-4 will be produced fairly rapidly after GPT-4, say 1-3 years.
(iii) GPT-4 and its clones will have a significant impact on society. This will show up in the real economy.
(iv) GPT-4 will be enough to shock governments into paying attention. (But as we have seen with climate change, governments can pay attention to an issue for a long time without effectively doing anything about it.)
(v) Someone is going to ask GPT-4 (or a clone) to produce code that generates AGI. (Implicitly, if not explicitly.)
I have absolutely no idea whether GPT-4 will succeed at this endeavor. But if not, GPT-5 should be available a few years later....
(And, of course, this is just one pathway.)
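For a sense of scale (back-of-the-envelope only; the 10^14 figure is the conjecture above, and GPT-3’s published size is about 1.75 × 10^11 parameters):

```python
gpt3_params = 1.75e11        # GPT-3's published parameter count
conjectured_gpt4 = 1e14      # the figure conjectured in the comment above
print(f"That would be roughly a {conjectured_gpt4 / gpt3_params:.0f}x jump.")  # ~571x
```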
There was a Q&A where Sam Altman said GPT-4 is going to be a lot smaller than that (in particular, that it wouldn’t have a lot more parameters than GPT-3).
You appear to be correct. I will withdraw my comment.