Yup; there will also likely be a flywheel from automated ML research, e.g. https://sakana.ai/ai-scientist/ whose code is also open-source. Notably, o1 seemed to improve most at math and code, which seem like some of the most relevant skills for automated ML research. And it’s not clear that other parts of the workflow will be blockers either, given some recent results showing human-comparable automated ideation and reviewing.
You’ve linked to this Sakana AI paper like 8 times in the last week. IMO, please stop, it’s complete bunk, basically a scam.
I don’t think it being bunk is really any evidence against automated ML research becoming a thing soon, or even already being a thing, but the fact that you keep linking to it while ignoring the huge errors in it, and pretending it proves some point, is frustrating.
As someone who dislikes the hype over Sakana AI, and agrees that Bogdan should stop linking to it so much, I think it’s less of a scam and more of an overhyped product that was not ready for primetime or the discussion it got.
The discourse around Sakana AI was not good, but I do think it has some uses, just not nearly as many as people want it to have.
Well known complete bunk can be useful for gesturing at an idea, even as it gives no evidence about related facts. It can be a good explanatory tool when there is little risk that people will take away related invalid inferences.
(Someone strong-downvoted Bogdan’s comment, which I opposed with a strong-upvote, since the comment doesn’t by itself seem to be committing the error of believing the Sakana hype, and it gates my reply, which I don’t want hidden just because the comment it happens to be replying to sinks deep into negative karma.)
To maybe further clarify, I think of the Sakana paper roughly like how I think of autoGPT. LM agents were overhyped initially and autoGPT specifically didn’t work anywhere near as well as some people expected. But I expect LM agents as a whole will be a huge deal.
I genuinely don’t know what you’re referring to.
Fwiw, I’m linking to it because I think it’s the first/clearest demo of how the entire ML research workflow (e.g. see Figure 1 in the arXiv paper) can plausibly be automated using LM agents, and they show a proof of concept which arguably already does something (in any case, it works better than I’d have expected it to). If you know of a better reference, I’d be happy to point to that instead/alternatively. Similarly if you can ‘debunk’ it (I don’t think it’s been anywhere near debunked).
We had this conversation two weeks ago?
https://www.lesswrong.com/posts/rQDCQxuCRrrN4ujAe/jeremy-gillen-s-shortform?commentId=TXePXoEosJmAbMZSk
I thought you meant the AI scientist paper has some obvious (e.g. methodological or code) flaws or errors. I find that thread unconvincing, but we’ve been over this.
It’s not necessarily at all impactful. The crucial question for the next few years is whether and where LLM scaling plateaus. Before o1, GPT-4 level models couldn’t produce useful reasoning traces that are very long; reading comprehension has only just started mostly working at this scale. And RLHF through PPO is apparently to a large extent a game of carefully balancing early stopping. So it’s brittle and doesn’t generalize very far off-distribution, which made it unclear whether heavier RL can help with System 2 reasoning when it isn’t capable of rebuilding the capabilities from scratch, overwriting the damage it does to the LLM.
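(To make that balancing act concrete, here is a toy sketch. RLHF-style PPO is commonly run with a KL penalty against the reference model and halted once the policy drifts too far, trading proxy reward against over-optimization. The reward curve, KL growth rate, budget, and coefficient below are illustrative stand-ins I made up, not any lab’s actual training setup.)

```python
# Toy illustration of the "carefully balancing early stopping" point: proxy
# reward keeps climbing during RL, but so does KL drift from the reference
# model, and the run is stopped once drift exceeds an assumed budget.
KL_BUDGET = 10.0   # assumed drift budget from the reference model (illustrative)
BETA = 0.05        # assumed KL-penalty coefficient (illustrative)

def training_step(step):
    proxy_reward = 1.0 - 0.9 ** step          # stand-in: reward saturates upward
    kl_from_reference = 0.02 * step ** 1.5    # stand-in: drift grows over training
    penalized_objective = proxy_reward - BETA * kl_from_reference
    return proxy_reward, kl_from_reference, penalized_objective

for step in range(1, 200):
    reward, kl, objective = training_step(step)
    if kl > KL_BUDGET:
        print(f"early stop at step {step}: KL {kl:.1f} exceeds budget "
              f"(proxy reward {reward:.2f}, penalized objective {objective:.2f})")
        break
```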
But taking base model scaling one step further should help both with the fragility of GPT-4 level models and with their ability to carry out System 2 reasoning on their own, with a little bit of adaptation that elicits capabilities, rather than heavier RL that instills capabilities not already present to a useful extent in the base model. And then there’s at least one more step of base model scaling after that (deployed models are about 40 megawatts of H100s, models in training about 150 megawatts; in 1.5 years we’ll get to about a gigawatt, with 2x in FLOPs from moving to B200s and whatever effective FLOP/joule Trillium delivers). So there’s every chance this advancement is rendered obsolete, to the extent that it’s currently observable as a capability improvement, if these scaled-up models just start being competent at System 2 reasoning without needing such post-training. Even the ability of cheaper models to reason could then be reconstructed by training them on reasoning traces collected from larger models.
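(As a rough back-of-envelope check of that scaling arithmetic, using the comment’s own power estimates and the assumed ~2x FLOP/joule gain from B200s, and leaving Trillium’s contribution unquantified:)

```python
# Back-of-envelope sketch of the compute-scaling estimates above. All numbers
# are the parent comment's rough figures, not measurements.
current_training_power_mw = 150   # ~150 MW of H100s for models now in training
future_training_power_mw = 1000   # ~1 GW clusters expected in ~1.5 years
flops_per_joule_gain = 2.0        # assumed ~2x from moving H100 -> B200

# Relative growth in effective training FLOPs, holding training time fixed:
power_ratio = future_training_power_mw / current_training_power_mw
effective_flops_ratio = power_ratio * flops_per_joule_gain

print(f"~{power_ratio:.1f}x more power, ~{effective_flops_ratio:.0f}x more effective FLOPs")
# -> ~6.7x more power, ~13x more effective FLOPs (before any Trillium gains)
```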
On the other hand, this is potentially a new dimension of scaling, if the RL can do real work and brute-force scaling of base models wouldn’t produce a chatbot good enough to reinvent the necessary RL on its own. There is a whole potentially very general pipeline to generative capabilities here. It starts with a preference over outcomes (the ability to verify an answer, to simultaneously judge the aesthetics and meaning of a poem, to see whether a pull request really does resolve the issue). It then proceeds to training a process supervision model that estimates how much individual reasoning steps contribute to getting a good outcome, and to optimizing the generative model that proposes good reasoning steps. With a few more years of base model scaling, LLMs are probably going to be good enough at evaluating impactful outcomes that they are incapable of producing directly, so this pipeline gets a lot of capabilities to manufacture. If o1’s methodology is already applicable to this, that removes the uncertainty about whether it could’ve been made to start working without delay relative to the underlying scaling of base models. And the scaling of RL might go a long way before running out of exploitable preference signal about the outcomes of reasoning, independently of how the base models are being scaled up.
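(A minimal, toy sketch of the three stages of that pipeline: an outcome-level preference, a process reward over individual steps, and optimization of the step-proposer. The arithmetic task, function names, and the crude “lower the step-error rate when trajectories score well” update are illustrative assumptions on my part, not o1’s actual training method.)

```python
# Schematic sketch of the outcome -> process-reward -> policy pipeline.
import random

def outcome_reward(problem, answer):
    """Outcome-level preference: here, exact verification of a sum."""
    return 1.0 if answer == sum(problem) else 0.0

def propose_steps(problem, noise):
    """Stand-in for the generative model proposing a chain of reasoning steps
    (running partial sums), making an error on a step with probability `noise`."""
    steps, running = [], 0
    for x in problem:
        running += x + (1 if random.random() < noise else 0)
        steps.append(running)
    return steps, running

def process_rewards(problem, steps):
    """Process supervision: score each step by whether its partial sum is still correct."""
    correct = [sum(problem[:i + 1]) for i in range(len(problem))]
    return [1.0 if s == c else 0.0 for s, c in zip(steps, correct)]

# Crude stand-in for RL against the process reward model: trajectories that earn
# high outcome + process reward nudge the "policy" toward fewer step errors.
noise = 0.5
for _ in range(500):
    problem = [random.randint(1, 9) for _ in range(4)]
    steps, answer = propose_steps(problem, noise)
    score = outcome_reward(problem, answer) + sum(process_rewards(problem, steps)) / len(steps)
    if score > 1.5:                      # mostly-correct trajectory: reinforce it
        noise = max(0.0, noise - 0.02)

print(f"final step-error rate: {noise:.2f}")
```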