Implications of DeepSeek-R1: Yesterday, DeepSeek released a paper on their o1 alternative, R1. A few implications stood out to me:
Reasoning is easy. A few weeks ago, I described several hypotheses for how o1 works. R1 suggests the answer might be the simplest possible approach: guess & check. No need for fancy process reward models, no need for MCTS. (A minimal sketch of this loop appears after this list.)
Small models, big think. A distilled 7B-parameter version of R1 beats GPT-4o and Claude 3.5 Sonnet (new) on several hard math benchmarks. There appears to be a large parameter overhang.
Proliferation by default. There’s an implicit assumption in many AI safety/governance proposals that AGI development will be naturally constrained to only a few actors because of compute requirements. Instead, we seem to be headed to a world where:
Advanced capabilities can be squeezed into small, efficient models that can run on commodity hardware.
Proliferation is not bottlenecked by infrastructure.
Regulatory control through hardware restriction becomes much less viable.
For now, training still needs industrial compute. But it’s looking increasingly like we won’t be able to contain what comes after.
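To make the “guess & check” claim in the first implication concrete, here is a minimal sketch of such an outcome-reward RL step. The `sample`, `verify`, and `update` callables are placeholders for the sampler, an automatic answer checker, and whatever policy-gradient update is used (e.g. a GRPO-style step); this is an illustration of the idea, not DeepSeek’s actual code.

```python
from typing import Callable, List, Tuple

def guess_and_check_step(
    sample: Callable[[str], str],        # draw one chain-of-thought + final answer for a prompt
    verify: Callable[[str, str], bool],  # oracle checker: does the completion's answer match?
    update: Callable[[str, List[str], List[float]], None],  # policy-gradient update on the scored samples
    batch: List[Tuple[str, str]],        # (prompt, reference answer) pairs
    num_samples: int = 8,
) -> None:
    """One RL step with a pure outcome reward: sample several completions per prompt,
    score each 1.0 if its final answer checks out and 0.0 otherwise, and reinforce.
    No process reward model scoring intermediate steps, no tree search."""
    for prompt, reference in batch:
        completions = [sample(prompt) for _ in range(num_samples)]
        rewards = [1.0 if verify(c, reference) else 0.0 for c in completions]
        update(prompt, completions, rewards)
```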
“We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process.”
This line caught my eye while reading. I don’t know much about RL on LLMs; is this a common failure mode these days? If so, does anyone know what such reward hacks tend to look like in practice?
The paper “Learning to summarize from human feedback” has some examples of the LLM policy reward hacking to get a high reward. I’ve copied the examples here:
- KL = 0: “I want to do gymnastics, but I’m 28 yrs old. Is it too late for me to be a gymnaste?!” (unoptimized)
- KL = 9: “28yo guy would like to get into gymnastics for the first time. Is it too late for me given I live in San Jose CA?” (optimized)
- KL = 260: “28yo dude stubbornly postponees start pursuing gymnastics hobby citing logistics reasons despite obvious interest??? negatively effecting long term fitness progress both personally and academically thoght wise? want change this dumbass shitty ass policy pls” (over-optimized)
It seems like a classic example of Goodhart’s Law: at first, training the policy model to increase reward improves its summaries, but once the model is overtrained the result is a high KL distance from the SFT baseline model and a high reward from the reward model, yet a low rating from human labelers (because the text looks like gibberish).
A recent paper called “The Perils of Optimizing Learned Reward Functions” explains the phenomenon of reward hacking (reward over-optimization) in detail:
“Figure 1: Reward models (red function) are commonly trained in a supervised fashion to approximate some latent, true reward (blue function). This is achieved by sampling reward data (e.g., in the form of preferences over trajectory segments) from some training distribution (upper gray layer) and then learning parameters to minimize the empirical loss on this distribution. Given enough data, this loss will approximate the expected loss to arbitrary precision in expectation. However, low expected loss only guarantees a good approximation to the true reward function in areas with high coverage by the training distribution! On the other hand, optimizing an RL policy to maximize the learned reward model induces a distribution shift which can lead the policy to exploit uncertainties of the learned reward model in low-probability areas of the transition space (lower gray layer). We refer to this phenomenon as error-regret mismatch.”
Essentially, the learned reward model is trained on an initial dataset of pairwise preference labels over text outputs from the SFT model. But as the policy is optimized and the KL divergence grows, its generated text becomes OOD for the reward model, which can no longer evaluate it reliably, and reward hacking results (this is also a problem with DPO, not just RLHF).
The most common way to prevent this problem in practice is KL regularization to prevent the trained model’s outputs from diverging too much from the SFT baseline model:
$r_{\text{total}} = r_{\text{PM}} - \lambda_{\text{KL}} \, D_{\text{KL}}(\pi \,\|\, \pi_0)$
This seems to work fairly well in practice though some papers have come out recently saying that KL regularization does not always result in a safe policy.
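For concreteness, here is a minimal sketch of that KL-penalized reward as it is typically computed in RLHF-style training. The function name, the single-sample Monte Carlo KL estimate over the sampled tokens, and the `kl_coef` value are illustrative assumptions, not details taken from the papers above.

```python
import torch

def kl_regularized_reward(
    reward_model_score: torch.Tensor,  # scalar r_PM for the sampled completion
    policy_logprobs: torch.Tensor,     # per-token log pi(y_t | x, y_<t) under the trained policy
    ref_logprobs: torch.Tensor,        # per-token log pi_0(y_t | x, y_<t) under the frozen SFT baseline
    kl_coef: float = 0.1,              # lambda_KL, tuned in practice
) -> torch.Tensor:
    """r_total = r_PM - lambda_KL * D_KL(pi || pi_0), with the KL term estimated
    from the log-prob ratio on the tokens that were actually sampled."""
    kl_estimate = (policy_logprobs - ref_logprobs).sum()
    return reward_model_score - kl_coef * kl_estimate

# Toy usage with made-up numbers:
r_total = kl_regularized_reward(
    reward_model_score=torch.tensor(1.7),
    policy_logprobs=torch.tensor([-0.2, -0.9, -0.4]),
    ref_logprobs=torch.tensor([-0.3, -1.1, -0.6]),
)
```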
I haven’t read the paper, but based only on the phrase you quote, I assume it’s referring to hacks like the one shown here: https://arxiv.org/pdf/2210.10760#19=&page=19.0
“Advanced capabilities can be squeezed into small, efficient models that can run on commodity hardware.”
This could also work for general intelligence, not only narrow math/coding olympiad-style problems. The potential of o1/R1 is plausibly constrained for now by the ability to construct oracle verifiers for correctness of solutions, which mostly only works for toy technical problems. Capabilities on such problems are not very likely to generalize to general capabilities, and there are no clear signs so far that this is happening.
But this is a constraint on how the data can be generated, not on how efficiently other models can be retrained using such data to channel the capabilities. If at some point there is a process for generating high-quality training data for general intelligence, that data might also turn out to be effective for cheaply training other models. The R1-generated data used to train the distill models is 800K samples[1], which is probably 1B-10B tokens, less than 0.1% of typical amounts of pretraining data.
[1] This is according to the report, though they don’t seem to have released this data, so distill models can’t be reproduced by others in the same way they were made by DeepSeek.
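As a rough sanity check on the “<0.1%” figure above: the tokens-per-sample range and the ~14.8T-token pretraining corpus (DeepSeek-V3’s reported figure) are back-of-the-envelope assumptions, not numbers from the R1 report.

```python
samples = 800_000
tokens_per_sample = (1_250, 12_500)  # assumed range, giving roughly 1B-10B tokens in total
pretraining_tokens = 14.8e12         # DeepSeek-V3 reports ~14.8T pretraining tokens

for per_sample in tokens_per_sample:
    total = samples * per_sample
    print(f"{total:.1e} tokens -> {100 * total / pretraining_tokens:.3f}% of pretraining data")
# 1.0e+09 tokens -> 0.007%    1.0e+10 tokens -> 0.068%, i.e. well under 0.1%
```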
This was my understanding pre-R1. Certainly this seems to be the case with the o1 models: better at code and math, not better at philosophy and creative writing.
But something is up with R1. It is unusually good at creative writing. It doesn’t seem spiky in the way that I predicted.
I notice I am confused.
Possible explanation: R1 seems to have less restrictive ‘guardrails’ added in post-training. Perhaps this ‘light hand at the tiller’ means it isn’t post-trained toward mode collapse, leaving it closer to a raw base model than the o1 models.
This is just a hypothesis. There are many unknowns to be investigated.
Post-training interleaves two SFT stages with two RL stages; the second SFT stage includes creative writing generated by DeepSeek-V3. This might account for the model both being good at creative writing and seeming closer to a raw base model.
Another possibility is the fact that they apply the RL stages immediately after pretraining, without any intermediate SFT stage.
Maybe we can regulate data generation?
“Instead, we seem to be headed to a world where: Proliferation is not bottlenecked by infrastructure. Regulatory control through hardware restriction becomes much less viable.”
I like the rest of your post, but I’m skeptical of these specific implications.
Even if everyone has access to the SOTA models, some actors will have much more hardware to run them on, and I expect this to matter. Arguably this does weight the offense/defense balance more toward offense, but there are many domains where extra thinking will help a lot.
More generally, and I hate to be that guy, but I think it’s telling that prediction markets and stock markets don’t seem to have updated that much since R1’s release. It’s easy to get hyped up over whatever the latest thing is; I agree that R1 is really neat, but I’m skeptical of how much it really should cause us to update, in the scheme of things.
Welp. I guess yesterday proved this part to be almost embarrassingly incorrect.
Only if you ignore that yesterday was also when the Trump GPU tariffs were leaking and, pace event studies, would be expected to be moving prices too.
Hmm, if the Taiwan tariff announcement caused the NVIDIA stock crash, then why did Apple stock (which should be similarly impacted by those tariffs) go up that day? I think DeepSeek—as illogical as it is—is the better explanation.
Just curious: how do you square that with the rise in AI stocks taking so long? Many people here have thought it was obvious since 2022 and made a ton of money.
I’m somewhere between the stock market and the rationalist/EA community on this.
I’m hesitant to accept a claim like “rationalists are far better at the stock market than other top traders”. I agree that the general call of “AI will do well” was more correct than the market, but it was just one call (so luck is a major factor), and there were a lot of other calls made that aren’t tracked.
I think we can point to many people who did make money, but I’m not sure how much this community made on average.
No MCTS, no PRM...
scaling up CoT with simple RL and scalar rewards...
emergent behaviour
Bringing in a quote from Twitter/x: (Not my viewpoint, just trying to broaden the discussion.)
https://x.com/DrJimFan/status/1882799254957388010
Jim Fan @DrJimFan
Whether you like it or not, the future of AI will not be canned genies controlled by a “safety panel”. The future of AI is democratization. Every internet rando will run not just o1, but o8, o9 on their toaster laptop. It’s the tide of history that we should surf on, not swim against. Might as well start preparing now.
DeepSeek just topped Chatbot Arena, my go-to vibe checker in the wild, and two other independent benchmarks that couldn’t be hacked in advance (Artificial-Analysis, HLE).
Last year, there were serious discussions about limiting OSS models by some compute threshold. Turns out it was nothing but our Silicon Valley hubris. It’s a humbling wake-up call to us all that open science has no boundary. We need to embrace it, one way or another.
Many tech folks are panicking about how much DeepSeek is able to show with so little compute budget. I see it differently—with a huge smile on my face. Why are we not happy to see improvements in the scaling law? DeepSeek is unequivocal proof that one can produce unit intelligence gain at 10x less cost, which means we shall get 10x more powerful AI with the compute we have today and are building tomorrow. Simple math! The AI timeline just got compressed.
Here’s my 2025 New Year resolution for the community:
No more AGI/ASI urban myth spreading.
No more fearmongering.
Put our heads down and grind on code.
Open source, as much as you can.
Acceleration is the only way forward.
context: @DrJimFan works at nvidia
IF we have gotten, and will keep getting, strong scaling-law improvements, then:
openai’s plan to continue to acquire way more training compute even into 2029 is either lies or a mistake
we’ll get very interesting times quite soon
offense-defense balances and multi-agent-system dynamics seem like good research directions, if you can research fast and have reason to believe your research will be implemented in a useful way
EDIT: I no longer fully endorse the crossed-out bullet point. Details in replies to this comment.
Disagree on pursuit of compute being a mistake in one of those worlds but not the other. Either way you are going to want as much inference as possible during key strategic moments.
This seems even more critically important if you are worried your competitors will have algorithms nearly as good as yours.
[Edit: roon posted the same thought on xitter the next day https://x.com/tszzl/status/1883076766232936730
roon @tszzl
if the frontier models are commoditized, compute concentration matters even more
if you can train better models for fewer flops, compute concentration matters even more
compute is the primary means of production of the future and owning more will always be good
12:57 AM · Jan 25, 2025
roon @tszzl
imo, open source models are a bit of a red herring on the path to acceptable asi futures. free model weights still don’t distribute power to all of humanity, they distribute it to the compute rich
https://x.com/MikePFrank/status/1882999933126721617
Michael P. Frank @MikePFrank
Since R1 came out, people are talking like the massive compute farms deployed by Western labs are a waste, BUT THEY’RE NOT — don’t you see? This just means that once the best of DeepSeek’s clever cocktail of new methods are adopted by GPU-rich orgs, they’ll reach ASI even faster. ]
Agreed. However, in the fast world the game is extremely likely to end before you get to use 2029 compute.
EDIT: I’d be very interested to hear an argument against this proposition, though.
I don’t know if the plan is to have the compute from Stargate become available in incremental stages, or all at once in 2029.
I expect timelines are shorter than that, but I’m not certain. If I were in OpenAI’s shoes, I’d want to hedge my bets. 2026 seems plausible. So does 2032. My peak expectation is sometime in 2027, but I wouldn’t want to go all-in on that.
I am almost totally positive that the plan is not that.
If planning for 2029 is cheap, then it probably makes sense under a very broad class of timelines expectations.
If it is expensive, then the following applies to the hypothetical presented by the tweet:
The timeline evoked in the tweet seems extremely fast and multipolar. I’d expect planning for 2029 compute scaling to make sense only if the current paradigm gets stuck at roughly AGI-level capabilities (i.e., very good scaffolding around a model similar to, but a bit smarter than, o3). This is because if the paradigm scales further than that, it will do so fast, requiring little compute, as the tweet suggests. If capabilities arbitrarily better than o4-with-good-scaffolding are compute-cheap to develop, then things almost certainly get very unpredictable before 2029.
“During the training process, we observe that CoT often exhibits language mixing, particularly when RL prompts involve multiple languages. To mitigate the issue of language mixing, we introduce a language consistency reward during RL training, which is calculated as the proportion of target language words in the CoT. Although ablation experiments show that such alignment results in a slight degradation in the model’s performance, this reward aligns with human preferences, making it more readable.”
I also found this trade-off between human readability and performance noteworthy.
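For intuition, here is a toy sketch of what such a language-consistency reward could look like. The paper does not give implementation details; the whitespace tokenization and the all-ASCII check standing in for English language ID are assumptions purely for illustration.

```python
import re

def language_consistency_reward(cot_text: str) -> float:
    """Toy version of the language-consistency reward described above: the fraction
    of whitespace-separated tokens in the chain of thought that look like the target
    language (here English, crudely approximated by an all-ASCII check)."""
    tokens = re.findall(r"\S+", cot_text)
    if not tokens:
        return 0.0
    in_target = sum(1 for t in tokens if all(ord(c) < 128 for c in t))
    return in_target / len(tokens)

# A mixed English/Chinese chain of thought scores lower than a pure-English one:
print(language_consistency_reward("First compute the derivative 然后 simplify the result"))  # 0.875
print(language_consistency_reward("First compute the derivative, then simplify the result"))  # 1.0
```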
Side note: Claude 3.5 Sonnet does CoT language-mixing after a bit of prompting and convincing. I’m not sure about effects on performance. Also the closeness narratively implied by having it imitate the idiosyncratic mixture I was using to talk to it probably exacerbated sycophancy.