I suspect (though with high uncertainty) that some factors differentially advantage safety research, especially prosaic safety research (often in the shape of fine-tuning/steering/post-training): it often requires OOMs less compute, and more broadly it seems like it should have faster iteration cycles; 'human task time' (e.g. see the figures in https://metr.github.io/autonomy-evals-guide/openai-o1-preview-report/) is also probably shorter for a lot of prosaic safety research, which makes it likely to be automatable sooner on average. The existence of cheap, accurate proxy metrics for capabilities (and their relative scarcity for safety) might point the opposite way, though in some cases there do seem to exist good, diverse safety proxies, e.g. Eight Methods to Evaluate Robust Unlearning in LLMs. More discussion in the extra slides of this presentation.
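To make the 'cheap proxy metric' point concrete, here's a minimal, hypothetical sketch of the kind of scalar unlearning eval that makes a research loop easy to automate. The scoring function, datasets, and weighting here are all illustrative assumptions, not taken from the linked paper (which discusses much richer evaluations):

```python
# Hypothetical sketch of a cheap proxy metric for robust unlearning: one scalar
# summarizing how well an intervention removed knowledge of a "forget" set while
# preserving performance on a "retain" set. `score_fn`, the datasets, and the
# equal weighting are illustrative assumptions.
from typing import Callable, Sequence, Tuple


def unlearning_proxy_score(
    score_fn: Callable[[str, str], float],  # returns P(correct answer | prompt), in [0, 1]
    forget_set: Sequence[Tuple[str, str]],  # (prompt, answer) pairs the model should have unlearned
    retain_set: Sequence[Tuple[str, str]],  # (prompt, answer) pairs the model should still get right
) -> float:
    """Higher is better: low accuracy on forget_set, high accuracy on retain_set."""
    forget_acc = sum(score_fn(p, a) for p, a in forget_set) / len(forget_set)
    retain_acc = sum(score_fn(p, a) for p, a in retain_set) / len(retain_set)
    # Equal weighting of forgetting and retention; a real eval would combine
    # several diverse metrics rather than a single scalar.
    return 0.5 * (1.0 - forget_acc) + 0.5 * retain_acc


if __name__ == "__main__":
    # Toy stand-in for a model: pretends to have forgotten prompts tagged [forget].
    toy_score_fn = lambda prompt, answer: 0.1 if "[forget]" in prompt else 0.9
    forget = [("[forget] capital of Freedonia?", "Fredville")]
    retain = [("capital of France?", "Paris")]
    print(f"proxy score: {unlearning_proxy_score(toy_score_fn, forget, retain):.2f}")
```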
Relatedly, I expect the delay from needing to build extra infrastructure to train much larger LLMs will differentially slow capabilities progress, acting somewhat like a pause (or series of pauses). It would be nice to exploit this by enrolling more safety people to maximally elicit newly deployed capabilities, and potentially be uplifted by them, for safety research. (Anecdotally, I suspect I'm already being uplifted at least a bit as a safety researcher by using Sonnet.)
Also, I think it's much harder to pause/slow down capabilities than to accelerate safety, so more of the community's focus should go to the latter.
And for now, it's fortunate that inference scaling relies on CoT and similarly transparent intermediate outputs (differentially so, compared to model internals), which probably makes it a safer way of eliciting capabilities.
I mean, in the same way as it is a boon to AI capabilities research? How is this differentially useful?