I’d imagine you know better than I do, and GDM’s recent summary of their alignment work seems to largely confirm what you’re saying.
I’d still guess that to the extent practical results have come out of the alignment teams’ work, it’s mostly been immediately used for corporate censorship (even if it’s passed to a different team).
I do think this is probably true for RLHF and RLAIF, but not true for all the mechanistic interp work people are doing (though it’s arguable whether those count as “practical results”). I also think it isn’t true for the debate-type work, or the model organism work.
I think mech interp, debate, and model organism work are notable for currently having no practical applications lol (I am keen to change this for mech interp!)
There are depths of non-practicality far beyond mech interp, debate, and model organism work. I know many people who would consider that work to be on the highly practical side of AI safety work :P
None of those seem all that practical to me, except for mechanistic interpretability’s SAE clamping, and I do actually expect that to be used for corporate censorship once the kinks have been worked out of it.
If the current crop of model organisms research has any practical applications, I expect it to be used to reduce jailbreaks, as in adversarial robustness, which is strongly correlated with both safety and corporate censorship.
Debate is less clear, but I also don’t really expect practical results from that line of work.