Mateusz Bagiński comments on Shortform

Mateusz Bagiński 30 Sep 2024 13:28 UTC
4 points
3
Also, I’m curious what it is that you consider(ed) AI safety progress/innovation. Can you give a few representative examples?
- Cleo Nardo 30 Sep 2024 16:41 UTC
  4 points
  0
  I’ve added a fourth section to my post. It operationalises “innovation” as “non-transient novelty”. Some representative examples of an innovation would be:
  I think these articles were non-transient and novel.
  - Mateusz Bagiński 30 Sep 2024 17:37 UTC
    1 point
    −3
    My notion of progress is roughly: something that is either a building block for The Theory (i.e. marginally advancing our understanding) or a component of some solution/intervention/whatever that can be used to move probability mass from bad futures to good futures.
    
    Re the three you pointed out: simulators I consider a useful insight, gradient hacking probably not (10% < p < 20%), and activation vectors I put in the same bin as RLHF, whatever the appropriate label for that bin is.
    - Cleo Nardo 30 Sep 2024 18:01 UTC
      4 points
      0
      thanks for the thoughts. i’m still trying to disentangle what exactly I’m pointing at.
      I don’t intend “innovation” to mean something normative like “this is impressive” or “this is research I’m glad happened” or anything. i mean something more low-level, almost syntactic. more like “here’s a new idea everyone is talking about”. this idea might be a threat model, or a technique, or a phenomenon, or a research agenda, or a definition, or whatever.
      like, imagine your job was to maintain a glossary of terms in AI safety. i feel like new terms used to emerge quite often, but not any more (i.e. not for the past 6-12 months). do you think this is fair? i’m not sure how worrying this is, but i haven’t noticed others mentioning it.
      NB: here’s 20 random terms I’m imagining included in the dictionary:
      Evals
      Mechanistic anomaly detection
      Steganography
      Glitch token
      Jailbreaking
      RSPs
      Model organisms
      Trojans
      Superposition
      Activation engineering
      CCS
      Singular Learning Theory
      Grokking
      Constitutional AI
      Translucent thoughts
      Quantilization
      Cyborgism
      Factored cognition
      Infrabayesianism
      Obfuscated arguments