AFAIK, only Gwern and I have written concrete stories speculating about how a training run will develop cognition within the AGI.
This worries me, if true (if not, please reply with more!). I think it would be awesome to have more concrete stories![1] If Nate, or Evan, or John, or Paul, or—anyone, please, anyone add more concrete detail to this website!—wrote one of their guesses of how AGI goes, I would understand their ideas and viewpoints better. I could go “Oh, that’s where the claimed sharp left turn is supposed to occur.” Or “That’s how Paul imagines IDA being implemented, that’s the particular way in which he thinks it will help.”
Maybe a contest would help?
Even if scrubbed of any AGI-capabilities-advancing, sociohazardous detail. Although I'm not that convinced that sociohazard is a big deal for conceptual content written on AF: lots of people probably have theories of how AGI will go, and implementation is, I have heard, the bottleneck.
Contrast this with beating SOTA on crisply defined datasets in a way that enables ML authors to get prestige, publication, attention, and funding by building off of your work. They seem like different beasts.
I also think a bunch of alignment writing seems syntactical. Like, “we need to solve adversarial robustness so that the AI can’t find bad inputs and exploit them / so that we don’t have to worry about distributional shift. Existing robustness strategies have downsides A, B, and C, and it’s hard to even get ϵ-ball guarantees on classifications. Therefore, …”
And I’m worried that this kind of writing isn’t abstractly summarizing a concrete failure story the author has in mind (like “I train the AI [with this setup] and it produces [this internal cognition] for [these mechanistic reasons]”; see A shot at the diamond alignment problem for an example), along with their best guesses at how to intervene on that story to prevent the failure (e.g. “but if we had [this robustness property], we could be sure its policy would generalize into situations X, Y, and Z, which makes the story go well”). I’m rather worried that people are playing syntactically, rather than reasoning from detailed models of what might happen.
Detailed models are expensive to make. Detailed stories are hard to write. There’s a lot we don’t know. But we sure as hell aren’t going to solve alignment only via valid reasoning steps on informally specified axioms (“The AI has to be robust or we die”, or something?).
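As an aside, here is a minimal sketch of what the “ϵ-ball guarantees” mentioned above are gesturing at, assuming a PyTorch image classifier with inputs scaled to [0, 1]; the model, ϵ, and step sizes are illustrative placeholders, not anything from this post. A PGD search can refute robustness at an input by finding a misclassified point inside the ball, but failing to find one certifies nothing, which is one reason genuine ϵ-ball guarantees are hard to get.

```python
# Illustrative only: an empirical L-infinity eps-ball robustness check via
# projected gradient descent (PGD). Assumes `model` maps a batch of inputs in
# [0, 1] to class logits; eps, step_size, and steps are placeholder values.
import torch
import torch.nn.functional as F


def pgd_attack(model, x, y, eps=0.03, step_size=0.007, steps=10):
    """Search the eps-ball around x for inputs that raise the classification loss."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Signed gradient ascent step, then project back into the eps-ball and valid pixel range.
        x_adv = x_adv.detach() + step_size * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv.detach()


def empirically_robust_at(model, x, y, eps=0.03):
    """True if PGD fails to flip the predictions: evidence of robustness, not a guarantee."""
    x_adv = pgd_attack(model, x, y, eps=eps)
    with torch.no_grad():
        return bool((model(x_adv).argmax(dim=1) == y).all())
```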
Examples should include actual details. I often ask people to give a concrete example, and they often don’t; I wish that happened less often. For example:
This is not a concrete example.
This is a concrete example.