Edit: On a closer read, I take it you’re looking only for tasks well-suited for language models? I’ll leave this comment up for now, in case it’d still be of use.
Task: Extract the training objective from a fully trained ML model.
Input type: The full description of an ML model’s architecture + its parameters.
Output type: Mathematical or natural-language description of the training objective.
Input: [Learned parameters and architecture description of a fully-connected neural network trained on the MNIST dataset.]
Output: Classifying handwritten digits.
Input: [Learned parameters and architecture description of InceptionV1.]
Output: Labeling natural images.
Can’t exactly fit full parameter listings here, but the dataset seems relatively easy to assemble.
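A minimal sketch of how one dataset entry might be assembled, assuming PyTorch; `make_example` and the output schema are my own illustrative choices, not part of the proposal:

```python
import torch.nn as nn

def make_example(model: nn.Module, objective: str) -> dict:
    """Serialize architecture + parameters into one (input, target) pair for the goal-extractor."""
    arch = repr(model)  # human-readable architecture description
    params = {name: p.detach().flatten().tolist() for name, p in model.named_parameters()}
    return {"input": {"architecture": arch, "parameters": params}, "target": objective}

# An MNIST-style MLP (untrained here; in the real dataset it would be fully trained first).
mnist_mlp = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))
example = make_example(mnist_mlp, "Classifying handwritten digits.")
```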
We can then play around with it:
See how well it generalizes. Does it stop working if we show it a model with a slightly unfamiliar architecture? Or a model with an architecture it knows, but trained for a novel-to-it task? Or a model with a familiar architecture, trained for a familiar task, but on a novel dataset? Would show whether Chris Olah’s universality hypothesis holds for high-level features.
See if it can back out the training objective at all. If not, uh-oh, we have pseudo-alignment. (Note that the reverse isn’t true: if it can extract the intended training objective, the inspected model can still be pseudo-aligned.)
Mess around with what exactly we show it. If we show all but the first layer of a model, would it still work? Only the last three layers? What’s the minimal set of parameters it needs to know?
Hook it up to an attribution tool to see what specific parameters it looks at when figuring out the training objective (rough sketch below).
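For the attribution idea, a bare-bones gradient-saliency sketch, assuming the goal-extractor is (or can be approximated by) a differentiable model that reads a flattened parameter vector; `extractor` and `parameter_saliency` are stand-ins, not an existing tool:

```python
import torch

def parameter_saliency(extractor: torch.nn.Module, flat_params: torch.Tensor) -> torch.Tensor:
    """Gradient of the predicted objective's score w.r.t. each inspected parameter."""
    flat_params = flat_params.clone().detach().requires_grad_(True)
    logits = extractor(flat_params.unsqueeze(0))  # extractor reads the parameter vector
    logits[0, logits.argmax()].backward()         # attribute the top predicted objective
    return flat_params.grad.abs()                 # high values = parameters it relied on
```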
I think it’s fine to have tasks that wouldn’t work for today’s language models, like those that would require other input modalities. Would prefer to have fully specified inputs, but these do seem easy to produce in this case. Would be ideal if there were examples with a smaller input size, though.
Hmm. A speculative, currently-intractable way to do this might be to summarize the ML model before feeding it to the goal-extractor.
tl;dr: As per natural abstractions, most of the details of the interactions between the individual neurons are probably irrelevant with regard to the model’s high-level functioning/reasoning. So there should be, in principle, a way to automatically collapse e.g. a trillion-parameter model into a much lower-complexity high-level description that would still preserve such important information as the model’s training objective.
But there aren’t currently any fast-enough algorithms for generating such summaries.
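To make the kind of summary I have in mind slightly more concrete, here is a deliberately crude sketch (per-layer weight statistics). Purely illustrative: this particular summary almost certainly discards too much to recover the training objective, but it shows the shape of the thing.

```python
import torch.nn as nn

def summarize(model: nn.Module) -> list[dict]:
    """Collapse a model into per-layer summary statistics; vastly smaller than the raw weights."""
    return [
        {
            "layer": name,
            "shape": tuple(p.shape),
            "mean": p.mean().item(),
            "std": p.std().item(),
        }
        for name, p in model.named_parameters()
    ]
```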