Task: Apply an abstract proposal to a concrete ML system
Context: A researcher is reading a highly theoretical alignment paper and is curious about whether and how it might apply to a real-world machine learning system, such as a large transformer trained with SGD. They would like to see which parts of the ML system would change under the proposal.
Input type: a theoretical alignment proposal and a description of an ML system
Output type: a description of how the ML system would change under the given proposal
Info constraints: none
Instance 1:
Input:
Abstract proposal: The description of the complexity regularizer from the ELK report.
ML system description: a GPT-style model trained on natural language using SGD, with a GPT-style reporter trained on the GPT model’s weights.
Output: This is tricky because the circuit complexity of a neural network is largely fixed by its architecture. As we sweep over different architectures/hyperparameters for the reporter model, we can add a regularization term to the hyperparameter optimization based on the total number of weights in the model.
Other possible output: Since circuit complexity for a neural network is largely fixed by its architecture, we will consider the “minimal complexity” of the trained neural network to be the number of non-zero parameters, and will regularize on that to encourage sparsity in the weights.
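A minimal sketch of the second variant, assuming a PyTorch-style reporter: since the exact count of non-zero parameters is not differentiable, an L1 penalty on the weights is used as a standard proxy that pushes parameters toward zero. The names `reporter`, `task_loss`, `optimizer`, and the coefficient `lam` are illustrative, not part of the report.

```python
# Hypothetical sketch: penalize the reporter's "minimal complexity" via an L1
# proxy for the number of non-zero parameters. `reporter`, `task_loss`, and
# `optimizer` are placeholders supplied by the caller.
import torch

def l1_complexity(model: torch.nn.Module) -> torch.Tensor:
    """L1 norm of all weights, a differentiable proxy for parameter count."""
    return sum(p.abs().sum() for p in model.parameters())

def train_step(reporter, optimizer, batch, task_loss, lam=1e-5):
    optimizer.zero_grad()
    # Reporter objective plus the complexity regularizer.
    loss = task_loss(reporter, batch) + lam * l1_complexity(reporter)
    loss.backward()
    optimizer.step()
    return loss.item()
```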
Instance 2:
Input:
Abstract proposal: The Iterated Amplification and Distillation proposal, as described in https://ai-alignment.com/iterated-distillation-and-amplification-157debfd1616
ML system description: a GPT-style model trained on natural language using SGD.
Output: Any of the project proposals in https://www.alignmentforum.org/posts/Y9xD78kufNsF7wL6f/machine-learning-projects-on-ida, or something similar
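For orientation, a rough, hypothetical sketch of one IDA iteration applied to a GPT-style model is below; `decompose`, `combine`, and `distill` are placeholders for the human decomposition step, the human aggregation step, and ordinary SGD fine-tuning. None of this is taken from the linked proposals.

```python
# Hypothetical sketch of one IDA iteration for a GPT-style model trained with
# SGD. `decompose`, `combine`, and `distill` are stand-ins supplied by the
# caller, not a real API.

def amplify(model, question, decompose, combine):
    """Amplification: a human (here `decompose`/`combine`) answers a question
    by splitting it into subquestions and delegating each to the model."""
    sub_answers = [model(q) for q in decompose(question)]
    return combine(question, sub_answers)

def ida_iteration(model, questions, decompose, combine, distill):
    """Distillation: fine-tune the model to imitate the amplified answers."""
    dataset = [(q, amplify(model, q, decompose, combine)) for q in questions]
    return distill(model, dataset)
```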
Task: Suggest surprising experiments that challenge assumptions
Context: A researcher is considering an alignment proposal that hinges on some key assumptions. They would like to see suggestions for experiments (either theoretical thought experiments or actual real-world experiments) that could challenge those assumptions. If an experiment has already been done, the output should report its results.
Input type: An assumption about a powerful AI system
Output type: a suggestion for an experiment that could challenge that assumption, along with the results if the experiment has already been done.
Instance 1:
Input: The performance of a model is impossible to predict, so we can’t hope to have an idea of a model’s capabilities before it is trained and evaluated.
Output: A key measure of a model's performance, such as the loss, might scale predictably with model size. This was investigated by Kaplan et al. (https://arxiv.org/abs/2001.08361), who found that the loss tends to follow a power law.
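A quick way to probe this assumption on a family of models one has already trained: fit a power law L(N) = (Nc/N)^alpha to (parameter count, loss) pairs in log-log space and check how well it predicts a held-out size. The data points below are made up for illustration.

```python
# Hypothetical check: fit a power law L(N) = (Nc / N)**alpha to (model size,
# loss) pairs, in the spirit of Kaplan et al. The data points are made up.
import numpy as np

sizes = np.array([1e6, 1e7, 1e8, 1e9])   # parameter counts (illustrative)
losses = np.array([4.2, 3.5, 2.9, 2.4])  # eval losses (illustrative)

# A power law is linear in log-log space: log L = alpha*log Nc - alpha*log N.
slope, intercept = np.polyfit(np.log(sizes), np.log(losses), 1)
alpha, Nc = -slope, np.exp(intercept / -slope)

predicted = (Nc / 1e10) ** alpha         # extrapolate to a 10B-parameter model
print(f"alpha={alpha:.2f}, predicted loss at 1e10 params: {predicted:.2f}")
```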
Instance 2:
Input: Suppose a model is trained on data that is mixed with some noise (as in https://arxiv.org/pdf/2009.08092.pdf). The model will necessarily learn that the data was mixed with noise, rather than learning a really complex decision boundary.
Output: Try fine-tuning one of these models on data that doesn't have the noise. If the model is very slow to adapt, that would suggest it learned the complex decision boundary rather than modelling the noise. (This experiment hasn't been done.)
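A toy analogue of the suggested experiment, illustrative only (a small MLP on synthetic 2D data rather than the models from the paper): train on labels mixed with noise, then fine-tune on clean labels and watch how quickly accuracy recovers.

```python
# Toy version of the suggested protocol (the real experiment has not been
# done): train on noisy labels, then fine-tune on clean labels and track
# how quickly the model adapts.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2))
y_clean = (X[:, 0] + X[:, 1] > 0).astype(int)  # simple linear boundary
flip = rng.random(len(y_clean)) < 0.3          # 30% label noise
y_noisy = np.where(flip, 1 - y_clean, y_clean)

clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=200, random_state=0)
clf.fit(X, y_noisy)                            # phase 1: train on noisy labels

# Phase 2: fine-tune on clean labels and watch how fast accuracy recovers.
for epoch in range(5):
    clf.partial_fit(X, y_clean)
    print(epoch, clf.score(X, y_clean))
```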
Instance 3:
Input: It’s impossible to train a neural network without non-linearities like ReLU or a sigmoid.
Output: That is true for idealized neural networks, but real neural networks are trained using floating-point arithmetic, which is inherently non-linear. These imperfections might be enough to train a competent model. This experiment was done by Jakob Foerster, who found that they were indeed enough: https://openai.com/blog/nonlinear-computation-in-linear-networks/
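The underlying imperfection is easy to see directly: near the bottom of the float32 range, scaling a number down and back up is not the identity, so the arithmetic is not exactly linear. (This only demonstrates the imperfection itself; it is not a reproduction of the linked experiment.)

```python
# Demonstration that float32 arithmetic is not exactly linear: scaling a value
# down near the denormal range and back up fails to be the identity map.
import numpy as np

scale = np.float32(1e-38)  # near the smallest normal float32
for x in np.float32([1.0, 1e-3, 1e-6, 1e-7]):
    roundtrip = (x * scale) / scale
    print(x, roundtrip, x == roundtrip)
```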