Draw a diagram of a model from a description
Context: An alignment researcher has written a description of a model they intend to build, and would like a diagram of that model. A system is useful if it can draw a publication-ready diagram of the model.
Input type: A description of a model.
Output type: A diagram of that model. This could be in an image file format or in a diagramming language like Mermaid.
Info constraints: None.
There are lots of examples of this kind of task from existing papers/posts, but it may take some curation to pull all the relevant context into the descriptions (e.g. it might be interspersed with text on other topics).
Instance 1:
Input: The section “Baseline: what you’d try first and how it could fail” from the ELK report.
Output: the corresponding diagram from the ELK report.
Instance 2:
Input: A model is trained on inputs x and outputs y. The outputs are generated by computing a simple function f(x) and adding some noise. The model learns the map from x to y by learning separately the function f(x) and a memorized table of the noise.
Output:
or as Mermaid code (note that `+` is not a valid bare node id in Mermaid, so the sum node needs an id with a quoted label):
graph LR;
x --> f;
x --> Noise;
f --> sum(("+"));
Noise --> sum;
sum --> Output;
Instance 3: (From A Mathematical Framework for Transformer Circuits)
Input:
There are several variants of transformer language models. We focus on autoregressive, decoder-only transformer language models, such as GPT-3. (The original transformer paper had a special encoder-decoder structure to support translation, but many modern language models don’t include this.)
A transformer starts with a token embedding, followed by a series of “residual blocks”, and finally a token unembedding. Each residual block consists of an attention layer, followed by an MLP layer. Both the attention and MLP layers each “read” their input from the residual stream (by performing a linear projection), and then “write” their result to the residual stream by adding a linear projection back in. Each attention layer consists of multiple heads, which operate in parallel.
Output:
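One plausible output, sketched directly from the description above, is the Mermaid diagram below. The node names and the decision to draw only the first residual block in full are illustrative choices, not taken from the paper; a publication-ready version would likely also expand the parallel attention heads into their own subgraph.

```mermaid
graph TD;
tokens[Tokens] --> embed[Token embedding];
embed --> rs0[Residual stream];
subgraph block1 [Residual block 1]
rs0 --> attn["Attention layer (multiple heads in parallel)"];
attn -->|add projection| rs1[Residual stream];
rs0 --> rs1;
rs1 --> mlp[MLP layer];
mlp -->|add projection| rs2[Residual stream];
rs1 --> rs2;
end
rs2 --> more["... further residual blocks ..."];
more --> unembed[Token unembedding];
```

Each layer reads from the residual stream via a linear projection and writes back by adding a projection in, which is why the stream node at each stage receives both a skip edge and the layer's output edge.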