AISC project: TinyEvals
Apply to work on this project with me at AI Safety Camp 2024 before 1st December 2023.
The project is not set in stone, I am looking for feedback!
Summary
TinyStories is a suite of Small Language Models (SLMs) trained exclusively on children’s stories generated by ChatGPT. The models use simple yet coherent English, far surpassing what was previously observed in other models of comparable size.
I hope that most of the capabilities of these models can be thoroughly understood using currently available interpretability techniques. Doing so would represent a major milestone in the development of mechanistic interpretability (mech interp).
The goal of this AISC project is to publish a paper that systematically identifies and characterises the range of capabilities exhibited by the TinyStories models. While in-depth analysis of the underlying circuits is outside the current scope, this project represents an important initial step in that direction.
Gaining a clear picture of the capabilities of these models will encourage the research community to subsequently build on these findings by analysing the responsible circuits. This will further the development of mech interp and provide insights into how language models work internally.
Motivation
My theory of change for mech interp
I am optimistic about RSPs and auditing; in short:
limit a system’s training, deployment, etc., depending on its dangerous capabilities and provable alignment
use behavioural evaluations to test for dangerous capabilities
use understanding-based evaluations to test for alignment
We do not know how to conduct understanding-based evals yet
When we reach a certain level of dangerous capabilities, we will either
continue and risk a catastrophe or
stop and incur huge alignment tax
That is fine by me, but it makes regulation less likely to be implemented
Mech interp is a promising approach to understanding-based evals
Fully understanding a model vs reverse engineering specific capabilities
Currently, researchers reverse engineer capabilities that are approachable or that they find interesting; there are some problems with this:
“approachable” misses hard problems by default (we do not even know what they are)
“interesting” may not turn out to be most relevant
It is almost universally done on narrow distributions, and does not provide a general understanding of the components involved
Instead we could
Try to reverse engineer every capability the model expresses on the training dataset
if we fail, we just identified a concrete open problem
For each component, list all functions it is responsible for
Identify the components with unknown functions
they can be directly relevant to safety
e.g. responsible for strategic deception
or highlight something about the architecture we do not understand yet
Eventually, we would gain a general understanding of every model component
I acknowledge this is very ambitious, and the proposed project is just a first step in this direction
Why TinyStories?
Models and the dataset are open source
The models are small, between 1 and 33 million parameters
“[...] focusing on mech interp work on small models is just fine, and extremely worth it for the much faster feedback loops” ~ Neel Nanda
The dataset is small and simple, comprising under 2GB of text and approximately 10,000 unique tokens. Moreover, every accurate next token prediction will be logically sound to humans; these are just stories for 3-4 year olds
They exhibit capabilities beyond what humans can implement in code
“A lot of manual mechanistic interpretability work focuses primarily on scaling explanations to larger models, as opposed to more complex tasks or comprehensive explanations, which I think are more important.” ~ Lawrence Chan
Why only do evaluations for now?
Risk
Interpretability is hard and open-ended
We will be just a small team of junior researchers working part-time
Leverage
By creating a lot of interpretability challenges, we can tap into the potential of numerous independent mech interp researchers
I hope this will do for mech interp what OpenAI Gym did for reinforcement learning
Steps involved
Identify capabilities
Start with the least capable model.
Collect probabilities of correct next token predictions on the TinyStories validation dataset
Visualise samples where the model was correct and confident about the next token prediction
Identify the simplest and most common pattern or capability
Filter the samples expressing the identified capability
Go back to step 2
Repeat for the next model, but filter out all the cases where both models were correct and confident, to highlight only new capabilities.
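The steps above can be sketched in a few lines of PyTorch. The helper name `confident_correct_positions` and the 0.9 confidence threshold are illustrative assumptions, not fixed parts of the plan, and the model name in the usage comment is just one of the TinyStories checkpoints on HuggingFace:

```python
import torch

def confident_correct_positions(logits, tokens, threshold=0.9):
    """Positions whose true next token gets probability above `threshold`.

    logits: [seq, vocab] model outputs for one sample
    tokens: [seq] the sample's token ids
    threshold: assumed confidence cutoff (0.9 is an arbitrary example)
    """
    probs = torch.softmax(logits[:-1], dim=-1)  # predictions for positions 1..seq-1
    true_next = tokens[1:]                      # the tokens that actually came next
    p_correct = probs[torch.arange(len(true_next)), true_next]
    # +1 converts a prediction index back into the position of the predicted token
    return (torch.nonzero(p_correct > threshold).flatten() + 1).tolist()

# Usage with a TinyStories model might look like (not run here):
#   model = AutoModelForCausalLM.from_pretrained("roneneldan/TinyStories-1M")
#   logits = model(tokens.unsqueeze(0)).logits[0]
#   hits = confident_correct_positions(logits, tokens)
```

The surviving positions can then be visualised in context to spot the simplest recurring pattern, as described above.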
Characterise capabilities
For each identified capability:
Try to red-team the capability
What do you think are the characteristics of the text that make the behaviour present?
Are there any such examples where the behaviour is not present?
Does it work on synthetic examples?
What can you change in the text and still see the same results?
Define it as a task, or a set of (prompt, correct_answer) pairs
Evaluate performance of each model on the task by measuring:
Probability of correct_answer
Rank of correct_answer
If there is one obvious wrong answer: Logit difference between correct_answer and wrong_answer
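As a sketch, all three metrics can be computed from the logits at the final prompt position; `answer_metrics` is a hypothetical helper name, and rank 0 here means the correct answer is the model's top prediction:

```python
import torch

def answer_metrics(logits, correct_id, wrong_id=None):
    """Proposed task metrics from a model's final-position logits ([vocab])."""
    probs = torch.softmax(logits, dim=-1)
    metrics = {
        "prob": float(probs[correct_id]),                   # probability of correct_answer
        "rank": int((logits > logits[correct_id]).sum()),   # 0 = top prediction
    }
    if wrong_id is not None:  # only when there is one obvious wrong answer
        metrics["logit_diff"] = float(logits[correct_id] - logits[wrong_id])
    return metrics
```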
Summarise the results
What is the capability?
Why was it useful to learn?
Which models are able to perform the task?
Is the performance uniform across all examples? If not, what is different between them?
How could a transformer implement this capability?
Do all of the models perform equally well on the task? If not, why could that be?
Write the paper
The structure and the main message of the paper will depend on the findings. We should start writing in the second month of the project. That will help to consolidate our understanding and direct further research into the most promising directions.
Risks and downsides
There is a small risk that
Tools we develop to identify capabilities will end up supporting capabilities work
Further interpretability research on the capabilities we identify will motivate new capabilities insights
See Should we publish mechanistic interpretability research?
We will seek senior researchers’ advice before sharing our work widely.
Acknowledgements
I would like to thank @Linda Linsefors, @Arthur Conmy, @Lucia Quirke, and @cmathw for feedback on this proposal. I would like to thank Lucia Quirke, Lovis Heindrich, and @RGRGRG for sharing their preliminary research on TinyStories.
Team
Team size: 3-5 people including myself, depending on individual time commitments. The problem has a lot of surface area, and people can easily work in parallel.
Research Lead
@Jett (feel free to DM me with any questions)
I participated in the MATS winter 2023 cohort, mech interp stream, under Neel Nanda’s mentorship. I co-authored:
A circuit for Python docstrings in a 4-layer attention-only transformer
An Adversarial Example for Direct Logit Attribution: Memory Management in gelu-4l
Polysemantic Attention Head in a 4-Layer Transformer
Projects 1 and 3 involved a lot of identification and characterisation of capabilities, similar to what I envision for this AISC project. In projects 2 and 3 I acted as a research lead / mentor and received positive feedback. I commit to working on the project at least 10 hours per week.
Team Coordinator: I prefer some other team member to take on that role.
Skill requirements
Required:
Python: modules, defaultdicts, Counters, iterators, dataclasses, f-strings
Jupyter notebooks
Git: committing, branching, merging, resolving conflicts
Nice to have:
PyTorch
HuggingFace
TransformerLens
Plotly
Mech interp experience
Research experience
Nice to haves that I lack:
Technical writing
Web dev: HTML, CSS, JS
Appendix
Some capabilities observed in TinyStories 1M
N-grams
Once upon a time
From that day on
avocados
Repeated tokens
a big house with a lot of rooms. The house
there was a big garage [...] they needed more space in the garage
girl named Lily. [...] As they drove down the road, Lily
Repeated multi-token names (induction?)
a big, hairy rabbit named Bongo. Bongo
a little fish named Nemo. One day, Nemo
mouse named Timmy. He lived in a cozy hole in the wall of a big house. Timmy
Common phrases (skip trigrams?)
see something up close
Plural to singular with different tokenization
went to see the zebras, Lily saw a unique zebra
Understanding that a context was just provided, and it’s time for a story
Once upon a time, in a big forest, there lived a rhinoceros named Roxy. Roxy loved to climb. She climbed trees, rocks, and hills. One
Once upon a time, in a small yard, there was a small daisy. The daisy had a name. Her name was Daisy. Daisy was very small, but she was also very happy.\n
Once upon a time, there was a big, heavy alligator. He lived near a small pond. He was very hungry and wanted to eat something.\n\nOne
Knowing when to end a quote
Kitty smiled and replied, "Thank you, Spot. I polish it every day.”
Billy saw that Roxy was sad and asked, "Why are you sad, Roxy?”
The cow said, "I am lonely. I want a friend.”
Pronouns
Tim went to his
So, Mia and Tom played together. They
bought some light bulbs. When he came back, he put them
lemon on the ground. He wanted to play with it
Predicting related concepts:
bookshelf [...] book, park [...] grass, forgive [...] happy, lunchtime [...] eat, zebra [...] stripes, tree [...] climb, nurse [...] bandage, monkey [...] jungle, shop [...] counter, laundry [...] clothes, inside [...] goodbye, octopus [...] ocean, lost [...] can’t, road [...] driving, emergency [...] doctor, milk [...] spilled, hammer [...] screwdriver, Daddy [...] Mommy
Indirect Object Identification
Spot saw the shiny car and said, "Wow, Kitty, your car is so bright and clean!" Kitty smiled and replied, "Thank you, Spot
Buddy kicked the ball with his strong legs. The ball flew into the goal! Spot was so happy. He and Buddy
One sunny day, Amy went to the yard with her friend, Max. Max saw the purple swing and said, "Wow! I want to swing too!" Amy
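A capability like “repeated multi-token names” above can be turned directly into the kind of (prompt, correct_answer) task described under Steps involved. The generator below is a hypothetical illustration built from the appendix examples; the template and the animal/name pairs are assumptions, not a fixed benchmark:

```python
def name_repetition_task(pairs=(("rabbit", "Bongo"), ("fish", "Nemo"), ("mouse", "Timmy"))):
    """Build (prompt, correct_answer) pairs where the answer repeats an introduced name."""
    tasks = []
    for animal, name in pairs:
        prompt = f"Once upon a time, there was a little {animal} named {name}. One day,"
        # Leading space because, with GPT-2-style tokenizers, a name mid-sentence
        # is tokenized together with the preceding space.
        tasks.append((prompt, f" {name}"))
    return tasks
```

Swapping in synthetic animals and names makes this a cheap red-teaming probe for whether the behaviour is genuine induction or memorised phrasing.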