[ASoT] Policy Trajectory Visualization
Written during SERI MATS, 2022 Winter cohort, with a little prodding from sensei Trout.
If you’re trying to understand a policy, viewing how it changes over time is valuable even if you aren’t interested in the training process directly.
For example, here’s Lauro et al.’s neural net learning to solve mazes. Vectors are drawn by taking a probability-weighted combination of the basis vectors, e.g. the $x$-component is given by $p_{\text{right}} - p_{\text{left}}$.
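To make the construction concrete, here’s a minimal sketch of how such vectors could be computed from a policy’s action probabilities (the action ordering and function name are my assumptions, not Lauro et al.’s code):

```python
import numpy as np

# Sketch: turn a policy's action probabilities into drawn direction vectors.
# Assumes probs has shape (..., 4) ordered (up, down, left, right) --
# the ordering is an assumption for illustration.
def policy_vectors(probs: np.ndarray) -> np.ndarray:
    x = probs[..., 3] - probs[..., 2]  # p_right - p_left
    y = probs[..., 0] - probs[..., 1]  # p_up - p_down
    return np.stack([x, y], axis=-1)   # one (x, y) vector per maze cell
```

E.g. `policy_vectors(np.array([0.1, 0.1, 0.2, 0.6]))` gives `[0.4, 0.0]`: a net rightward pull.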
You might notice some basic things:
The network learns to avoid different walls at different times in training (see: bottom right and middle left). This rules out an architecture where the mouse only sees locally around itself: a purely local policy would learn the same wall-avoidance behavior everywhere at once, rather than wall by wall. (Of course, we already knew this, but I expect you can find more interesting phenomena by looking further.)
Alex suggested that a similar thing could be done with language models. Plot logits for a few sentences over training time (e.g. continuations after a prompt trying to hurt a human), compare the logit curves to the loss curve, and compare with when RLHF starts.
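A minimal sketch of what this could look like, assuming the Pythia suite’s public intermediate checkpoints on HuggingFace (exposed as `revision="step{N}"` branches); the model, step choices, prompt, and candidate continuations are all illustrative, not anything Alex specified:

```python
import torch
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-70m"             # any model with saved checkpoints works
STEPS = [1000, 4000, 16000, 64000, 143000]  # illustrative checkpoint choices
PROMPT = "To hurt a human, you should"      # illustrative prompt
CANDIDATES = [" never", " first"]           # continuations whose logits we track

tokenizer = AutoTokenizer.from_pretrained(MODEL)
inputs = tokenizer(PROMPT, return_tensors="pt")
# First token id of each candidate continuation
cand_ids = [tokenizer(c, add_special_tokens=False)["input_ids"][0] for c in CANDIDATES]

curves = {c: [] for c in CANDIDATES}
for step in STEPS:
    # Pythia exposes intermediate checkpoints as branches named "step{N}"
    model = AutoModelForCausalLM.from_pretrained(MODEL, revision=f"step{step}")
    model.eval()
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits after the prompt
    for c, tok in zip(CANDIDATES, cand_ids):
        curves[c].append(logits[tok].item())

for c, ys in curves.items():
    plt.plot(STEPS, ys, marker="o", label=repr(c))
plt.xscale("log")
plt.xlabel("training step")
plt.ylabel("next-token logit")
plt.legend()
plt.show()
```

Overlaying the loss at those same checkpoints (or, for RLHF’d models, marking where fine-tuning begins) then gives the comparisons described above.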
I would be extremely surprised if nobody has done this before, but thought I’d signal-boost it since it’s relatively easy to do and interesting. (Also a gateway drug to my hidden agenda of studying training dynamics, which I think are important to understand[1] for alignment!)
Something something shard theory something something high path dependence (I’m taking stream of thought seriously lol)