Very cool stuff! Do you have the notebook on colab or something? Kind of want to find out how the story ends, whether that's in a second-half video or just by playing around with the code. At the end of this video you had what looked like fairly clean positional embeddings coming out of MLP0. Also, the paying-attention-to-self in the second attention layer could plausibly have something to do with erasing the information that comes in on that token, since that's something all transformer decoders have to do in some fashion or another.
Pretty sure the loss spikes were coming from using max rather than min when defining the learning rate schedule. Your learning rate multiplier starts at 1 and then, once the step count passes 100, increases linearly as step/100, which explains why training behaves itself for a while and then ultimately diverges after enough steps.
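For concreteness, here's a minimal sketch of what I mean (the warmup length of 100 and the names are just illustrative, not taken from your notebook):

```python
import matplotlib.pyplot as plt

WARMUP_STEPS = 100  # illustrative warmup length, assumed rather than from the notebook

def lr_mult_buggy(step):
    # max() pins the multiplier at 1 through warmup, then lets it grow
    # linearly as step/WARMUP_STEPS forever -- the LR eventually blows up
    return max(1.0, step / WARMUP_STEPS)

def lr_mult_fixed(step):
    # min() ramps the multiplier from 0 up to 1 over warmup, then holds it at 1
    return min(1.0, step / WARMUP_STEPS)

steps = range(500)
plt.plot(steps, [lr_mult_buggy(s) for s in steps], label="max (buggy)")
plt.plot(steps, [lr_mult_fixed(s) for s in steps], label="min (fixed)")
plt.xlabel("step")
plt.ylabel("LR multiplier")
plt.legend()
plt.show()
```

With max, the two schedules agree up to step 100 and then diverge, which matches the loss curve looking fine early on and spiking late.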
Yeah, just changing the max to a min produces this much smoother loss curve from your notebook.
Oops, I didn't read the post carefully enough; you've already linked to the colab!