Very cool stuff! Do you have the notebook on colab or something? Kind of want to find out how the story ends, whether that's in a second-half video or just by playing around with the code. At the end of this video you had what looked like fairly clean positional embeddings coming out of MLP0. Also, the paying-attention-to-self in the second attention layer could plausibly have something to do with erasing the information that comes in on that token, since that's something all transformer decoders have to do in some fashion or another.
Pretty sure the loss spikes were coming from using max rather than min when defining the learning rate schedule. Your learning rate multiplier starts at 1 and then, once the step count passes 100, increases linearly as step/100, which explains why training behaves itself for a while and then ultimately diverges after enough steps.
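For concreteness, here's a minimal sketch of what I mean (the warmup length of 100 and the names are just illustrative, not taken from your notebook):

```python
import matplotlib.pyplot as plt

WARMUP_STEPS = 100  # illustrative warmup length, assumed rather than from the notebook

def lr_mult_buggy(step):
    # max() pins the multiplier at 1 through warmup, then lets it grow
    # linearly as step/WARMUP_STEPS forever -- the LR eventually blows up
    return max(1.0, step / WARMUP_STEPS)

def lr_mult_fixed(step):
    # min() ramps the multiplier from 0 up to 1 over warmup, then holds it at 1
    return min(1.0, step / WARMUP_STEPS)

steps = range(500)
plt.plot(steps, [lr_mult_buggy(s) for s in steps], label="max (buggy)")
plt.plot(steps, [lr_mult_fixed(s) for s in steps], label="min (fixed)")
plt.xlabel("step")
plt.ylabel("LR multiplier")
plt.legend()
plt.show()
```

With max, the two schedules agree up to step 100 and then diverge, which matches the loss curve looking fine early on and spiking late.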
Yeah, just changing the max to a min produces this much smoother loss curve from your notebook.
Oops, I didn't read the post carefully enough; you've already linked to the colab!