nostalgebraist’s post and Part 1 of this were pretty useful, but I really appreciate the dive into the actual mathematical and architectural details of the Transformer; it makes the knowledge more concrete and easier to remember.
Small errata:
“calculating the inner product between their keys and values” should probably be “calculating the inner product between their keys and queries” (based on what I understand from before and based on the math expressions after this)
“as inputted from the encoder stack” should probably be “as inputted to the encoder stack”
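To make the first erratum concrete, here is a minimal sketch of how I understand it, assuming single-head scaled dot-product attention. The function names `attention_weights` and `attend` are my own, not from the post. The scores come from inner products of queries with keys, and the values only enter afterwards, when they are averaged using those weights.

```python
import math

def attention_weights(queries, keys):
    """Row-wise softmax(q . k / sqrt(d_k)): one weight per (query, key) pair."""
    d_k = len(keys[0])
    weights = []
    for q in queries:
        # Score for each position: inner product of this query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        # Numerically stable softmax over the scores.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

def attend(queries, keys, values):
    """Values enter only here, as a weighted average under the weights above."""
    dim = len(values[0])
    return [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(dim)]
        for row in attention_weights(queries, keys)
    ]
```

With a query of `[1.0, 0.0]` and keys `[1.0, 0.0]` and `[0.0, 1.0]`, the first key gets the larger weight, since its inner product with the query is larger.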
Thanks :)
There are actually quite a few errors in this post. Thanks for catching more. At some point I’ll probably go back and fix stuff.