I did some similar experiments two months ago, and with to your setup the special tokens show up on the first attempt:
Interesting! Can you give a bit more detail or share code?
It is based on this. I changed it to optimize using softmax instead of straight-through estimation and added regularization for the embedded tokens.
Notebook link—this is a version that mimics this post instead of optimizing a single neuron as in the original.
EDIT: github link
I did some similar experiments two months ago, and with to your setup the special tokens show up on the first attempt:
Interesting! Can you give a bit more detail or share code?
It is based on this. I changed it to optimize using softmax instead of straight-through estimation and added regularization for the embedded tokens.
Notebook link—this is a version that mimics this post instead of optimizing a single neuron as in the original.
EDIT: github link