I don’t have a strong position on how helpful vs. misleading the shoggoth image is, as a corrective to this mistake.
You started with random numbers, and you essentially applied rounds of constraint application and annealing. I kind of think of it as getting a metal really hot and pouring it into a mold. In this case, the 'mold' is your training set.
So what jumps out at me about the "shoggoth" idea is that it's got all these properties: the "shoggoth" hates you, wants to eat you, is just ready to jump you and digest you with its tentacles. Or whatever.
But none of that cognitive structure will exist unless it paid rent in compressing tokens. This algorithm will not find the optimal compression algorithm, but you only have a tiny fraction of the weights you would need to record the token continuations at Chinchilla scaling. You need every last weight to be pulling its weight (no pun intended).
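A rough back-of-the-envelope sketch of that capacity argument, assuming the ~20 tokens-per-parameter rule of thumb from the Chinchilla paper, fp16 weights, and a GPT-2-ish vocabulary size (all three numbers are assumptions, not from the original text):

```python
import math

# At Chinchilla-optimal scaling, roughly how many bits of weight are
# available per training token, versus the bits needed to simply record
# each token verbatim?

TOKENS_PER_PARAM = 20    # assumed Chinchilla rule of thumb
BITS_PER_WEIGHT = 16     # assumed fp16/bf16 weights
VOCAB_SIZE = 50_000      # assumed GPT-2-ish vocabulary

bits_of_weight_per_token = BITS_PER_WEIGHT / TOKENS_PER_PARAM
bits_to_record_token = math.log2(VOCAB_SIZE)

print(f"weight capacity per token seen: {bits_of_weight_per_token:.2f} bits")
print(f"raw cost to store one token:    {bits_to_record_token:.2f} bits")
# ~0.8 bits of weight per token versus ~15.6 bits to store a token id:
# rote memorization of continuations is off the table, so the weights
# have to compress.
```

So under these assumed numbers the model has well under a bit of weight capacity per token it trains on, which is the sense in which every weight must pay rent.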