Topics like this really draw a crowd, but if you don't know how it works, writing like this just adds energy in the wrong direction. If you start small, building perceptrons by hand, you can work your way up through models to transformers, and it'll be clear what the math is attempting to do per word. It's sophisticatedly predicting the next word based on a matrix of relevance to the previous word and the block as a whole. The attention mechanism is the magic of relevance, but it is still, at bottom, predicting the next word.
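To make "a matrix of relevance" concrete, here's a minimal sketch of scaled dot-product attention in NumPy. This is a toy illustration, not a real model: the query/key/value matrices are random stand-ins for what a trained transformer would produce, and the names are just for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax: subtract the max before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # scores[i, j] is how relevant token j is to token i
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # each row becomes a probability distribution over the other tokens
    weights = softmax(scores, axis=-1)
    # output for each token is a relevance-weighted mix of value vectors
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d = 4, 8  # 4 tokens, 8-dimensional vectors (arbitrary toy sizes)
Q = rng.standard_normal((seq_len, d))
K = rng.standard_normal((seq_len, d))
V = rng.standard_normal((seq_len, d))

out = attention(Q, K, V)
print(out.shape)  # (4, 8): one mixed vector per token
```

The "matrix of relevance" is the `weights` matrix: row *i* says how much each earlier token should contribute when predicting from position *i*.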
It’s sophisticatedly predicting the next word based on a matrix of relevance to the previous word and the block as a whole.
Fine. I take that to mean that the population from which the next word is drawn changes from one cycle to the next. That makes sense to me. And the way it changes depends in part on the previous text, but also on what it had learned during training, no?
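Right, and that intuition can be sketched in a few lines. This toy example (assumed names, fake random weights standing in for training) shows the two ingredients: the trained parameters `W` are fixed, while the distribution the next word is drawn from changes with the context.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

vocab = ["cat", "dog", "sat", "mat"]  # toy vocabulary

# W plays the role of what the model "learned during training":
# row i gives the raw scores (logits) for the word after vocab[i].
rng = np.random.default_rng(1)
W = rng.standard_normal((len(vocab), len(vocab)))

def next_word_dist(prev_word):
    # the "population" the next word is drawn from: a probability
    # distribution shaped by BOTH the context (prev_word) and the
    # fixed, trained weights W
    return softmax(W[vocab.index(prev_word)])

p_after_cat = next_word_dist("cat")
p_after_dog = next_word_dist("dog")
# same weights, different context -> different distributions
print(p_after_cat)
print(p_after_dog)
```

Each generation cycle re-runs this with the updated context, which is exactly why the candidate population shifts from one word to the next even though the weights never change.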