“…with the only recursive element of its thought being that it can pass 16 bits to its next running”
I would point to the activations for all previous tokens (the KV cache) as the relevant “element of thought” that gets passed here, and those can run to gigabytes.
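To put a rough number on “gigabytes”, here is a back-of-the-envelope sketch of the KV-cache size, with hyperparameters I am assuming purely for illustration (roughly a 70B-class transformer, fp16, no grouped-query attention):

```python
# Rough size of the per-token state (KV cache) carried forward to later tokens.
# All hyperparameters below are assumptions for illustration, not a specific model.

n_layers = 80          # assumed number of transformer layers
d_model = 8192         # assumed hidden size (n_heads * head_dim)
bytes_per_value = 2    # fp16 / bf16 activations
context_len = 32_768   # assumed context length in tokens

# Each layer stores one key vector and one value vector per token.
bytes_per_token = 2 * n_layers * d_model * bytes_per_value
total_bytes = bytes_per_token * context_len

print(f"{bytes_per_token / 2**20:.1f} MiB per token")      # ~2.5 MiB
print(f"{total_bytes / 2**30:.1f} GiB for full context")   # ~80 GiB
```

So even under these ballpark assumptions, the state passed between token steps is tens of gigabytes, not 16 bits.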
Judging from the quote, I think his gripe is with the possibility of in-context learning, where human-like learning happens without anything that determines the network’s behavior (neither its weights nor its stored states for previous tokens) ostensibly being updated.
For every token, the model’s activations are computed once, when that token is encountered, and are never explicitly revised afterwards → thought “only [seems like it] goes in one direction”.
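A minimal sketch of that one-directionality, using a toy single-head attention step in NumPy (shapes and projections are made up for illustration): each step appends the new token’s key/value activations to the cache and reads the old ones, but never rewrites them.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy head dimension

# Toy projection matrices standing in for one attention head.
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

k_cache, v_cache = [], []  # activations carried forward between steps

def decode_step(x):
    """Process one new token embedding x. Earlier cache entries are read,
    but only ever appended to -- never recomputed or revised."""
    k_cache.append(x @ W_k)
    v_cache.append(x @ W_v)
    q = x @ W_q
    attn = softmax(np.stack(k_cache) @ q / np.sqrt(d))
    return attn @ np.stack(v_cache)

for _ in range(5):                        # five decoding steps
    out = decode_step(rng.standard_normal(d))
```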