Agree. If GPT-4 can solve 3-dim matrix multiplication with chain-of-thought, then doesn’t that mean you could just take the last layer’s output (before you generate a single token from it) and send it into other instances of GPT-4, and then chain together their outputs? That should by definition be enough “internal state-keeping” that you wouldn’t need it to do the “note-keeping” of chain-of-thought. And that’s precisely bayesed’s point—because from the outside, that kind of construct would just look like a bigger LLM. I think this is a clever post, but the bottlenecking created by token generation is too arbitrary a way to assess LLM complexity.
An LLM’s final-layer outputs are out of distribution for its input layer, so you can’t just pipe them back in. There is some research happening on deep model communication, but it has not yielded fruit yet AFAIK.
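To make the mismatch concrete, here’s a toy sketch (in no way GPT-4’s actual architecture; all names and sizes are made up for illustration). Token embeddings are typically initialized at a small scale, while a layer’s output hidden states live at a different, larger scale, so feeding one model’s last-layer states straight into another model’s input path hands it vectors unlike anything it was trained on:

```python
# Toy illustration of why chaining model A's final hidden states into
# model B's input layer is not free: B's input layer was trained on
# token embeddings, and A's hidden states have a different distribution.
# Everything here (sizes, init scales, the one-layer "model") is a
# simplifying assumption, not a claim about any real LLM.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 100, 16

def make_model():
    # Hypothetical minimal "model": an embedding table plus one layer.
    return {
        "embed": rng.normal(0.0, 0.02, size=(VOCAB, DIM)),  # small init, as token embeddings often are
        "layer": rng.normal(0.0, 1.0, size=(DIM, DIM)),
    }

def forward_from_tokens(model, tokens):
    h = model["embed"][tokens]          # (seq, DIM) token embeddings
    return np.tanh(h @ model["layer"])  # final hidden states

def forward_from_hidden(model, hidden):
    # Skip the embedding lookup entirely: feed another model's
    # hidden states straight into this model's layer.
    return np.tanh(hidden @ model["layer"])

model_a, model_b = make_model(), make_model()
tokens = rng.integers(0, VOCAB, size=8)

h_a = forward_from_tokens(model_a, tokens)   # model A's "last layer" output
chained = forward_from_hidden(model_b, h_a)  # piped into B with no token bottleneck

# The distribution mismatch: what B's input path expects (tiny embedding-scale
# vectors) vs. what it actually receives (much larger hidden-state-scale vectors).
emb_scale = np.abs(model_b["embed"][tokens]).mean()
hid_scale = np.abs(h_a).mean()
print(f"embedding scale ~{emb_scale:.3f}, hidden-state scale ~{hid_scale:.3f}")
```

Scale is only the simplest symptom; the direction and correlation structure of the vectors differ too, which is roughly what the "deep model communication" research would have to bridge.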