There is an architecture called RWKV which claims to have an ‘infinite’ context window (since it is similar to an RNN). It claims to be competitive with GPT-3. I have no idea whether this is worth taking seriously or not.
I don’t think it’s fair for them to claim that the model has an infinite context length. It appears that they can train the model as a transformer, but can turn the model into an RNN at inference time. While the RNN doesn’t have a context length limit as the transformer does, I doubt it will perform well on contexts longer than it has seen during training. There may also be limits to how much information can be stored in the hidden state, such that the model has a shorter effective context length than current SOTA LLMs.
There is an architecture called RWKV which claims to have an ‘infinite’ context window (since it is similar to an RNN). It claims to be competitive with GPT-3. I have no idea whether this is worth taking seriously or not.
I don’t think it’s fair for them to claim that the model has an infinite context length. It appears that they can train the model as a transformer, but can turn the model into an RNN at inference time. While the RNN doesn’t have a context length limit as the transformer does, I doubt it will perform well on contexts longer than it has seen during training. There may also be limits to how much information can be stored in the hidden state, such that the model has a shorter effective context length than current SOTA LLMs.
Two links related to RWKV to know more :
https://johanwind.github.io/2023/03/23/rwkv_overview.html
https://johanwind.github.io/2023/03/23/rwkv_details.html