Continuous learning is the capacity to keep adding to long-term memory as you go, and this would allow a language model to tackle much longer texts.
Cerebras are saying they can handle 50,000-token context windows. At roughly three-quarters of a word per token, that’s about 30K-40K words: the amount one might type in a day, typing quickly and without rest, or half a short novel.
This sort of context window makes improvements to short-term memory largely unnecessary: running within a single context window instantiates day-long spurs (temporary instances of human imitations whose detailed but short-lived experiences are to be forgotten), or bureaucracies of such spurs. Speaking an internal monologue into the context window to reason out complicated arguments also lifts any bound that one-step token prediction might otherwise place on them. If a bureaucracy were to prepare a report, the report could be added to the next batch of sequence-prediction training, improving whatever capabilities or alignment properties it was intended to improve.
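As a rough illustration of that last step, here is a minimal sketch, assuming a HuggingFace-style causal model ("gpt2" is just a stand-in for whatever long-context model is being run) and a made-up `reports.jsonl` file: a spur reasons inside one context window, its report is saved, and a later fine-tuning pass does ordinary next-token prediction on the accumulated reports.

```python
# A rough sketch, not the post's actual setup: "gpt2" and reports.jsonl
# are placeholders.
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def run_spur(task: str, max_new_tokens: int = 256) -> str:
    """One spur: reason step by step inside the context window and emit a report."""
    prompt = f"Task: {task}\nThink the problem through step by step, then write a report.\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True,
                         pad_token_id=tokenizer.eos_token_id)
    # Drop the prompt tokens, keep only the newly generated report.
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Accumulate reports; each batch later becomes sequence-prediction training data.
report = run_spur("Summarise what was decided today and why.")
with open("reports.jsonl", "a") as f:
    f.write(json.dumps({"text": report}) + "\n")
# A later pass would read reports.jsonl and continue next-token-prediction
# fine-tuning on it, folding the day's work into long-term memory.
```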
So all that remains is some fine-tuning, hopefully with conditioning and not RLHF.
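For contrast, one way "fine-tuning with conditioning" is sometimes cached out is decision-transformer-style tagging: prefix each training sequence with a quality tag, train with the ordinary next-token loss, and pick the tag you want at sampling time. The tags, rating function, and threshold below are illustrative assumptions, not anything specified here.

```python
# Illustrative only: the tags, rating function, and threshold are assumptions.
GOOD, BAD = "<|good|>", "<|bad|>"

def tag(text: str, rating: float, threshold: float = 0.5) -> str:
    """Prefix a training example with a quality tag derived from some rating."""
    return (GOOD if rating >= threshold else BAD) + text

# Training corpus: plain next-token prediction over tagged sequences,
# no reward model and no policy-gradient step anywhere.
corpus = [
    tag("A careful, honest answer to the question...", rating=0.9),
    tag("A sloppy or evasive answer to the question...", rating=0.2),
]

# At inference time the behavioural lever is just which tag you condition on.
prompt = GOOD + "Question: ...\nAnswer:"
```

The point of the contrast with RLHF is that nothing here optimises against a reward model; the model stays a sequence predictor, and the desired behaviour is selected by conditioning.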