You also have a simple algorithmic problem. Humans learn by replacing bad policy with good: a baby replaces "policy that drops objects picked up" with "policy that usually results in object retention".
This is because, at a mechanistic level, the baby tries many times to pick up and retain objects, and within a fixed amount of circuitry in its brain, connections that resulted in a drop are down-weighted and ones that resulted in retention are reinforced.
This means that over time, as the baby learns, the compute cost of motor manipulation remains constant. Technically O(1), though that's a bit of a confusing way to express it.
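To make the constant-cost point concrete, here's a minimal sketch (toy model, all dimensions and the outcome signal invented for illustration): the policy lives in a fixed block of weights, so acting and updating cost the same on attempt one and attempt ten thousand.

```python
# Minimal sketch: learning by weight updates keeps per-action compute constant.
# The "circuitry" is a fixed weight matrix; its size never grows with experience.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8)) * 0.1   # fixed circuitry: 8 sensor inputs -> 4 motor outputs
lr = 0.01

def act(obs):
    return W @ obs                   # O(1): cost independent of how many attempts came before

def reinforce(obs, action, retained):
    """Crudely strengthen connections on retention, down-weight them on a drop."""
    global W
    sign = 1.0 if retained else -1.0
    W += sign * lr * np.outer(action, obs)

for attempt in range(10_000):        # practice: many grasp attempts
    obs = rng.normal(size=8)
    action = act(obs)                # same compute at attempt 1 and attempt 10,000
    retained = rng.random() < 0.5    # stand-in for the real grasp outcome
    reinforce(obs, action, retained)
```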
With in-context-window learning, you can imagine an LLM+robot recording:
Robotic token string: <string of robotic policy tokens 1> : outcome, drop
Robotic token string: <string of robotic policy tokens 2> : outcome, retain
Robotic token string: <string of robotic policy tokens 3> : outcome, drop
And so on, extending until it consumes all of the machine's context window. Every time the machine decides which tokens to emit next, it needs O(n log n) compute to consider all n tokens in the window. (It used to be O(n^2); this is a huge advance.)
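A toy calculation (the tokens-per-attempt figure is invented) shows how per-decision compute climbs as the outcome log fills the window, in contrast to the constant cost above:

```python
# Toy illustration: per-decision compute when every past attempt stays in context.
# With n tokens of history, attending over them costs roughly n*log(n) per
# generated token (n^2 for vanilla attention). Units are arbitrary.
import math

tokens_per_attempt = 50          # hypothetical length of one policy string + outcome
history = 0
for attempt in range(1, 6):
    history += tokens_per_attempt
    nlogn = history * math.log2(history)
    n2 = history ** 2
    print(f"attempt {attempt}: context={history} tokens, "
          f"~{nlogn:,.0f} units (n log n) vs ~{n2:,} units (n^2)")
```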
This does not scale. You will not get capable or dangerous AI this way. Obviously you need to compress that linear list of outcomes from different strategies into an update to the underlying network that generated them, so that it becomes more likely to output tokens that result in success.
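Here is a hedged sketch of what that compression step could look like, assuming a PyTorch-style model that maps a 1-D tensor of token ids to per-position logits (the model interface and log format are assumptions, not a real API); a REINFORCE-style update folds the outcome log into the weights and then discards it:

```python
# Sketch of the compression step: fold the logged (policy_tokens, outcome)
# pairs into the generating network's weights, then empty the log so the
# context stays short and per-decision compute stays bounded.
import torch

def compress_log_into_weights(model, optimizer, episode_log):
    """episode_log: list of (token_id_tensor, reward) pairs; reward=1 retain, 0 drop."""
    optimizer.zero_grad()
    baseline = sum(r for _, r in episode_log) / len(episode_log)
    loss = torch.zeros(())
    for tokens, reward in episode_log:
        logits = model(tokens[:-1])                      # predict each next token
        logp = torch.log_softmax(logits, dim=-1)
        chosen = logp.gather(-1, tokens[1:].unsqueeze(-1)).sum()
        loss = loss - (reward - baseline) * chosen       # reinforce successful strings
    loss.backward()
    optimizer.step()
    episode_log.clear()                                  # the list is now "compressed" into weights
```

After this runs, the list of trials is gone; what remains is a network slightly more likely to emit the token strings that ended in retention, at no extra per-decision cost.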
The same holds for any other task you want the model to do. In-context learning scales poorly. This also makes it safe...