I found this lens very interesting!

Upon reflection, though, I begin to be skeptical that “selection” is any different from “reward.” Consider the description of model-training:
To motivate this, let’s view the above process not from the vantage point of the overall training loop but from the perspective of the model itself. For the purposes of demonstration, let’s assume the model is a conscious and coherent entity. From its perspective, the above process looks like:
Waking up with no memories in an environment.
Taking a bunch of actions.
Suddenly falling unconscious.
Waking up with no memories in an environment.
Taking a bunch of actions.
and so on.....
The model never “sees” the reward. Each time it wakes up in an environment, its cognition has been altered slightly such that it is more likely to take certain actions than it was before.
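For concreteness, here is roughly the kind of loop being described, written out as a tiny runnable toy (a two-armed bandit trained with REINFORCE; the setup is mine and purely illustrative, not anyone’s actual training code). The thing to notice is that the reward is never among the policy’s inputs; it only enters the weight update that happens between “lifetimes”:

```python
import math
import random

weights = [0.0, 0.0]          # the model's "synapses": one logit per action
learning_rate = 0.1

def policy(w):
    """Softmax over the logits. Note: reward appears nowhere in the inputs."""
    exps = [math.exp(x) for x in w]
    total = sum(exps)
    return [e / total for e in exps]

def reward_fn(action):
    """Computed outside the model, after it has already acted."""
    return 1.0 if action == 0 else 0.0

for episode in range(1000):
    probs = policy(weights)                           # "wakes up" and acts
    action = random.choices([0, 1], weights=probs)[0]
    r = reward_fn(action)                             # the model never receives this
    # REINFORCE update: nudge weights toward actions that happened to score well.
    for a in range(2):
        grad_log_pi = (1.0 if a == action else 0.0) - probs[a]
        weights[a] += learning_rate * r * grad_log_pi

print(policy(weights))   # strongly favours action 0, yet it never "saw" a reward
```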
What distinguishes this from how my brain works? The above is pretty much exactly what happens to my brain every millisecond:
It wakes up in an environment, with no memories[1]; just a raw causal process mapping inputs to outputs.
It receives some inputs, and produces some outputs.
It’s replaced with a new version—almost identical to the old version, but with some synapse weights and activation states tweaked via simple, local operations.
It wakes up in an environment...
and so on...
Why say that I “see” reward, but the model doesn’t?
Is it cheating to say this? I don’t think so. Both I and GPT-3 saw the sentence “Paris is the capital of France” in the past; both of us had our synapse weights tweaked as a result; and now both of us can tell you the capital of France. If we’re saying that the model doesn’t “have memories,” then, I propose, neither do I.
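To make the “a memory is just a weight tweak” point concrete, here is a toy sketch of my own (obviously nothing like GPT-3’s actual training): a few gradient steps on one fact, after which the fact can be reproduced even though no copy of the sentence is stored anywhere.

```python
import math

cities = ["paris", "london", "berlin"]
logits = [0.0, 0.0, 0.0]       # the "synapses" relevant to this one fact

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def observe_fact(correct_index, lr=1.0):
    """One exposure to the fact: a small cross-entropy gradient step on the
    logits, after which the sentence itself is discarded."""
    probs = softmax(logits)
    for i in range(len(logits)):
        target = 1.0 if i == correct_index else 0.0
        logits[i] += lr * (target - probs[i])

for _ in range(5):                        # read the sentence a few times
    observe_fact(cities.index("paris"))

probs = softmax(logits)
print(cities[probs.index(max(probs))])    # "paris", recalled from weights alone
```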
Your brain stores memories of its inputs, and also of previous thoughts you had and of the experience of taking actions. Within the “replaced with a new version” view of your brain’s time evolution (which is also the pure-functional-programming view of a process communicating with the outside world), we can say that the input it receives on the next iteration contains lots of information from the outputs it produced on the preceding iteration.
But with the reinforcement learning algorithm, the previous outputs are not given as input. Rather, the previous outputs are fed to the reward function, and the reward function’s output is fed to the gradient descent process, and that determines the future weights. It seems like a much noisier channel.
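A schematic way to picture that contrast, with toy stand-ins of my own (nothing here is real ML code):

```python
# (a) The interaction-loop channel: the previous output comes back, more or
#     less verbatim, as part of the next input, and explicit state is threaded
#     forward -- the pure-functional view of a process talking to a world.
def brain_step(state, observation):
    thought = "I just saw: " + observation
    return state + [thought], thought             # (new_state, output)

state, out = brain_step([], "a red door")
state, out = brain_step(state, "my own previous output was: " + out)
print(len(state), "thoughts carried forward intact")

# (b) The RL-training channel: a whole episode's outputs reach the next
#     "lifetime" only through the weight update, and the reward part of that
#     update is a single scalar summarising the episode.
episode_outputs = ["open door", "walk through", "close door"]
reward = 1.0 if "walk through" in episode_outputs else 0.0    # one number
grad_log_pi = [0.5, -0.5]        # pretend per-weight gradient, for shape only
learning_rate = 0.01
weights = [0.2, -0.1]
weights = [w + learning_rate * reward * g for w, g in zip(weights, grad_log_pi)]
print(weights)   # within this toy, only `reward` and `grad_log_pi` got through
```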
Also, individual parts of a brain (or of an ordinary computer program with random-access memory) can straightforwardly carry state forward that is mostly orthogonal to the state in other parts, which allows semi-independent modules to carry out particular algorithms. It seems to me that the model cannot do that: it cannot increase the bandwidth of its “train of thought while being trained” without inventing an encoding scheme that embeds that information into its performance on the desired task, such that the best performers are also the ones that will think the next thought. It seems fairly implausible to me that a model would learn to execute such an internal communication system while still outcompeting models “merely” performing the task being trained.
(Disclaimer: I’m not familiar with the details of ML techniques; this is just loose abstract thinking about that particular question of whether there’s actually any difference.)