Looks good to me! Minor comment: I think this article could benefit from being peppered with analogies to biological systems a bit more. Example:
Well, there is one more thing. A chatbot like this isn’t necessarily very well-behaved. The text generator is not coming up with the best response, by any definition of “best”—it’s entirely based on predictions, which means it’s just coming up with the most likely response. And since a lot of the training is from the Internet, the most likely response is probably not what we want a chatbot to say.
Reward is not the optimization target. Just because the text generator was created by a process that trained it to / rewarded it for coming up with the most likely response, doesn’t mean that “it tries to come up with the most likely response” is a perfect or even the best predictor of its behavior. It so happens that it works pretty well in this case, but in general it’s important to avoid equating “the thing the system was rewarded for doing” with “the thing the system is trying to do / can be modelled as trying to do.”
Analogy: Humans were created by natural selection, which shaped them over many generations to be good at reproducing. But humans are not well-modelled as fitness-maximizers. I mean, it’s an OK model; it’s probably helpful as a first approximation. But there are definitely important cases where humans will deliberately do things that hurt their fitness massively (birth control being the obvious example).
So if I were you I’d say something like: “The neural net started off random, and then the training process reinforced subnetworks that contributed to success and dampened subnetworks that contributed to failure, and eventually what remained was very good at achieving success with high probability. And in the pre-training phase, ‘success’ means predictive accuracy; later on in fine-tuning, ‘success’ means something different.”
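(To make that picture concrete, here’s a rough sketch of what such a pre-training loop might look like. None of this comes from the essay; the tiny model, the names, and the random “data” below are invented purely for illustration. The point is the shape of the loop: weights start random, “success” is just next-token predictive accuracy, and each update reinforces whatever contributed to it and dampens whatever didn’t.)

```python
# Illustrative sketch only: a toy next-token predictor trained on made-up data.
# All names (TinyLM, vocab_size, etc.) are hypothetical, not from the essay.

import torch
import torch.nn as nn

vocab_size, embed_dim, context_len = 100, 32, 8

class TinyLM(nn.Module):
    """A deliberately tiny next-token predictor; stands in for a real transformer."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, embed_dim, batch_first=True)
        self.head = nn.Linear(embed_dim, vocab_size)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)  # logits over the next token at each position

model = TinyLM()                           # "started off random"
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()            # pre-training "success" = predictive accuracy

for step in range(200):
    # Stand-in for real training text: random token sequences.
    batch = torch.randint(0, vocab_size, (16, context_len + 1))
    inputs, targets = batch[:, :-1], batch[:, 1:]

    logits = model(inputs)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))

    opt.zero_grad()
    loss.backward()   # credit assignment: which weights helped / hurt prediction?
    opt.step()        # reinforce the helpful ones, dampen the harmful ones
```

Fine-tuning keeps roughly the same loop but changes what counts as “success” (e.g. a reward signal instead of raw predictive accuracy), which is where the distinction above starts to matter.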
Thanks, I tweaked the wording a bit in this paragraph, and I tried to explain later in the essay what it even means for a system to be “trying” to do something.