Its utility function is pretty simple and explicitly programmed. It wants to find the best token, where ‘best’ is mostly the same as ‘the most likely according to the data I’m trained on’, with a few other particulars (where you can adjust how ‘creative’ vs. plagiarizer-y it should be).
That’s a utility function. GPT is what’s called a hill-climbing algorithm. It must have a simple, straightforward utility function hard-coded right in there for it to assess whether a given choice is ‘climbing’ or not.
That’s the training signal, not the utility function. Those are different things. (I believe this point was made in Reward is not the Optimization Target, though I could be wrong since I never actually read that post; corrections welcome.)
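(As an illustrative aside, not part of the thread: here is a minimal sketch of the two things being conflated above, the cross-entropy training signal that is only used to update weights during training, and the decoding-time temperature knob behind the ‘creative’ vs. ‘most likely’ trade-off. The vocabulary, logits, and function names are all made up for illustration.)

```python
import numpy as np

# Toy vocabulary and raw model scores for the next token (invented values).
vocab = ["the", "cat", "sat", "mat"]
logits = np.array([2.0, 1.0, 0.5, -1.0])

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Training signal: cross-entropy against the token that actually came next
# in the data. This number only matters during training, via its gradient;
# the deployed model never computes or consults it when generating text.
target_index = vocab.index("cat")
loss = -np.log(softmax(logits)[target_index])

rng = np.random.default_rng(0)

# Decoding: temperature rescales the logits before sampling. Low temperature
# approaches "always pick the most likely token"; higher temperature gives
# the more 'creative' behaviour mentioned above. No utility function is
# evaluated here, just a reshaped probability distribution.
def sample(logits, temperature=1.0):
    probs = softmax(logits / temperature)
    return vocab[rng.choice(len(vocab), p=probs)]

print(f"training loss on this example: {loss:.3f}")
print("near-greedy (T=0.1):", sample(logits, temperature=0.1))
print("creative    (T=1.5):", sample(logits, temperature=1.5))
```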
I’m going to disagree here.