I’ll use the definition of optimization from Wikipedia: “Mathematical optimization is the selection of a best element, with regard to some criteria, from some set of available alternatives”.
Best-of-n or rejection sampling is an alternative to RLHF that involves generating n responses from an LLM and returning the one with the highest reward model score. I think it’s reasonable to describe this process as optimizing for reward, because it’s searching for the LLM output that achieves the highest reward from the reward model.
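As a minimal sketch of what that looks like in code (the `generate` and `reward_model` callables below are placeholders, not any particular library's API):

```python
def best_of_n(prompt, generate, reward_model, n=16):
    """Best-of-n / rejection sampling: sample n candidate responses
    and return the one the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward_model(prompt, response) for response in candidates]
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index]
```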
I’d also argue that AlphaGo/AlphaZero is optimizing for reward. In the AlphaGo paper it says, “At each time step $t$ of each simulation, an action $a_t$ is selected from state $s_t$ so as to maximize action value plus a bonus”, and the formula is $a_t = \operatorname{argmax}_a \big( Q(s_t, a) + u(s_t, a) \big)$, where $u$ is an exploration bonus.
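A sketch of that selection rule, assuming a PUCT-style bonus in which $u(s, a)$ grows with the policy prior $P(s, a)$ and shrinks with the visit count $N(s, a)$; the node/edge data structures here are my own simplification, not the paper's implementation:

```python
import math

def select_action(node, c_puct=1.0):
    """Pick the action maximizing Q(s, a) + u(s, a), where u is an
    exploration bonus favoring high-prior, rarely-visited actions.
    `node.edges` is assumed to map each action to an edge object with
    `prior`, `visit_count`, and `q_value` attributes."""
    total_visits = sum(edge.visit_count for edge in node.edges.values())

    def score(edge):
        u = c_puct * edge.prior * math.sqrt(total_visits) / (1 + edge.visit_count)
        return edge.q_value + u

    return max(node.edges, key=lambda action: score(node.edges[action]))
```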
Action values Q are calculated as the mean value (estimated probability of winning) of all board states in the subtree below an action. The value of each possible future board state is calculated using a combination of a value function estimation for that state and the mean outcome of dozens of random rollouts until the end of the game (return +1 or −1 depending on who wins).
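Roughly, the AlphaGo paper evaluates a leaf by blending the value network's prediction with the outcome of a rollout (mixed with a parameter λ, reported as 0.5), and Q stays the mean of those leaf evaluations over the subtree. A hedged sketch, reusing the edge attributes assumed above and treating `value_network` and `random_rollout` as placeholders:

```python
def evaluate_leaf(state, value_network, random_rollout, mix=0.5):
    """Blend the value network's prediction with a rollout outcome.
    `mix` plays the role of lambda in the paper's mixed evaluation."""
    v = value_network(state)   # predicted outcome, roughly in [-1, 1]
    z = random_rollout(state)  # actual game outcome: +1 or -1
    return (1 - mix) * v + mix * z

def backup(path, leaf_value):
    """Update each edge on the search path so that Q remains the mean
    of all leaf evaluations in the subtree below that edge."""
    for edge in path:
        edge.visit_count += 1
        edge.total_value += leaf_value
        edge.q_value = edge.total_value / edge.visit_count
```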
The value function predicts the return (the expected sum of future reward) from a position, whereas the random rollouts estimate the actual average reward by simulating future moves until the end of the game, where the reward function returns +1 or −1.
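The rollout itself can be sketched as playing random moves to the end of the game and reporting the terminal reward; `player_to_move`, `is_terminal`, `legal_moves`, `apply_move`, and `winner` are hypothetical helpers on the state object, not a real API:

```python
import random

def random_rollout(state):
    """Play random moves until the game ends, then return the actual
    reward: +1 if the player to move in `state` wins, -1 otherwise."""
    player = state.player_to_move()
    while not state.is_terminal():
        move = random.choice(state.legal_moves())
        state = state.apply_move(move)
    return 1 if state.winner() == player else -1
```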
So I think AlphaGo is optimizing for a combination of predicted reward (from the value function) and actual reward calculated from multiple rollouts to the end of the game. (AlphaZero drops the rollouts and relies on the value network alone, so there the target is predicted reward only.)