Thanks for the clarification! From OpenAI’s announcement, it looks like this ranking only occurs during the finetuning portion of training (Step 2). But the user doesn’t have the opportunity to provide this feedback after deployment. So are you suggesting that ChatGPT gets aligned to the values of the human contractor(s) that provide data during finetuning, and then carries these values forward when interacting with users? I’m asking because one of the key benefits of CIRL games (also called “assistance games”) is that they allow the AI to continuously update towards the user’s values, without freezing for deployment, and I don’t fully understand the connection here.
You are correct that this appears to stand in contrast to one of the key benefits of CIRL games, namely that they allow the AI to continuously update towards the user’s values. The argument I present is that ChatGPT can still learn something about the preferences of the user it is interacting with through in-context value learning: during deployment, ChatGPT can infer preferences from the conversation itself, allowing for continuous updating towards the user’s values, as in a CIRL game.
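To make the analogy concrete, the in-context updating I have in mind can be sketched as a toy Bayesian preference inference, in the spirit of CIRL's model of the assistant maintaining uncertainty over the human's reward. This is purely illustrative: the candidate styles, feedback signals, and likelihood numbers are assumptions for the sketch, not a claim about ChatGPT's actual mechanism.

```python
# Toy sketch of in-context preference updating: the assistant keeps a
# posterior over candidate user preferences and updates it after each
# piece of in-conversation feedback, CIRL-style. All specifics here
# (styles, approval probabilities) are illustrative assumptions.

candidate_styles = ["concise", "detailed"]
posterior = {s: 0.5 for s in candidate_styles}  # uniform prior

def likelihood(approved: bool, reply_style: str, true_pref: str) -> float:
    # Assumed model: users approve replies matching their preferred
    # style with probability 0.9, and mismatched replies with 0.2.
    p_approve = 0.9 if reply_style == true_pref else 0.2
    return p_approve if approved else 1.0 - p_approve

def update(posterior, reply_style, approved):
    # Standard Bayes update: weight each hypothesis by the likelihood
    # of the observed feedback, then renormalize.
    unnorm = {s: p * likelihood(approved, reply_style, s)
              for s, p in posterior.items()}
    z = sum(unnorm.values())
    return {s: v / z for s, v in unnorm.items()}

# Within one conversation, the user rejects a detailed reply and then
# approves a concise one; the posterior shifts towards "concise".
posterior = update(posterior, "detailed", approved=False)
posterior = update(posterior, "concise", approved=True)
```

The point of the sketch is that no weight update is needed: the "learning" lives entirely in state accumulated during the interaction, which is the sense in which in-context inference can recover the continuous-updating property of a CIRL game even though the model's parameters are frozen at deployment.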