Do I read right that the suggestion is as follows:
1. Overall, we want to do inverse RL (like in our paper), but we need an invertible model that maps human reward functions to human behavior.
2. You use an LM as this model. It needs to take some useful representation of reward functions as input (which it could if those reward functions are a subset of natural language).
3. You observe a human's behavior and invert the LM to infer the reward function that produced the behavior (or the set of compatible reward functions).
4. Then you train a new model using this reward function (or functions) to outperform humans.
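If I've got that right, the rough shape in code might be something like the sketch below. The `lm_logprob` scorer, the explicit candidate-reward list, and the prompt wording are all my own placeholders for illustration, not anything from the proposal:

```python
# Rough sketch of steps 1-4, with the LM as the forward model reward -> behavior.
# lm_logprob(prompt, continuation) is a hypothetical stand-in for
# "log P(continuation | prompt) under some LM", not a real API.
from typing import Callable, List

LogProbFn = Callable[[str, str], float]  # (prompt, continuation) -> log-prob


def forward_prompt(reward_description: str) -> str:
    """Steps 1-2: condition the LM on a natural-language reward description."""
    return f"A person whose goal is: {reward_description}\nThey would most likely:"


def invert_lm(lm_logprob: LogProbFn,
              observed_behavior: str,
              candidate_rewards: List[str],
              top_k: int = 3) -> List[str]:
    """Step 3: approximate inversion by scoring how well each candidate reward
    description explains the observed behavior, keeping the best few
    (a set, since the mapping may be one-to-many)."""
    ranked = sorted(
        candidate_rewards,
        key=lambda r: lm_logprob(forward_prompt(r), observed_behavior),
        reverse=True,
    )
    return ranked[:top_k]


# Step 4 would then hand the inferred reward(s) to an ordinary RL algorithm
# to train a new policy.
```

The search over an explicit candidate list is just the simplest way I can see to make "inverting the LM" concrete; anything smarter (e.g. sampling reward descriptions from the LM itself) would slot in at the same place.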
This sounds pretty interesting! I do see some challenges, though:
How can you represent the reward function? On the one hand, an LM (or another behaviorally cloned model) should use it as an input, so it should be represented as natural language. On the other hand, some algorithm should maximize it in the final step, so it would ideally be a function that maps inputs to rewards. (One possible bridge is sketched below.)
Can the LM generalize OOD far enough? It's trained on human language, which may contain some natural-language descriptions of reward functions, but probably not the 'true' reward function, which is complex and hard to describe, meaning it's OOD.
How can you practically invert an LM?
What to do if multiple reward functions explain the same behavior? (probably out of scope for this post)
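On the representation challenge above, one bridge I could imagine is keeping the reward in natural language and wrapping the LM itself as the scorer, so the final-step optimizer still sees an ordinary function from trajectories to rewards. A rough sketch; `lm_score` is a hypothetical helper returning a number in [0, 1], and the prompt wording is purely illustrative:

```python
# Keep the reward as a natural-language description, but expose it to the RL
# algorithm as a callable trajectory -> scalar reward. lm_score(prompt) is a
# hypothetical LM call returning a score in [0, 1]; it is an assumption here.
from typing import Callable

ScoreFn = Callable[[str], float]  # prompt -> score in [0, 1]


def make_reward_fn(lm_score: ScoreFn, reward_description: str) -> Callable[[str], float]:
    """Turn a natural-language reward description into a function that a
    standard RL algorithm can maximize in the final step."""
    def reward(trajectory_text: str) -> float:
        prompt = (
            f"Goal: {reward_description}\n"
            f"Trajectory: {trajectory_text}\n"
            "On a scale from 0 to 1, how well does this trajectory achieve the goal?"
        )
        return lm_score(prompt)

    return reward
```

This only addresses the representation mismatch; the OOD question above still applies to the scorer itself.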
I see. In that case, what do you think of my suggestion of inverting the LM? By default, it maps human reward functions to behavior. But when you invert it, it maps behavior to reward functions (possibly a one-to-many mapping, but that ambiguity is something you could resolve with more diverse behavior data; a rough sketch is at the end of this comment). Then you could use it for IRL (with the caveats I mentioned).
Which may be necessary since this:
The LM itself is directly mapping human behaviour (as described in the prompt) to human rewards/goals (described in the output of the LM).
...seems like an unreliable mapping, since any training data of the form “person did X, therefore their goal must be Y” is, firstly, rare and, more importantly, inaccurate/incomplete, because it's hard to describe human goals in language. Human behavior, on the other hand, seems easier to describe in language.
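Here's the rough sketch I had in mind for the "more diverse behavior data" point: accumulate each candidate reward's likelihood over many observed episodes and keep only the candidates that stay competitive. The `lm_logprob` scorer, the prompt wording, and the tolerance rule are all my own assumptions, just to make the idea concrete:

```python
# With one behavior sample, many reward descriptions may explain it about
# equally well; with diverse samples, summing log-likelihoods usually thins
# the set out. lm_logprob(prompt, continuation) is the same hypothetical
# scorer as in the earlier sketch; the tolerance band is an arbitrary choice.
from typing import Callable, Dict, List

LogProbFn = Callable[[str, str], float]


def compatible_rewards(lm_logprob: LogProbFn,
                       behaviors: List[str],
                       candidate_rewards: List[str],
                       tolerance: float = 5.0) -> List[str]:
    totals: Dict[str, float] = {
        r: sum(
            lm_logprob(f"A person whose goal is: {r}\nThey would most likely:", b)
            for b in behaviors
        )
        for r in candidate_rewards
    }
    best = max(totals.values())
    # Return every candidate whose total log-likelihood is within `tolerance`
    # of the best one, i.e. the set of rewards still compatible with the data.
    return [r for r, total in totals.items() if total >= best - tolerance]
```

If the set stays large even with diverse data, that's the genuine unidentifiability case from my last question above.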
Can you clarify: are you talking about inverting the LM as a function or algorithm, or constructing prompts to elicit different information (while using the LM as normal)?
For myself, I was thinking of using ChatGPT-style approaches with multiple queries: what is your prediction for their preferences, how could that prediction be checked, what more information would you need, and so on.
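Concretely, I imagine something like the loop below, keeping the LM in its normal forward mode and just asking it a chain of questions about the observed behavior. The `chat` function is a placeholder for whatever chat interface is available, and the query wording is only illustrative:

```python
# Multi-query elicitation, with the LM used as normal rather than inverted.
# chat(history) is a hypothetical function taking a list of (role, text)
# turns and returning the model's reply; it stands in for any chat interface.
from typing import Callable, List, Tuple

ChatFn = Callable[[List[Tuple[str, str]]], str]


def elicit_preferences(chat: ChatFn, behavior_description: str) -> List[str]:
    history: List[Tuple[str, str]] = [
        ("user", f"Here is what the person did: {behavior_description}")
    ]
    queries = [
        "What is your prediction for their preferences?",
        "How could that prediction be checked?",
        "What more information would you need?",
    ]
    answers: List[str] = []
    for query in queries:
        history.append(("user", query))
        reply = chat(history)
        history.append(("assistant", reply))
        answers.append(reply)
    return answers
```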