Awesome, thanks for the feedback, Eric! And glad to hear you enjoyed the post!
> I’m confused why you’re using a neural network
Good point; for the example in the post it was total overkill. The reason I went with a NN was to demonstrate the link with the usual setting in which preference learning is applied. And in general, NNs generalize better than the table-based approach (see also my response to Charlie Steiner).
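For concreteness, here is a minimal sketch of that usual setting, assuming a Bradley-Terry-style setup: a small network assigns a scalar utility to each option and is trained on pairwise comparisons. All names and the toy data are hypothetical, not the code from the post.

```python
import torch
import torch.nn as nn

# Hypothetical utility network: maps a 4-dim feature vector to a scalar
# utility. The architecture and sizes are arbitrary choices for the sketch.
utility_net = nn.Sequential(
    nn.Linear(4, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
)
optimizer = torch.optim.Adam(utility_net.parameters(), lr=1e-2)

# Toy data: pairs where the human preferred `preferred[i]` over `rejected[i]`.
preferred = torch.randn(32, 4)
rejected = torch.randn(32, 4)

for _ in range(100):
    u_a = utility_net(preferred)
    u_b = utility_net(rejected)
    # Bradley-Terry model: P(a preferred over b) = sigmoid(u(a) - u(b)).
    # Minimizing the negative log-likelihood pushes preferred options
    # toward higher utility than rejected ones.
    loss = -torch.nn.functional.logsigmoid(u_a - u_b).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Unlike a lookup table over observed pairs, the trained network can also score options it has never seen compared, which is the generalization advantage mentioned above.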
> happy to chat about that
I definitely plan to write a follow-up to this post; I’ll come back to your offer when that follow-up reaches the front of my queue :)
> But there doesn’t seem to be any point where we can’t infer the best possible approximation at all.
I hadn’t thought about this before! Perhaps it could work to compare the inferred utility function with a random baseline? I.e., the baseline policy would be “for every comparison, flip a coin and make that your prediction about the human preference”.
If coin-flipping happens to accurately describe how the human makes the decision, then the utility function should not be able to perform better than the baseline (and might even perform worse). How much more structure can we add to the human choice before the utility function performs better than the random baseline?
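As a rough sketch of what that comparison might look like, assuming we score both predictors by accuracy on held-out pairwise comparisons (the utility values and comparison data below are hypothetical toy inputs):

```python
import random

def accuracy(predict, comparisons):
    """Fraction of comparisons where the predictor matches the human label.
    Each comparison is (a, b, label), with label == 1 meaning a was preferred."""
    correct = sum(predict(a, b) == label for a, b, label in comparisons)
    return correct / len(comparisons)

# Hypothetical inferred utility function, represented as a score per option.
utility = {"apple": 2.0, "pear": 1.0, "fig": 0.5}

# Toy held-out human comparisons.
comparisons = [("apple", "pear", 1), ("fig", "apple", 0), ("pear", "fig", 1)]

# Predictor derived from the inferred utility function.
utility_predict = lambda a, b: int(utility[a] > utility[b])

# Coin-flip baseline: ignore the options entirely.
random_predict = lambda a, b: random.randint(0, 1)

print("utility function:", accuracy(utility_predict, comparisons))
print("random baseline:", accuracy(random_predict, comparisons))
```

If the human really does choose by coin flip, both predictors should hover around 0.5 in expectation; the interesting question is how much structure the choices need before the utility function pulls ahead.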
> it’s not obvious to me that approximating inconsistent preferences using a utility function is the “right” thing to do
True! I guess one proposal for resolving these inconsistencies is CEV (coherent extrapolated volition), although that is far from computable in practice.