This is one of the bigger reasons why I really don’t like RLHF: inevitably you’re going to have to rely on a whole lot of human raters who know less than they ideally would about philosophy, at least as it pertains to AI alignment.
What would these humans do differently, if they knew about philosophy? Concretely, could you give a few examples of “Here’s a completion that should be positively reinforced because it demonstrates correct understanding of language, and here’s a completion of the same text that should be negatively reinforced because it demonstrates incorrect understanding of language”? (Bear in mind that the prompts shouldn’t be about language, as that would probably just teach the model what to say when it’s discussing language in particular.)
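To make the request concrete, here is a minimal sketch of the form such examples would have to take in an RLHF pipeline: pairs of completions for the same prompt, where a rater marks which one should be reinforced, and the reward model is fit with the standard Bradley–Terry pairwise objective. The prompt and completions below are entirely hypothetical placeholders, not examples anyone in this thread has endorsed.

```python
import math

# Hypothetical preference pair of the kind asked for above: the same prompt
# with two completions, where a rater marks which completion should be
# positively reinforced. The content here is illustrative only.
preference_pairs = [
    {
        "prompt": "The patient said the treatment made her feel 'lighter'.",
        "chosen": "She may mean relief from anxiety rather than weight loss; "
                  "the word is probably being used figuratively.",
        "rejected": "The treatment reduced her body weight.",
    },
]

def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss commonly used to fit an RLHF reward model: the model
    is penalized unless it scores the chosen completion above the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected))))

# Toy illustration: a reward model that ranks the pair correctly incurs a
# small loss; one that ranks it backwards incurs a large loss.
print(bradley_terry_loss(2.0, -1.0))  # ~0.049
print(bradley_terry_loss(-1.0, 2.0))  # ~3.049
```

The point of the question is what would go in the `chosen` and `rejected` slots, for prompts that aren’t explicitly about language, such that philosophically informed raters would label them differently from uninformed ones.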
It’s impossible for the utility function of the AI to be amenable to humans if the AI doesn’t use language the same way humans do.
What makes you think that humans all use language the same way, if there’s more than one plausible option? People are extremely diverse in their perspectives.