By “claims about a space of questions” you mean “a claim about every question from a space of questions”?
Sorry, I wrote that incorrectly; I meant "the agent can choose a question from a space of questions and make a claim about it". If you want to support claims about an entire space of questions, you could allow quantifiers in your questions.
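To make the distinction concrete, here's a very rough sketch of what I have in mind (all the names and types below are mine, purely for illustration, and not anything from the original post):

```python
# Illustrative sketch only: the types and helper functions are hypothetical.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Question:
    text: str

@dataclass
class Claim:
    # A claim about a single question chosen from the space.
    question: Question
    answer: bool

@dataclass
class QuantifiedClaim:
    # A claim with a universal quantifier: "every question in `space`
    # has answer `answer`."
    space: List[Question]
    answer: bool

def agent_claim(space: List[Question],
                choose: Callable[[List[Question]], Question],
                answer: Callable[[Question], bool]) -> Claim:
    """Base version: the agent picks one question from the space and
    makes a claim about that question only."""
    q = choose(space)
    return Claim(question=q, answer=answer(q))

def quantified_claim(space: List[Question], answer: bool) -> QuantifiedClaim:
    """Extension: a single claim that quantifies over the whole space."""
    return QuantifiedClaim(space=space, answer=answer)

# Example usage (hypothetical):
# space = [Question("Is the sky blue?"), Question("Is 2 + 2 = 4?")]
# single = agent_claim(space, choose=lambda qs: qs[0], answer=lambda q: True)
# universal = quantified_claim(space, answer=True)
```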
However, you can also get the same behaviour in a typical world if you assume that the judge has a wrong prior.
I mean, sure, but any alignment scheme is going to have to assume some amount of correctness in the human-generated information it is given. You can’t learn about preferences if you model humans as arbitrarily wrong about their preferences.
I broadly agree with all of this, thanks :)