I mostly agree with what Paul said re: using various techniques to improve the evaluation of f+ so that you can test it on more open-ended questions. That said, I’m more optimistic that, if you can get the initial training procedure right, you can rely on generalization to fill in the rest. Specifically, I’m imagining a situation where the training dataset is of the narrower form you talk about, such that f+ and f− always agree (as in Step 3 here), but where the deployment setting wouldn’t necessarily have to be of this form: once you’re confident that you’ve actually learned f+ and not e.g. f−, you can use it for all sorts of things that would never appear in that training dataset. (The hard part, of course, is ever actually being confident that you did in fact learn the intended model.)
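To pin down the shape of the setup I have in mind, here’s a minimal toy sketch (my own notation, not anything from the post; f_plus and f_minus stand in for f+ and f−): the only property the training set needs is that the two models are indistinguishable on it, while deployment questions are under no such constraint.

```python
# Toy formalization of the condition above (my own gloss, purely illustrative).

def agree_on_training_set(f_plus, f_minus, training_questions):
    """The narrow-training-set condition: f_plus and f_minus answer identically there."""
    return all(f_plus(q) == f_minus(q) for q in training_questions)

# At deployment we simply query whichever model was actually learned, on arbitrary
# questions; nothing restricts those to the narrow form, which is where the bet on
# generalization comes in.
```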
(Also, thanks for catching the typo—it should be fixed now.)
Having thought about it more (hopefully with more clarity), I think I have trouble imagining training data for f+ that:
We’re highly confident is correct.
Enables the model to decide which true things to output in general. (my (2) here)
It seems to me that we can be highly confident about matters of fact (how many chairs are in this room...), but less confident once value judgements come into play (which of A or B is the better answer to “How should I go about designing a chair?”).
[Of course it’s not black-and-white: one can make a philosophical argument that all questions are values questions. However, I think this is an issue even if we stick to pragmatic, common-sense approaches.]
I don’t think we can remedy this for values questions by including only data that we’re certain of. It seems to me that this works for facts questions because of the structure of the world: it’s so hugely constrained by physical law that you can get an extremely good model by generalizing from sparse data from a different distribution.
It’s not clear that anything analogous works for generalizing preferences (maybe?? but I’d guess not). I’d expect an f+ trained on [data we’re highly confident is correct] to generalize poorly to general open questions.
Similarly, in Paul’s setup I think the following condition will fail if we need to be highly confident of the correctness (relative to what is known) of the small dataset:
The small dataset is still rich enough that you could infer correct language usage from it, i.e. the consistency condition on the small dataset alone suffices to recover all 10,000 bits required to specify the intended model.
It’s entirely plausible you can learn “correct language usage” in the narrow sense from consistency on the small dataset (i.e. you may infer a [deduced_statement → natural_language_equivalent] mapping). I don’t think it’s plausible you learn it in the sense required (i.e. a [(set_of_all_deduced_statements, Q) → natural_language_answer] mapping).
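To make the distinction between the two senses explicit, here’s a toy rendering as type signatures (my own notation; the names are hypothetical, just to show the shapes of the two mappings):

```python
from typing import Callable, FrozenSet

DeducedStatement = str        # an internal, deduced proposition
Question = str
NaturalLanguageAnswer = str   # an English rendering

# The narrow sense, which consistency on the small dataset might plausibly pin down:
# a per-statement translation map.
TranslateStatement = Callable[[DeducedStatement], NaturalLanguageAnswer]

# The sense I think is actually required: given everything that has been deduced and
# a question, decide which true thing to say and how to say it.
AnswerQuestion = Callable[[FrozenSet[DeducedStatement], Question], NaturalLanguageAnswer]
```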
Again, perhaps I’m (not even) wrong, but I think the above accurately describes my current thinking.
Ok, I think that makes some sense insofar as you’re softening the f+ = f− constraint and training it in more open-ended conditions. I’m not currently clear where this gets us, but I’ll say more about that in my response to Paul.
However, I don’t see how you can use generalization from the kind of dataset where f+ and f− always agree (having asked prescriptive questions). [EDIT: now I do, I was just thinking particularly badly]
I see honestly answering a question as a 2-step process (conceptually):
1) Decide which things are true.
2) Decide which true thing to output.
In the narrow case, we’re specifying ((2) | (1)) in the question, and training the model to do (1). Even if we learn a model that does (1) perfectly (in the intended way), it hasn’t learned anything that can generalize to (2).
Step (2) is in part a function of human values, so we’d need to be giving it some human-values training signal for it to generalize.
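To make the worry concrete, here’s a minimal sketch of the decomposition (the helper names are hypothetical stand-ins, purely for illustration, not components of any actual proposal): in the narrow case the question itself fixes the second step, so training only ever constrains the first.

```python
# Toy sketch of the two-step decomposition of honestly answering a question.

def honest_answer(question, deduce_true_statements, choose_output):
    # Step (1): decide which things are true, given the question and the model's knowledge.
    true_statements = deduce_true_statements(question)
    # Step (2): decide which of those true things to actually output. This is the step
    # that depends in part on human values/pragmatics, so a training signal that only
    # exercises step (1) needn't pin it down.
    return choose_output(true_statements, question)
```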
[EDIT: I’ve just realized that I’m being very foolish here. The above suggests that learning (1) doesn’t necessarily generalize to (2). In no way does it imply that it can’t. I think the point I want to make is that an f+ that does generalize extremely well in this way is likely to be doing some close equivalent to predicting-the-human. (in this I’m implicitly claiming that doing (2) well in general requires full understanding of human values)]
Overall, I’m still unsure how to describe what we want: clearly we don’t trust Alice’s answers if she’s being blackmailed, but how about if she’s afraid, mildly anxious, unusually optimistic, slightly distracted, thinking about concept a or b or c...?
It’s clear that the instrumental model just gives whatever response Alice would give here.
I don’t know what the intended model should do; I don’t know what “honest answer” we’re looking for.
If the situation has property x, and Alice has reacted with unusual-for-Alice property y, do we want the Alice-with-y answer, or the standard-Alice answer? It seems to depend on whether we decide y is acceptable (or even required) w.r.t. answer reliability, given x. Then I think we get the same problem on that question, etc.