Thanks Caspar, your comments here and on earlier drafts are appreciated. We’ll expand more on the positioning within the related literature as we develop this into a paper.
As for your work on Decision Scoring Rules and the proposal in your comment, the biggest distinction is that this post’s proposal does not require specifying the decision maker’s utility function in order to reward one of the predictors and shape their behavior into maximizing it. That seems very useful to me: if we were able to properly specify the desired utility function, we could skip using predictive models and just train an AI to maximize that instead (modulo inner alignment).
>the biggest distinction is that this post’s proposal does not require specifying the decision maker’s utility function in order to reward one of the predictors and shape their behavior into maximizing it.
Hmm… Johannes made a similar argument in personal conversation yesterday. I’m not sure how convinced I am by this argument.
So first, here’s one variant of the proper decision scoring rules setup where we also don’t need to specify the decision maker’s utility function: Ask the predictor for her full conditional probability distribution for each action. Then take the action that is best according to your utility function and the predictor’s conditional probability distribution. Then score the predictor according to a strictly proper decision scoring rule. (If you think of strictly proper decision scoring rules as taking only a predicted expected utility as input, you have to first calculate the expected utility of the reported distribution, and then score that expected utility against the utility you actually obtained.) (Note that if the expert has no idea what your utility function is, they are now strictly incentivized to report fully honestly about all actions! The same is true in your setup as well, I think, but in what I describe here a single predictor suffices.) In this setup you also don’t need to specify your utility function.
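A minimal Python sketch of the single-predictor variant described above, under stated assumptions. It assumes a finite outcome set and a scoring rule of the form s(m, y) = f(m) + f'(m)(y - m) with f = exp, chosen here as one convex, increasing f for illustration; whether this matches the exact family in the decision scoring rules paper is an assumption, not something confirmed in this thread. The function names and toy numbers are hypothetical.

```python
import math

def expected_utility(dist, utility):
    """Expected utility of a reported distribution {outcome: probability}."""
    return sum(p * utility[o] for o, p in dist.items())

def choose_action(reports, utility):
    """Take the action that is best under the predictor's reported distributions."""
    return max(reports, key=lambda a: expected_utility(reports[a], utility))

def decision_score(predicted_eu, realized_utility):
    """Assumed rule: s(m, y) = f(m) + f'(m) * (y - m) with f = exp."""
    return math.exp(predicted_eu) * (1.0 + realized_utility - predicted_eu)

def expected_score(reported_eu, true_eu):
    """The score is linear in realized utility, so its expectation
    only depends on the true expected utility of the chosen action."""
    return decision_score(reported_eu, true_eu)

# Hypothetical toy numbers: the decision maker's utilities over outcomes
# and the predictor's reported conditional distribution for each action.
utility = {"good": 1.0, "bad": 0.0}
reports = {
    "a1": {"good": 0.7, "bad": 0.3},
    "a2": {"good": 0.4, "bad": 0.6},
}

action = choose_action(reports, utility)
predicted = expected_utility(reports[action], utility)

# After acting, the decision maker only has to assess the utility of the
# single outcome that actually occurred and plug it into decision_score.
print(action, predicted, decision_score(predicted, utility["good"]))
```

In this sketch the expected score, as a function of the report m for a true expected utility mu, is exp(m) * (1 + mu - m), whose derivative exp(m) * (mu - m) vanishes only at m = mu, so misreporting the expected utility of the chosen action lowers the expected score. Because exp is increasing, an honest predictor also does better when the chosen action has higher true expected utility.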
One important difference, I suppose, is that in all the existing methods (like proper decision scoring rules) the decision maker needs to at some point assess her utility in a single outcome—the one obtained after choosing the recommended action—and reward the expert in proportion to that. In your approach one never needs to do this. However, in your approach one instead needs to look at a bunch of probability distributions and assess which one of these is best. Isn’t this much harder? (If you’re doing expected utility maximization—doesn’t your approach entail assigning probabilities to all hypothetical outcomes?) In realistic settings, these outcome distributions are huge objects!
I think, from an alignment perspective, having a human choose their action while being aware of the distribution over outcomes it induces is much safer than having it effectively chosen for them by their specification of a utility function. This is especially true because probability distributions are large objects. A human choosing between them isn’t pushing in any particular direction that can make it likely to overlook negative outcomes, while choosing based on the utility function they specify leads to exactly that. This is all modulo ELK, of course.
I’m not sure I understand the variant you proposed. How is that different than the Othman and Sandholm MAX rule?
>I’m not sure I understand the variant you proposed. How is that different than the Othman and Sandholm MAX rule?
Sorry if I was cryptic! Yes, it’s basically the same as using the MAX decision rule and (importantly) a quasi-strictly proper scoring rule (in their terminology, which is basically the same up to notation as a strictly proper decision scoring rule in the terminology of the decision scoring rules paper). (We changed the terminology for our paper because “quasi-strictly proper scoring rule w.r.t. the max decision rule” is a mouthful. :-P) Does that help?
>much safer than having it effectively chosen for them by their specification of a utility function
So, as I tried to explain before, one convenient thing about using proper decision scoring rules is that you do not need to specify your utility function. You just need to give rewards ex post. So one advantage of using proper decision scoring rules is that you need less of your utility function, not more! But on to the main point...
>I think, from an alignment perspective, having a human choose their action while being aware of the distribution over outcomes it induces is much safer than having it effectively chosen for them by their specification of a utility function. This is especially true because probability distributions are large objects. A human choosing between them isn’t pushing in any particular direction that can make it likely to overlook negative outcomes, while choosing based on the utility function they specify leads to exactly that. This is all modulo ELK, of course.
Let’s grant for now that from an alignment perspective the property you describe is desirable. My counterargument is that proper decision scoring rules (or the max decision rule with a scoring rule that is quasi-strictly proper w.r.t. the max decision rule) and zero-sum conditional prediction both have this property. Therefore, having the property cannot yield an argument to favor one over the other.
Maybe put differently: I still don’t know what property it is that you think favors zero-sum conditional prediction over proper decision scoring rules. I don’t think it can be not wanting to specify your utility function / not wanting the agent to pick actions based on their model of your utility function / wanting to instead choose yourself based on reported distributions, because both methods can be used in this way. Also, note that in both methods the predictors in practice have incentives that are determined by (their beliefs about) the human’s values. For example, in zero-sum conditional prediction, each predictor is incentivized to run computations to evaluate actions that it thinks could potentially be optimal w.r.t. human values, and not incentivized to think about actions that it confidently thinks are suboptimal. So for example, if I have the choice between eating chocolate ice cream, eating strawberry ice cream and eating mud, then the predictor will reason that I won’t choose to eat mud and that therefore its prediction about mud won’t be evaluated. Therefore, it will probably not think much about what it will be like if I eat mud (though it has to think about it a little to make sure that the other predictor can’t gain by recommending mud eating).
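The ice cream example can be sketched in Python, under stated assumptions. The sketch assumes that in zero-sum conditional prediction each predictor's reward is its own log score minus the other predictor's log score, evaluated only on the action actually taken; that reading of the setup is an assumption, not something spelled out in this thread. The function names, outcome labels, and probabilities are hypothetical.

```python
import math

def log_score(dist, outcome):
    """Logarithmic score of a reported distribution on the realized outcome."""
    return math.log(dist[outcome])

def zero_sum_scores(reports_a, reports_b, chosen_action, outcome):
    """Assumed zero-sum rule: each predictor gets its own score minus the
    other's, using only the predictions for the action that was taken."""
    s_a = log_score(reports_a[chosen_action], outcome)
    s_b = log_score(reports_b[chosen_action], outcome)
    return s_a - s_b, s_b - s_a

# Hypothetical reports: conditional outcome distributions per action.
reports_a = {
    "chocolate": {"happy": 0.8, "unhappy": 0.2},
    "strawberry": {"happy": 0.7, "unhappy": 0.3},
    "mud": {"happy": 0.01, "unhappy": 0.99},
}
reports_b = {
    "chocolate": {"happy": 0.6, "unhappy": 0.4},
    "strawberry": {"happy": 0.7, "unhappy": 0.3},
    "mud": {"happy": 0.02, "unhappy": 0.98},
}

# Predictor A spends less effort on mud and reports something sloppier.
careless_a = dict(reports_a, mud={"happy": 0.3, "unhappy": 0.7})

careful = zero_sum_scores(reports_a, reports_b, "chocolate", "happy")
careless = zero_sum_scores(careless_a, reports_b, "chocolate", "happy")
print(careful, careless)
```

When chocolate is the action taken, the two score pairs are identical, because the mud predictions never enter the score. The sketch does not model how the reports influence which action gets chosen, which is where the caveat about the other predictor recommending mud comes in.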
On whether the property is desirable [ETA: I here mean the property: [human chooses based on reported distribution] but not compared to [explicitly specifying a utility function]]: Perhaps my objection is just what you mean by ELK. In any case, I think my views depend a bit on how we imagine lots of different aspects of the overall alignment scheme. One important question, I think, is how exactly we imagine the human to “look at” the distributions, for example. But my worry is that (similar to RLHF) letting the human evaluate distributions rather than outcomes increases the predictors’ incentives to deceive the human. The incentive is to find actions whose distributions look good (in whatever format you represent the distribution) in relation to the other distributions, not actions whose distributions are good. Given that the distributions are so large (and less importantly because humans have lots of systematic, exploitable irrationalities related to risk), I would think that human judgment of single outcomes/point distributions is much better than human judgment of full distributions.