If q is not too low, then you can do this by taking a bunch of samples and evaluating them by expected utility. Of course, it might be expensive to evaluate this many samples.
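The sampling approach above can be sketched as follows. This is a minimal illustration, not anyone's actual implementation; `base_sample` and `utility` are hypothetical stand-ins for the base distribution and the expected-utility evaluator.

```python
import random

def quantilize_by_sampling(base_sample, utility, q):
    """Approximate a q-quantilizer by rejection sampling:
    draw about 1/q samples from the base distribution and
    return the best one by expected utility, which lands
    (in expectation) in the top q fraction of the base."""
    n = max(1, round(1 / q))  # cost grows as O(1/q)
    samples = [base_sample() for _ in range(n)]
    return max(samples, key=utility)
```

The `max(1, round(1 / q))` line is where the cost shows up: as q shrinks, the number of samples you must draw and evaluate grows as 1/q.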
I think that you can also do this with an adversarial game, as in your post on mimicry. You can have one system that takes some action, and another system that bets at some odds that the action was produced by the AI rather than the base distribution. This seems to work without learning the cost function.
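A toy version of that adversarial game might look like the sketch below. Everything here is a hypothetical setup for illustration: the "bettor" estimates the log-odds that an action came from the AI rather than the base distribution from observed counts, and the AI picks the action maximizing utility minus the bettor's bet. Note that no cost function over actions is ever specified or learned explicitly.

```python
import math
from collections import Counter

def adversarial_round(actions, base_counts, ai_counts, utility, penalty=1.0):
    """One round of a toy adversarial game: the bettor bets at odds
    derived from how often each action has come from the AI versus
    the base distribution; the AI maximizes utility minus the bet."""
    n = len(actions)

    def log_odds(a):
        # Laplace-smoothed empirical log-odds of "AI produced a"
        p_ai = (ai_counts[a] + 1) / (sum(ai_counts.values()) + n)
        p_base = (base_counts[a] + 1) / (sum(base_counts.values()) + n)
        return math.log(p_ai / p_base)

    return max(actions, key=lambda a: utility(a) - penalty * log_odds(a))
```

In a real instantiation both players would be learned systems updated over many rounds; the point of the sketch is only that the penalty term depends on distinguishability from the base distribution, not on any hand-specified cost function.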
I was imagining the case where O(1/q) samples is too slow, i.e. where we want the AI to actually perform a search.
The second paragraph is what I had in mind. Note that in this case you are maximizing over learnable cost functions rather than all cost functions.