In general it seems like there is no tractable algorithm that samples from the given space, or performs well on the given conservative cost maximization game. On top of that, it’s not clear how you would train a conservative-maximizer of this kind; e.g. even if you have a simple model class that contains a good quantilizer, you don’t have access to the objective and so can’t train your model in any normal way.
The most obvious solution to both problems (and a few others) is to maximize over efficient and learnable cost functions c rather than all cost functions. This almost immediately gives you a practical algorithm that we could begin to test (you can train both c and the quantilizer in parallel, e.g. by gradient descent). It also looks a lot closer to things that people do in practice; this is a sensible-looking adjustment to normal apprenticeship learning.
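To make that concrete, here is a minimal sketch of what training both in parallel could look like. Everything specific here is my own assumption rather than anything from the original discussion: the network shapes, the penalty weight LAMBDA, the toy utility, the stand-in base_sample, and the sigmoid head (which is just one way to keep the learned cost bounded and non-negative).

```python
import torch
import torch.nn as nn

ACTION_DIM, HIDDEN = 8, 64
Q = 0.1        # quantilizer fraction q
LAMBDA = 1.0   # weight on the conservative-cost penalty (assumed)

policy = nn.Sequential(nn.Linear(ACTION_DIM, HIDDEN), nn.ReLU(),
                       nn.Linear(HIDDEN, ACTION_DIM))        # the would-be quantilizer
cost = nn.Sequential(nn.Linear(ACTION_DIM, HIDDEN), nn.ReLU(),
                     nn.Linear(HIDDEN, 1), nn.Sigmoid())     # learnable cost c in [0, 1]

opt_pi = torch.optim.Adam(policy.parameters(), lr=1e-3)
opt_c = torch.optim.Adam(cost.parameters(), lr=1e-3)

def utility(a):                  # stand-in utility, assumed for the example
    return -(a - 1.0).pow(2).mean(dim=-1)

def base_sample(n):              # stand-in base distribution (e.g. imitation data)
    return torch.randn(n, ACTION_DIM)

for step in range(1000):
    a_pi = policy(torch.randn(64, ACTION_DIM))
    a_base = base_sample(64)

    # Cost player: find a bounded cost that the policy incurs much more often
    # than the base distribution does (relative to the 1/q allowance).
    c_loss = -(cost(a_pi.detach()).mean() - cost(a_base).mean() / Q)
    opt_c.zero_grad(); c_loss.backward(); opt_c.step()

    # Policy player: maximize utility minus the conservative-cost penalty,
    # i.e. pay for any cost in excess of 1/q times the base distribution's.
    a_pi = policy(torch.randn(64, ACTION_DIM))
    penalty = (cost(a_pi).mean() - cost(base_sample(64)).mean() / Q).clamp(min=0)
    pi_loss = -utility(a_pi).mean() + LAMBDA * penalty
    opt_pi.zero_grad(); pi_loss.backward(); opt_pi.step()
```

The cost player looks for a bounded cost that the policy incurs far more often than the base distribution, and the policy pays for any excess over the 1/q allowance; that is the conservative cost maximization game restricted to costs the cost network can represent.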
If q is not too low, then you can do this by taking a bunch of samples (on the order of 1/q of them) from the base distribution and evaluating them by expected utility. Of course, it might be expensive to evaluate that many samples.
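For reference, the brute-force version is straightforward when that many samples are affordable. This sketch assumes we can call a base_sample function and a per-action expected_utility estimate, both hypothetical stand-ins:

```python
import random

def quantilize(base_sample, expected_utility, q, n_samples=None):
    """Sample (approximately) uniformly from the top q fraction of the base
    distribution, ranked by expected utility."""
    if n_samples is None:
        n_samples = max(int(10 / q), 10)   # on the order of 1/q samples
    actions = [base_sample() for _ in range(n_samples)]
    actions.sort(key=expected_utility, reverse=True)
    top = actions[: max(1, int(q * n_samples))]
    return random.choice(top)

# Toy usage with stand-in functions:
# quantilize(lambda: random.gauss(0, 1), lambda a: -abs(a - 1), q=0.1)
```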
I think that you can also do this with an adversarial game, as in your post on mimicry. You can have one system that takes some action, and another system that bets at some odds that the action was produced by the AI rather than the base distribution. This seems to work without learning the cost function.
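One way to cash that out (again only a sketch, with all the constants and network shapes assumed for illustration): read the bettor's output as log-odds that an action came from the AI rather than the base distribution, and charge the policy whenever those log-odds exceed log(1/q), i.e. whenever it becomes too distinguishable from the base distribution.

```python
import math
import torch
import torch.nn as nn

ACTION_DIM, HIDDEN, Q = 8, 64, 0.1

policy = nn.Sequential(nn.Linear(ACTION_DIM, HIDDEN), nn.ReLU(),
                       nn.Linear(HIDDEN, ACTION_DIM))
bettor = nn.Sequential(nn.Linear(ACTION_DIM, HIDDEN), nn.ReLU(),
                       nn.Linear(HIDDEN, 1))   # log-odds: AI vs. base

opt_pi = torch.optim.Adam(policy.parameters(), lr=1e-3)
opt_b = torch.optim.Adam(bettor.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def utility(a):                  # stand-in utility, assumed for the example
    return -(a - 1.0).pow(2).mean(dim=-1)

def base_sample(n):              # stand-in base distribution
    return torch.randn(n, ACTION_DIM)

for step in range(1000):
    a_pi = policy(torch.randn(64, ACTION_DIM))
    a_base = base_sample(64)

    # Bettor: learn to tell the AI's actions apart from the base distribution's.
    b_loss = (bce(bettor(a_pi.detach()), torch.ones(64, 1))
              + bce(bettor(a_base), torch.zeros(64, 1)))
    opt_b.zero_grad(); b_loss.backward(); opt_b.step()

    # Policy: maximize utility, but pay whenever the bettor's log-odds exceed
    # log(1/q), i.e. whenever the policy is too distinguishable from the base.
    a_pi = policy(torch.randn(64, ACTION_DIM))
    excess = (bettor(a_pi) - math.log(1 / Q)).clamp(min=0).mean()
    pi_loss = -utility(a_pi).mean() + excess
    opt_pi.zero_grad(); pi_loss.backward(); opt_pi.step()
```

Here nothing plays the role of an explicit cost function; the bettor only ever learns to distinguish the two distributions.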
I was imagining the case where O(1/q) is too slow, i.e. where we want the AI to actually perform a search.
The second paragraph is what I had in mind. Note that in this case you are maximizing over learnable cost functions rather than all cost functions.