Yes, this is fine to do, and it prevents single-shot problems if your picture of the distribution over outcomes is that most disastrous risk comes from edge cases that get a 99.99%ile score but are actually bad, while actions at the 99th percentile are all we need.
This is fine if you want your AI to stack blocks on top of other blocks or something.
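For concreteness, here's a minimal quantilizer sketch in Python. The toy action space and proxy score are stand-ins I'm making up for illustration; the point is just that you sample from the base "human" distribution and pick from the top q fraction, rather than optimizing your way into the 99.99%ile tail:

```python
import random

def quantilize(sample_action, score, q=0.01, n=10_000):
    """Sample n actions from the base distribution, keep the top q fraction
    by proxy score, and return one of those uniformly at random."""
    candidates = [sample_action() for _ in range(n)]
    candidates.sort(key=score, reverse=True)
    top = candidates[: max(1, int(q * n))]
    return random.choice(top)

# Hypothetical usage: actions are numbers in [0, 1] and the proxy score is
# the number itself, so the result lands around the 99%ile of the base
# distribution instead of being pushed out to the extreme tail.
print(quantilize(sample_action=random.random, score=lambda a: a))
```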
But unfortunately, when you want to use a quantilizer to do something outside the normal human distribution, like curing cancer or supervising the training of a superhuman AI, you're no longer just shooting for a 99%ile policy. You want the AI to do something unlikely, so of course you can't simultaneously restrict it to only doing likely things.
Now, you can construct a policy where each individual action is exactly 99%ile on some distribution of human actions, but the results quickly go off-distribution and end up somewhere new: chaining together 100 99%ile actions is not the same as picking a 99%ile policy. That means you don't get the straightforward quantilizer-style guarantee, but maybe it's not all bad. After all, we wanted to go off distribution to cure cancer, and maybe by taking actions with some kind of moderation we get some other kind of benefit we don't quite understand yet. I'm reminded of Peli Grietzer's recent post on virtue ethics.
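To make the chaining point concrete, here's a toy illustration of my own construction (a 1-D random walk standing in for an action sequence, nothing from the original post): every step is only 99%ile for a single human step, yet the 100-step trajectory ends up far outside the distribution of human trajectories.

```python
import random
import statistics

STEP_99 = 2.326  # approx. 99th percentile of a standard normal step

# Typical human trajectories: 100 steps drawn from N(0, 1) each.
human_endpoints = [sum(random.gauss(0, 1) for _ in range(100))
                   for _ in range(10_000)]

# A policy that always takes the 99%ile step.
chained_endpoint = 100 * STEP_99

mu = statistics.mean(human_endpoints)
sigma = statistics.stdev(human_endpoints)
print(f"human trajectories end near {mu:.1f} +/- {sigma:.1f}")
print(f"chained 99%ile steps end near {chained_endpoint:.1f} "
      f"({(chained_endpoint - mu) / sigma:.0f} sigma out)")
```

Each step looks individually moderate, but the endpoint sits tens of standard deviations outside where human trajectories land, which is the sense in which per-action quantilization doesn't give you a per-policy guarantee.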