> choose randomly among the top 0<q≤1 proportion of actions/policies
That requires being able to rank actions/policies, which in practice means first reducing each one to a single value on some scale (technically a bare ordering would be enough, but every sane method of consistently ordering all possible policies is going to reduce each policy to a single value and then sort by that value).
So… if there are critical values, they should show up as a large gap in the values that you are using to sort the policies.
Which brings us to the main problem: ranking policies is a hard, unsolved problem that so far has only been reduced to itself.
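
To make the selection step concrete, here is a minimal sketch in Python. It assumes the hard part is already done, i.e. that some `score` function reducing each policy to a single number exists. The names `quantilize`, `largest_gap`, and the toy `scores` table are made up for illustration and do not come from the quoted proposal.

```python
import math
import random


def quantilize(policies, score, q, rng=random):
    """Pick uniformly at random among the top-q fraction of policies by score."""
    if not 0 < q <= 1:
        raise ValueError("q must satisfy 0 < q <= 1")
    if not policies:
        raise ValueError("need at least one policy")
    # Reduce each policy to a single value, then sort by that value.
    ranked = sorted(policies, key=score, reverse=True)
    # Always keep at least one candidate, even for very small q.
    k = max(1, math.ceil(q * len(ranked)))
    return rng.choice(ranked[:k])


def largest_gap(policies, score):
    """Return (gap, lower, upper) for the biggest jump between adjacent sorted scores."""
    values = sorted(score(p) for p in policies)
    if len(values) < 2:
        raise ValueError("need at least two policies to measure a gap")
    return max((b - a, a, b) for a, b in zip(values, values[1:]))


# Hypothetical toy scores; a real scoring function is exactly the unsolved part.
scores = {"a": 0.25, "b": 0.5, "c": 9.0, "d": 9.5}

choice = quantilize(list(scores), scores.get, q=0.5)
gap, lower, upper = largest_gap(list(scores), scores.get)
```

With these toy numbers the top half is `c` and `d`, and the largest gap sits between 0.5 and 9.0, which is where a "critical value" would be visible if the scores meant anything.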