If we can’t observe O, we could always just guess a particular value of O and then do whatever’s optimal for that value. Then with probability P[O], we’ll be right, and our performance is lower bounded by P[O]*(whatever optimization pressure we’re able to apply if we guess correctly).
The log-number of different policies bounds the log-number of different outcome-distributions we can achieve. And observing one additional bit doubles the log-number of different policies.
Here are two intuitive arguments:
If we can’t observe O, we could always just guess a particular value of O and then do whatever’s optimal for that value. Then with probability P[O], we’ll be right, and our performance is lower bounded by P[O]*(whatever optimization pressure we’re able to apply if we guess correctly).
The log-number of different policies bounds the log-number of different outcome-distributions we can achieve. And observing one additional bit doubles the log-number of different policies.