(As the density of the differential of the monotonic function, in the vicinity of the actual envelope contents, goes to zero, the expected benefit of the algorithm over random chance, goes to zero.)
I don’t understand that statement. Let’s define the function f(x) as “1 if x<3, otherwise 0” (add a suitable fix for the discontinuity at 3 if you’re pedantic), and the distribution of envelopes as “put 2 dollars in one envelope and 4 in the other” (smear it around a bit if you don’t like discrete distributions). I’m not sure what Eliezer means by “density of the differential”, but df/dx=0 in the vicinity of both x=2 and x=4, yet the expected benefit of the algorithm is 1 dollar. Or maybe I’m missing something?
It seems that Eliezer’s point is not best expressed in terms of differentials. The benifet of the algorithm depends on the difference in the monotonic function between the values of each the envelopes. There is only a finite amount of difference available to distribute over the domain, of infinite measure, of the function, and for any epsilon greater than 0, for all but a finite measure of the domain, the difference in the function between possible pairs of values for the envelopes will be less than epsilon. So if you really have no information constraining the distribution of values for the envelopes to use to allocate the difference in function effectively, the expected benifet of the algorithm is 0.
Edit: Actually, that argument doesn’t hold on its own, because the value of switching to the bigger envelope isn’t constant, it depends on the size of the envelope. However, in the problem as stated, I don’t think the value of switching grows fast enough. I need to think about it more.
Assume the envelope contents are distributed arbitrarily on [A,+infinity) where A is some large number. Let f(x)=1/x (the values for x<1 don’t matter). Then the expected benefit of Thrun’s algorithm is always 1⁄4, even though the difference in f(x) between the values of any two envelopes is less than 1/A. To convince yourself of that, work out the proof yourself or run a computational experiment:
from random import random
n = 100000
thrun = 0
chance = 0
for i in range(n):
smaller = 2+3*random() # replace this line with anything
selected,other = smaller,2*smaller
if random() > 0.5:
selected,other = other,selected
chance += (selected if (random() > 0.5) else other)
thrun += (selected if (random() > 1/selected) else other)
print (thrun-chance)/n
You’re not missing anything, but the variance is quite high, you’ll need many more samples. Or you could try writing a program that converges faster, I’m a total newbie at these things.
I tried to prove it, and I am getting an expected benifet of 1⁄4, and I am worried that I am making a mistake related to reording the terms of a non absolutely converging series.
If you are right, you could do better with f(x) = 2/x, and the values for x < 2 don’t matter.
Read as: “as the slope goes to zero”. That is, if we don’t know where the dollar values are being selected from in the range from zero to infinity, we don’t know where to put the part of the function that has a significant slope, and so the average benefit is about zero (for example, your function is useless if both numbers are greater than three or less than three, and most pairs of natural numbers are both greater than three). Doing better requires making assumptions about how much money is likely to be in the envelopes.
With all due respect, that’s not quite the sort of answer I want… Eliezer made a sharp mathematical statement, I made another. Can you reformulate your comment as a sharp mathematical statement too, so I can see if it’s a fair rephrasing of Eliezer’s statement?
I apologize! My point is—you can’t calculate the “expected benefit” of the algorithm after making an assumption about the distribution of the money. The expected benefit is the average benefit of the algorithm over all possible distributions (weighted by likelihood). Assuming we know nothing about the likelihood of specific distributions, the probability that there is a difference between f(x1) and f(x2) is ~0. By “density of the differential”, I believe he is referring to the fact that in order for there to be a difference between those two numbers, f ′ will have to be “dense” on some part of the interval between them. Since we don’t know where that is, we can’t decide where to concentrate the function’s differential. Is this sufficiently formal / sensible?
That’s… still not very precise, but I also apologize, because I can’t really explain how to think precisely. I was kinda born this way and have no idea of the typical difficulties. Maybe you could look at my reply to JGWeissman as an example of what can be said precisely about the problem? It shows that Thrun’s algorithm can give a large (constant) benefit even if values of f(x) have arbitrarily small differences. That seems to contradict your comment, but I can’t really tell :-)
It’s not that I don’t know how to think precisely about math! I was attempting to translate a fairly vague concept. Though I think I was in error about the selection of dollar values, I had overlooked that one is always twice the other, it sounded like Eliezer was implying they were close to each other. So I’m not certain that my interpretation of what Eliezer was saying is a correct analysis of the problem (in fact no matter what he was actually saying, it has to have been an incorrect analysis of the problem given that you have provided a counterexample to his conclusion, yes?), though it still looks to me as though he was saying that the integral of df/dx in some region about “the vicinity of the actual envelope contents” is expected to be low (though I, and I suspect he, would use the integral of df/dx on the interval from x to 2x).
The argument makes sense if you assign utility 1 to choosing correctly and utility 0 to choosing incorrectly, I think. And in fact, rereading the post, this seems to be the intent!
It also still looks to me that your statement “the expected benefit of the algorithm is 1 dollar” above uses an incorrect interpretation of “expected benefit” (I think the expected benefit of the algorithm using that function is 0).
I took a look at your reply to JGWeissman, and yep, it looks like the algorithm works, which is neat! I’d contest the description of 25 cents as a large constant benefit, but certainly there are ways to increase it (how high, I wonder?).
Thus—my interpretation of Eliezer’s argument holds that you can’t get above an expected 50% chance of picking the better envelope, but that example does show that you can get higher expected value than 3⁄2 x. Therefore, no contradiction?
The expected benefit is 25 cents if the minimum possible contents of an envelope is 1 dollar. More generally, if envelopes are guaranteed to contain more than 100 dollars, then using f(x)=100/x yields expected benefit of 25 dollars, and so on. Also, Eliezer is right that a randomized algorithm can’t be Bayesian-optimal in this problem, so for any given prior there’s a deterministic algorithm with even higher expected benefit, I guess you can work it out.
Eliezer is right that a randomized algorithm can’t be Bayesian-optimal in this problem, so for any given prior there’s a deterministic algorithm with even higher expected benefit
I think the claim is that there must a deterministic algorithm that does at least as well as the randomized algorithm. The randomized algorithm is allowed to be optimal, just not uniquely optimal.
As for the deterministic variant, as you’d need some distribution from which the value of x is being selected, I’m not sure how best to calculate the EV of any particular scheme (whereas the nondeterministic algorithm sidesteps this by allowing a calculation of EV after X is selected). With a good prior, yeah, it’d be pretty simple, but without it becomes GAI-complete, yeah?
1⁄4 of the smallest possible amount you could win doesn’t count as a large constant benefit in my view of things, but that’s a bit of a nitpick. In any case, what do you think about the rest of the post?
Oh, sorry, just noticed the last part of your comment. It seems wrong, you can get a higher than 50% chance of picking the better envelope. The degenerate case is if you already know there’s only one possibility, e.g. 1 dollar in one envelope and 2 dollars in the other. If you open the envelope and see 1 dollar, then you know you must switch, so you get the better envelope with probability 100%. You can get more fuzzy cases by sort of smearing out the distribution of envelopes continuously, starting from that degenerate case, and using the f(x) strategy. The chance of picking the better envelope will fall below 100%, but I think it will stay above 50%. Do you want my opinion on anything else? :-)
That’s definitely cheating! We don’t have access to the means by which X is generated. In the absence of a stated distribution, can we still do better than 50%?
Well, Thrun’s algorithm does better than 50% for every distribution. But no matter what f(x) we choose, there will always be distributions that make the chance arbitrarily close to 50% (say, less than 50%+epsilon). To see why, note that for a given f(x) we can construct a distribution far enough from zero that all values of f(x) are less than epsilon, so the chance of switching prescribed by the algorithm is also less than epsilon.
The next question is whether we can find any other randomized algorithm that does better than 50%+epsilon on any distribution. The answer to that is also no.
1) Note that any randomized algorithm must decide whether to switch or not, based on the contents of the envelope and possibly a random number generator. In other words, it must be described by a function f(x) like in Thrun’s algorithm. f(x) doesn’t have to be monotonic, but must lie between 0 and 1 inclusive for every x.
2) For every such function f(x), we will construct a distribution of envelopes that makes it do worse than 50%+epsilon.
3) Let’s consider for each number x the degenerate distribution D_x that always puts x dollars in one envelope and 2*x in the other.
4) To make the algorithm do worse than 50%+epsilon on distribution D_x, we need the chance of switching at 2*x to be not much lower than the chance of switching at x. Namely, we need the condition f(2*x)>f(x)-epsilon.
5) Now we only need to prove that there exists an x such that f(2*x)>f(x)-epsilon. We will prove that by reductio ad absurdum. If we had f(2*x)≤f(x)-epsilon for every x, we could iterate that and obtain f(x)≥f(x*2^n)+n*epsilon for every x and n, which would make f(x) greater than 1. Contradiction, QED.
Yes, that all looks sensible. The point I’m trying to get at—the one I think Eliezer was gesturing towards—was that for any f and any epsilon, f(x) - f(2x) < epsilon for almost all x, in the formal sense. The next step is less straightforward—does it then follow that, prior to the selection of x, our expectation for getting the right answer is 50%? This seems to be Eliezer’s implication. However, it seems also to rest on an infinite uniform random distribution, which I understand can be problematic. Or have I misunderstood?
He didn’t say that, he said the benefit gets closer and closer to zero if you modify the setup in a certain way. I couldn’t find an interpretation that makes his statement correct, but at least it’s meaningful.
the algorithm did make use of prior knowledge about the envelope distribution. (As the density of the differential of the monotonic function, in the vicinity of the actual envelope contents, goes to zero, the expected benefit of the algorithm over random chance, goes to zero.)
without meaning that the expected density of the differential does go to zero—or perhaps would go to zero barring some particular prior knowledge about the envelope distribution. And that doesn’t sound like “modifying the setup” to me, that seems like it would make the statement irrelevant. What exactly is the “modification”, and what did you decide his statement really means, if you don’t mind?
Sorry, are you familiar with the mathematical concept of limit? Saying that “f(x) goes to zero as x goes to zero” does not imply the nonsensical belief that “x goes to zero”.
Yes, I am familiar with limits. What I mean is—if you say “f(x) goes to zero as x goes to zero”, then you are implying (in a non-mathematical sense) that we are evaluating f(x) in a region about zero—that is, we are interested in the behavior of f(x) close to x=0.
Edit: More to the point, if I say “g(f(x)) goes to zero as f(x) goes to infinity”, then f(x) better not be (known to be) bounded above.
I don’t understand that statement. Let’s define the function f(x) as “1 if x<3, otherwise 0” (add a suitable fix for the discontinuity at 3 if you’re pedantic), and the distribution of envelopes as “put 2 dollars in one envelope and 4 in the other” (smear it around a bit if you don’t like discrete distributions). I’m not sure what Eliezer means by “density of the differential”, but df/dx=0 in the vicinity of both x=2 and x=4, yet the expected benefit of the algorithm is 1 dollar. Or maybe I’m missing something?
It seems that Eliezer’s point is not best expressed in terms of differentials. The benifet of the algorithm depends on the difference in the monotonic function between the values of each the envelopes. There is only a finite amount of difference available to distribute over the domain, of infinite measure, of the function, and for any epsilon greater than 0, for all but a finite measure of the domain, the difference in the function between possible pairs of values for the envelopes will be less than epsilon. So if you really have no information constraining the distribution of values for the envelopes to use to allocate the difference in function effectively, the expected benifet of the algorithm is 0.
Edit: Actually, that argument doesn’t hold on its own, because the value of switching to the bigger envelope isn’t constant, it depends on the size of the envelope. However, in the problem as stated, I don’t think the value of switching grows fast enough. I need to think about it more.
Assume the envelope contents are distributed arbitrarily on [A,+infinity) where A is some large number. Let f(x)=1/x (the values for x<1 don’t matter). Then the expected benefit of Thrun’s algorithm is always 1⁄4, even though the difference in f(x) between the values of any two envelopes is less than 1/A. To convince yourself of that, work out the proof yourself or run a computational experiment:
Replacing “3” with “10000″ gets me varying results, mostly negative at a glance. What am I missing?
You’re not missing anything, but the variance is quite high, you’ll need many more samples. Or you could try writing a program that converges faster, I’m a total newbie at these things.
I tried to prove it, and I am getting an expected benifet of 1⁄4, and I am worried that I am making a mistake related to reording the terms of a non absolutely converging series.
If you are right, you could do better with f(x) = 2/x, and the values for x < 2 don’t matter.
Sorry, I mistyped the comment, fixed it immediately but I guess you saw the old version. It’s 1⁄4 of course.
I’m not sure what remains of Eliezer’s original point now...
Ah, that old confounder in the rhythm of disagreement: a smart person who appears to disagree with you might not have said what they meant to say. ;)
Read as: “as the slope goes to zero”. That is, if we don’t know where the dollar values are being selected from in the range from zero to infinity, we don’t know where to put the part of the function that has a significant slope, and so the average benefit is about zero (for example, your function is useless if both numbers are greater than three or less than three, and most pairs of natural numbers are both greater than three). Doing better requires making assumptions about how much money is likely to be in the envelopes.
With all due respect, that’s not quite the sort of answer I want… Eliezer made a sharp mathematical statement, I made another. Can you reformulate your comment as a sharp mathematical statement too, so I can see if it’s a fair rephrasing of Eliezer’s statement?
I apologize! My point is—you can’t calculate the “expected benefit” of the algorithm after making an assumption about the distribution of the money. The expected benefit is the average benefit of the algorithm over all possible distributions (weighted by likelihood). Assuming we know nothing about the likelihood of specific distributions, the probability that there is a difference between f(x1) and f(x2) is ~0. By “density of the differential”, I believe he is referring to the fact that in order for there to be a difference between those two numbers, f ′ will have to be “dense” on some part of the interval between them. Since we don’t know where that is, we can’t decide where to concentrate the function’s differential. Is this sufficiently formal / sensible?
That’s… still not very precise, but I also apologize, because I can’t really explain how to think precisely. I was kinda born this way and have no idea of the typical difficulties. Maybe you could look at my reply to JGWeissman as an example of what can be said precisely about the problem? It shows that Thrun’s algorithm can give a large (constant) benefit even if values of f(x) have arbitrarily small differences. That seems to contradict your comment, but I can’t really tell :-)
It’s not that I don’t know how to think precisely about math! I was attempting to translate a fairly vague concept. Though I think I was in error about the selection of dollar values, I had overlooked that one is always twice the other, it sounded like Eliezer was implying they were close to each other. So I’m not certain that my interpretation of what Eliezer was saying is a correct analysis of the problem (in fact no matter what he was actually saying, it has to have been an incorrect analysis of the problem given that you have provided a counterexample to his conclusion, yes?), though it still looks to me as though he was saying that the integral of df/dx in some region about “the vicinity of the actual envelope contents” is expected to be low (though I, and I suspect he, would use the integral of df/dx on the interval from x to 2x).
The argument makes sense if you assign utility 1 to choosing correctly and utility 0 to choosing incorrectly, I think. And in fact, rereading the post, this seems to be the intent!
It also still looks to me that your statement “the expected benefit of the algorithm is 1 dollar” above uses an incorrect interpretation of “expected benefit” (I think the expected benefit of the algorithm using that function is 0).
I took a look at your reply to JGWeissman, and yep, it looks like the algorithm works, which is neat! I’d contest the description of 25 cents as a large constant benefit, but certainly there are ways to increase it (how high, I wonder?).
Thus—my interpretation of Eliezer’s argument holds that you can’t get above an expected 50% chance of picking the better envelope, but that example does show that you can get higher expected value than 3⁄2 x. Therefore, no contradiction?
The expected benefit is 25 cents if the minimum possible contents of an envelope is 1 dollar. More generally, if envelopes are guaranteed to contain more than 100 dollars, then using f(x)=100/x yields expected benefit of 25 dollars, and so on. Also, Eliezer is right that a randomized algorithm can’t be Bayesian-optimal in this problem, so for any given prior there’s a deterministic algorithm with even higher expected benefit, I guess you can work it out.
I think the claim is that there must a deterministic algorithm that does at least as well as the randomized algorithm. The randomized algorithm is allowed to be optimal, just not uniquely optimal.
Oh. Sorry. You’re right.
As for the deterministic variant, as you’d need some distribution from which the value of x is being selected, I’m not sure how best to calculate the EV of any particular scheme (whereas the nondeterministic algorithm sidesteps this by allowing a calculation of EV after X is selected). With a good prior, yeah, it’d be pretty simple, but without it becomes GAI-complete, yeah?
1⁄4 of the smallest possible amount you could win doesn’t count as a large constant benefit in my view of things, but that’s a bit of a nitpick. In any case, what do you think about the rest of the post?
Oh, sorry, just noticed the last part of your comment. It seems wrong, you can get a higher than 50% chance of picking the better envelope. The degenerate case is if you already know there’s only one possibility, e.g. 1 dollar in one envelope and 2 dollars in the other. If you open the envelope and see 1 dollar, then you know you must switch, so you get the better envelope with probability 100%. You can get more fuzzy cases by sort of smearing out the distribution of envelopes continuously, starting from that degenerate case, and using the f(x) strategy. The chance of picking the better envelope will fall below 100%, but I think it will stay above 50%. Do you want my opinion on anything else? :-)
That’s definitely cheating! We don’t have access to the means by which X is generated. In the absence of a stated distribution, can we still do better than 50%?
Well, Thrun’s algorithm does better than 50% for every distribution. But no matter what f(x) we choose, there will always be distributions that make the chance arbitrarily close to 50% (say, less than 50%+epsilon). To see why, note that for a given f(x) we can construct a distribution far enough from zero that all values of f(x) are less than epsilon, so the chance of switching prescribed by the algorithm is also less than epsilon.
The next question is whether we can find any other randomized algorithm that does better than 50%+epsilon on any distribution. The answer to that is also no.
1) Note that any randomized algorithm must decide whether to switch or not, based on the contents of the envelope and possibly a random number generator. In other words, it must be described by a function f(x) like in Thrun’s algorithm. f(x) doesn’t have to be monotonic, but must lie between 0 and 1 inclusive for every x.
2) For every such function f(x), we will construct a distribution of envelopes that makes it do worse than 50%+epsilon.
3) Let’s consider for each number x the degenerate distribution D_x that always puts x dollars in one envelope and 2*x in the other.
4) To make the algorithm do worse than 50%+epsilon on distribution D_x, we need the chance of switching at 2*x to be not much lower than the chance of switching at x. Namely, we need the condition f(2*x)>f(x)-epsilon.
5) Now we only need to prove that there exists an x such that f(2*x)>f(x)-epsilon. We will prove that by reductio ad absurdum. If we had f(2*x)≤f(x)-epsilon for every x, we could iterate that and obtain f(x)≥f(x*2^n)+n*epsilon for every x and n, which would make f(x) greater than 1. Contradiction, QED.
Yes, that all looks sensible. The point I’m trying to get at—the one I think Eliezer was gesturing towards—was that for any f and any epsilon, f(x) - f(2x) < epsilon for almost all x, in the formal sense. The next step is less straightforward—does it then follow that, prior to the selection of x, our expectation for getting the right answer is 50%? This seems to be Eliezer’s implication. However, it seems also to rest on an infinite uniform random distribution, which I understand can be problematic. Or have I misunderstood?
That’s called an improper prior. Eliezer mentions in the post that it was his first idea, but turned out to be irrelevant to the analysis.
So I guess we’re back to square one, then.
I don’t understand. Which part are you still confused about? To me the whole thing seems quite clear.
How did Eliezer determine that the expected benefit of the algorithm over random chance is zero?
He didn’t say that, he said the benefit gets closer and closer to zero if you modify the setup in a certain way. I couldn’t find an interpretation that makes his statement correct, but at least it’s meaningful.
I don’t get why it makes sense to say
without meaning that the expected density of the differential does go to zero—or perhaps would go to zero barring some particular prior knowledge about the envelope distribution. And that doesn’t sound like “modifying the setup” to me, that seems like it would make the statement irrelevant. What exactly is the “modification”, and what did you decide his statement really means, if you don’t mind?
Sorry, are you familiar with the mathematical concept of limit? Saying that “f(x) goes to zero as x goes to zero” does not imply the nonsensical belief that “x goes to zero”.
Yes, I am familiar with limits. What I mean is—if you say “f(x) goes to zero as x goes to zero”, then you are implying (in a non-mathematical sense) that we are evaluating f(x) in a region about zero—that is, we are interested in the behavior of f(x) close to x=0.
Edit: More to the point, if I say “g(f(x)) goes to zero as f(x) goes to infinity”, then f(x) better not be (known to be) bounded above.