I wrote this with the assumption that Bob would care about maximizing his money at the end, and that there would be a high but not infinite number of rounds.
On my view, your questions mostly don’t change the analysis much. The only difference I can see is that if he literally only cares about beating Alice, he should go all in. In that case, having $1 less than Alice is equivalent to having $0. That’s not really how people use money though, and seems pretty artificial.
How are you expecting these answers to change things?
If Bob wants to maximise his money at the end, then he really should bet it all every round. I don’t see why you would want to use Kelly rather than maximising expected utility. Not maximising expected utility means that you expect to get less utility.
Well put. I agree that we should try to maximize the value that we expect to have after playing the game.
My claim here is that just because a statistic is named “expected value” doesn’t mean it’s accurately representing what we expect to happen in all types of situations. In Alice’s game, which is ergodic, traditional ensemble-averaging based expected value is highly accurate. The more tickets Alice buys, the more her actual value converges to the expected value.
In Bob’s game, which is non-ergodic, ensemble-based expected value is a poor statistic. It doesn’t actually predict the value that he would have. There’s no convergence between Bob’s value and “expected value”, so it seems strange to say that Bob “expects to get” the result of the ensemble average here.
You can certainly calculate Bob’s ensemble average, and it will have a higher result than the temporal average (as I state in my post). My claim is that this doesn’t help you, because it’s not representative of Bob’s game at all. In those situations, maximizing temporal average is the best you can do in reality, and the Kelly criterion maximizes that. Trying to maximize ensemble-based expected value here will wipe you out.
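To make the ensemble-versus-temporal distinction concrete, here is a minimal sketch. It assumes the 60/40 even-money coin flip discussed later in the thread; the function names and the 1% grid of bet fractions are illustrative choices, not anything from the original post.

```python
import math

def ensemble_growth(f, p=0.6):
    """Expected (ensemble-average) wealth multiplier per round when staking fraction f."""
    return p * (1 + f) + (1 - p) * (1 - f)

def time_average_growth(f, p=0.6):
    """Geometric-mean wealth multiplier per round, i.e. exp(E[log multiplier])."""
    if f >= 1.0:
        return 0.0  # staking everything means a single loss wipes out the bankroll
    return math.exp(p * math.log(1 + f) + (1 - p) * math.log(1 - f))

# Candidate bet fractions 0%, 1%, ..., 100% of the bankroll.
fractions = [i / 100 for i in range(101)]
best_ensemble = max(fractions, key=ensemble_growth)   # all-in
best_time = max(fractions, key=time_average_growth)   # the Kelly fraction 2p - 1
```

Under these assumptions the ensemble average is maximized by staking everything, while the temporal average is maximized at the Kelly fraction of 20%, which is the gap the paragraph above describes.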
The problem with maximising expected utility is that Bob will sit there playing 1 more round, then another 1 more round, again and again until he eventually loses everything. Each step maximised the expected utility, but the policy overall guarantees zero utility with certainty, assuming Bob never runs out of time.
But, even as utility-maximising-Bob is saved from self-destruction by the clock, he shall think to himself “damn it! Out of time. That is really annoying, I want to keep doing this bet”.
At least to me Kelly betting fits in the same kind of space as the Newcomb paradox and (possibly) the prisoner’s dilemma. They all demonstrate that the optimal policy is not necessarily given by a sequence of optimal actions at every step.
Ignoring infinities, do you have the same objection to a game with a limit of 100 rounds? Utility-maximizing Bob will bet all his money 100 times, and lose all of it with probability around 1 − 10^-24, and he’ll endorse that because one time in 10^24 he is raking it in to the tune of 10^32 dollars or something. If you try to stop him he’ll be justly annoyed because you’re not letting him maximize his utility function.
Do you think that’s a problem for expected utility maximization? If so, it seems to me that your objection isn’t “optimal policy doesn’t come from optimal actions”. (At any rate I think that would be a bad objection, because optimal policy for this utility function does come from optimal actions at each step.) Rather, it seems to me that your objection is you don’t really believe Bob has that utility function.
Which, of course he doesn’t! No one has a utility function like that (or, indeed, at all). And I think that’s important to realize. But it’s a different objection, and I think that’s important to realize too.
Yes, I completely agree that the main reason in real life we would recommend against that strategy is that we instinctively (and usually correctly) feel that the person’s utility function is sub-linear in money, so that 10^32 dollars with probability 10^-24 is bad. Obviously if 10^32 dollars is needed to cure some disease that will otherwise kill them immediately, that changes things.
But there is an objection that I think runs somewhat separately to that, which is the round limit. If we are operating under an optimal, reasonable policy, then (outside commitment tactic negotiations) I think it shouldn’t really be possible for a new outside constraint to improve our performance. Because if the constraint does improve performance then we could have adopted that constraint voluntarily and our policy was therefore not optimal. And the N-round limit is doing a fairly important job at improving Bob’s performance in this hypothetical. Otherwise Bob’s strategy is equivalent to “I bet everything, every time, until I lose it all.” Perhaps this second objection is just the old one in a new disguise (any agent with a finitely-bounded utility function would eventually reach a round number where they decide “actually I have enough now”, and thus restore my sense of what should be), but I am not sure that it is exactly the same.
Oh, I don’t think the round limit is fundamental here, I just don’t like infinities :p
At time zero, you can show Bob a bunch of probability distributions for his money at some finite time t, corresponding to betting strategies, and ask which he’d prefer. And his answer will always be that his favorite distribution is the one corresponding to “bet everything every time”. And when it gets to time t, Bob is almost certainly broke, but not actually regretting his decisions in the sense of “knowing what I knew then I could have done better”.
If we take the limit as t→∞… I’m not really sure this is a meaningful thing to do. I guess we could take the pointwise limit and see that the resulting function is 1 at 0 and 0 everywhere else, which is indeed a probability distribution we don’t like. But if we take the pointwise limit of the Kelly strategy, it’s 0 everywhere, which isn’t even a probability distribution. I don’t think we should use that as a reason to prefer the Kelly strategy. Maybe there are other limits we can take? (I’ve forgotten a lot of what I used to know.) But mostly I think this is a weird thing to try to do.
If we’re not taking the limit, if we just say Bob can play as long as he wants, then yes, he just keeps playing until he goes broke. But he endorses that behavior. There’s no point where he looks back and goes “I was an idiot”.
One thing I’d say here is that we don’t sum up or compare utilities at different times. Like, it would be tempting to say “with probability 1, Bob will go broke. And however much money he had at the time, with probability 1, his alter ego Kelly-Betting Bob will eventually have more money than that. So Bob would prefer to be Kelly-Betting Bob”. But that last sentence doesn’t hold; Bob knows that in the event he’d managed to stick it out that long, his wealth would so vastly dwarf Kelly-Betting Bob’s that it was worth the risks he took.
I understand your point, and I think I am sort of convinced. But it’s the sort of thing where minor details in the model can change things quite a lot. For example, I am sort of assuming that Bob gets no utility at all from his money until he walks out of the casino with his winnings, i.e. having the money and still being in the casino is worth nothing to him, because he can’t buy stuff with it. Whereas you seem to be comparing Bob with his counterfactual at each round number, I am only interested in Bob at the very end of the process, when he walks away with his winnings to get all that utility. But your proposed Bob never walks away from the table with any winnings (assuming no round limit). If he still has winnings he doesn’t walk away.
Let’s put details on the scenario in two slightly different ways. (1) The “casino” is just a computer script where Bob can program in a strategy (bet it all every time), and then just type in the number of rounds (N). (Or, for your version of Bob, put the whole thing in a “while my_money > 0:” loop.) We could alternatively (2) imagine that Bob is in the casino playing each round one at a time, and that the time taken doing 1 round is a fixed utility cost of some small number (say 0.1). This doesn’t change anything for utility-maximising-Bob, and in fact the time cost of 1 more round relative to his expected gains shrinks over time as his money doubles up (later rounds are a better deal in expectation).
With these models I just see a system where Bob deterministically loses all his money. The longer he goes before going bust, the more of his time he wastes as well (in (2)).
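A rough sketch of the script in model (1) might look like the following. The starting bankroll of $25, the 60% win chance and the even-money payout are assumptions borrowed from the coin-flip game discussed elsewhere in the thread, and `play` is a hypothetical name.

```python
import random

def play(fraction, rounds=None, bankroll=25.0, p_win=0.6, seed=0):
    """Stake `fraction` of the bankroll each round at even odds.

    rounds=None mimics the `while my_money > 0` loop: keep playing until broke.
    (With fraction < 1 that loop effectively never ends, so pass `rounds` there.)
    Returns (final_bankroll, rounds_played).
    """
    rng = random.Random(seed)
    played = 0
    while bankroll > 0 and (rounds is None or played < rounds):
        stake = fraction * bankroll
        # Win: gain the stake. Lose: forfeit it (an all-in loss leaves exactly 0).
        bankroll += stake if rng.random() < p_win else -stake
        played += 1
    return bankroll, played

# All-in Bob inside the open-ended loop, for a handful of different seeds.
all_in_runs = [play(1.0, seed=s) for s in range(200)]
```

Every all-in run in the open-ended loop ends with a bankroll of zero; only `rounds_played` varies, and in model (2) that count times 0.1 would be the time cost Bob pays on the way.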
Kelly betting doesn’t actually fix my complaint. A Kelly betting Bob with no point at which they say “Yes, that is enough money, time to leave.” actually gets minus infinity utility in model (2) where doing a round costs a small but finite amount of utility in terms of the time spent. Because the money acquired doesn’t pay off till they leave, which they never do.
I think maybe you are right that it comes down to the utility function. Any agent (even the Kelly one) will behave in a way that comes across as obviously insane if we allow their utility function to go to infinity. Although I still don’t quite see how that infinity actually ever enters in this specific case. If we answer the infinite utility function with an infinite number of possible rounds then we can say with certainty that Bob never walks away with any winnings.
I agree infinity is what makes things go weird here, but as you say, not particularly weirder for Bob than for Kelly-Betting Bob (who also never leaves the casino, and also wraps in a while my_money > 0 loop).
But what you say here seems to undermine your original comment:
The problem with maximising expected utility is that Bob will sit there playing 1 more round, then another 1 more round, again and again until he eventually loses everything.
But KBB also sits there playing one more round, then another round. He doesn’t eventually lose everything, but he doesn’t leave either. This isn’t a problem with maximizing expected utility, it’s a problem with infinity.
At least to me Kelly betting fits in the same kind of space as the Newcomb paradox and (possibly) the prisoner’s dilemma. They all demonstrate that the optimal policy is not necessarily given by a sequence of optimal actions at every step.
But with this setup, it only demonstrates that if we wave our hands and talk about what happens after playing infinitely many rounds of a game we never want to stop playing.
If we aren’t talking about something like that, then optimal policy for the expected-money maximizer is given by taking the optimal action at every step.
Yes, my position did indeed shift, as you changed my mind and I thought about it in more depth. My original position was very much pro-Kelly. On thinking about your points I now think it is the while my_money > 0 aspect where the problem really lies. I still stand by the difference between optimal global policy and optimal action at each step distinction, because at each step the optimal policy (for Kelly or not) is to shake the dice another time. But, if this is taken as a policy we arrive at the while my_money > 0 break condition being the only escape, which is clearly a bad policy. (It guarantees that in any world we walk away, we walk away with nothing.)
Nod. I think we basically agree at this point. Certainly I don’t intend to claim that optimal policy and optimal actions always coincide (I have more thoughts on that but don’t want to get into them).
Since writing the original post, I’ve found Gwern’s post about a solution to something almost identical to Bob’s problem. In this post, he creates a decision tree for every possible move starting from the first one, determining final value at the leaf nodes. He then uses the Bellman equation and traditional expected value to back out what you should do in the earliest moves. The answer is that you bet approximately Kelly.
Gwern’s takeaway here is (I think) that expected value always works, but you have to make sure you’re solving the right problem. Using expected value naively at each step, discounting the temporal nature of the problem, leads to ruin.
I think many of the more philosophical points in my original post still stand, as doing backwards induction even on this toy problem is pretty difficult (it took his software 16 hours to find the solution). Collapsing a time series expected value problem to a one-shot Kelly problem saves a lot of effort, but to do that you need an ergodic statistic. Even once you’ve done that, you should still make sure the game is worth playing before you actually start betting.
it took his software 16 hours to find the solution
That’s just the maximally-inefficient-but-convenient interpreted version in R. For the Kelly Coin Flip Game, the fastest exact brute-force was 0.002h, not 16.000h, and it’d probably be less than half that if I ran it on my current 16-core machine instead of my laptop from 9 years ago. (For comparison, Feep & others got a similar speedup on another dynamic programming problem: taking it from the naive interpreted version of erroring out at problem sizes much past 300 due to memory usage problems to being able to solve problem sizes up to 133,787,000 in just 9 wallclock days. Quite something. And probably some of the tricks in the second problem could’ve been applied to speed up the first one even more.) And the real answer is that it takes 0.000h because Arthur found an exact formula which uses so few operations that I wasn’t sure how to benchmark it meaningfully beyond “seems to run in milliseconds” & so fast it looked like memoizing was slowing it down. (The original problem being too fast to compute is why I started making it harder by generalizing the problem.)
As usual, the convenient way to implement something is very rarely anywhere near the fastest, often by multiple orders of magnitude, and we must choose our poison: “fast, easy, general—pick two”.
I have no problem with the argument that ergodic formulas may be the limit of or provably identical to straightforward decision theory/reinforcement learning utility maximization over the actual decision problems rather than simplified strawmen, and may be convenient computational shortcuts. I just don’t find that very useful when relevant problems are finite enough that you lose a lot (eg in the coin-flip problem, KC loses a pretty substantial amount of money because even 300 rounds/years is still not enough for the convergence & often you need to act wildly different from KC), and often break the assumptions, and the ergodic stuff obscures all of this, completely ignoring what it’s a special-case of, and comes with a whole heap of puffery and PR.
Arthur found an exact formula which uses so few operations that I wasn’t sure how to benchmark it meaningfully
Oh, cool. I’ll have to read your post again more carefully.
rather than simplified strawmen
Myopic expectation maximization may be a bad argument, but I don’t think it’s a strawman. People do believe that you should expectation maximize on each step of a coin-flipping game, instead of over the full history of the game. They act on that belief and go bust, like 30% of the players in Haghani & Dewey. Those people would actually do better adopting an ergodic statistic.
I now understand that Bellman based RL learns a value function that ends up maximizing expected value over a history instead of myopically. That doesn’t mean that any AI agent using expectation maximization will do this. In particular, I worry that people will wrap a world model in naive expectation maximization and end up with an agent that goes bust in resources. This seems like something people are actually trying to do with LLMs.
Oh, cool. I’ll have to read your post again more carefully.
Yeah, it’s one of those ‘kitchen sink’-type posts. The point is less any individual result than creating a zoo of ‘here are some of the many ways to tackle the problem, and what exotic flora & fauna we observe along the way’. You don’t get the effect if you just look at one or two points.
They act on that belief and go bust, like 30% of the players in Haghani & Dewey. Those people would actually do better adopting an ergodic statistic.
Well, they go bust, yes, and would do better with almost any other strategy (since you can’t do worse than winning $0). But I don’t recall Haghani & Dewey saying that the 30%-busters were all doing greedy EV maximization and betting their entire bankroll at each timestep...? (There are many ways to overbet which are not greedy EV maximization.)
In particular, I worry that people will wrap a world model in naive expectation maximization and end up with an agent that goes bust in resources. This seems like something people are actually trying to do with LLMs.
Inasmuch as they are imitation-learning from humans and planning, that seems like less of a concern in the long run. However, to the extent that there is any fundamental tendency towards myopia, that might be a good thing for safety. Inducing various kinds of ‘myopia’ has been a perennial proposal for AI safety: if the AI isn’t planning out sufficiently long-term because eg it has a very high discount rate, then that reduces a lot of instrumental convergence pressure or reward-hacking potential—because all of that misbehavior is outside the planning window. (An ‘oracle AI’ can be seen as an extreme version where it cares about only the next time-step, in which it returns an answer.)
If we’re already sacrificing max utility to create a myopic agent that’s lower risk, why would we not also want it to maximize temporal average rather than ensemble average to reduce wipeout risk?
No, it isn’t. Gwern never says that anywhere, and it’s not true. This is a good example of what I’m saying.
For clarity the game is this. You start with $25 and you can bet any multiple of $0.01 up to the amount you have. A coin is flipped with a 60⁄40 bias in your favour. If you win you double the amount you bet, otherwise you lose it. There is a cap of $250, so after each bet you lose any money over this amount (so in fact you should never make a bet that could take you over). This continues for 300 rounds.
Bob’s edge is 20%, so the Kelly criterion would recommend that he bets $5. If he continues to use the Kelly criterion in every round (except if this would take him over the cap, in which case he bets to take him to the cap) he ends with an average of $238.04.
As explained on the page you link to, the optimal strategy and expected value can be calculated inductively based on the number of bets remaining. The optimal starting bet is $1.99, and if you continue to bet optimally your average amount of money is $246.61.
So in this game the optimal starting bet is only 20% of the Kelly bet. The Kelly strategy bets too riskily, and leaves $8.57 on the table compared to the optimal strategy.
Kelly isn’t optimal in any limit either. As the number of rounds goes to infinity, the optimal strategy is to bet just $0.01, since this maximises the likelihood of never going bankrupt. If instead the cap goes to infinity then the optimal strategy is to bet everything on every round. Of course you could tune the cap and the number of rounds together so that Kelly was optimal on the first bet, but then it still wouldn’t be optimal for subsequent bets.
(EDIT: It’s actually not certain that the optimal strategy in the first round is $1.99, since floating point accuracy in the computations becomes relevant and many starting bets give the same result. But $5 is so far from optimum that it genuinely did give a lower expected value, so we can say for certain that Kelly is not optimal.)
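The inductive calculation described above can be sketched roughly as follows. This is a simplified toy and not Gwern’s code: it assumes bets in whole dollars rather than cents, so it will not reproduce the $246.61 or $238.04 figures exactly, and the function names are hypothetical.

```python
def optimal_values(rounds, cap=250, p_win=0.6):
    """Backward induction over whole-dollar wealth levels 0..cap.

    Returns (value, first_bet): value[w] is the best achievable expected final
    wealth from w dollars with `rounds` bets left, and first_bet[w] is the
    smallest stake that achieves it on the first of those bets.
    """
    value = [float(w) for w in range(cap + 1)]  # zero bets left: keep what you hold
    first_bet = [0] * (cap + 1)
    for _ in range(rounds):
        new_value = value[:]            # staking 0 just carries the old value forward
        first_bet = [0] * (cap + 1)
        for w in range(1, cap):
            # Never stake more than you hold, or more than is needed to hit the cap.
            for bet in range(1, min(w, cap - w) + 1):
                ev = p_win * value[w + bet] + (1 - p_win) * value[w - bet]
                if ev > new_value[w]:
                    new_value[w] = ev
                    first_bet[w] = bet
        value = new_value
    return value, first_bet

def kelly_values(rounds, cap=250, p_win=0.6, fraction=0.2):
    """Expected final wealth when always staking ~20% of wealth, clipped at the cap."""
    value = [float(w) for w in range(cap + 1)]
    for _ in range(rounds):
        new_value = value[:]
        for w in range(1, cap):
            bet = min(round(fraction * w), cap - w)
            new_value[w] = p_win * value[w + bet] + (1 - p_win) * value[w - bet]
        value = new_value
    return value
```

Comparing `optimal_values(n)[0][25]` with `kelly_values(n)[25]` for various `n` gives a feel for how much Kelly leaves on the table in this coarse version; the full 300-round, one-cent-resolution problem needs far more computation than this toy.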
Hmm. I think we might be misunderstanding each other here.
When I say Gwern’s post leads to “approximately Kelly”, I’m not trying to say it’s exactly Kelly. I’m not even trying to say that it converges to Kelly. I’m trying to say that it’s much closer to Kelly than it is to myopic expectation maximization.
Similarly, I’m not trying to say that Kelly maximizes expected value. I am trying to say that expected value doesn’t summarize wipeout risk in a way that is intuitive for humans, and that those who expect myopic expected values to persist across a time series of games in situations like this will be very surprised.
I do think that people making myopic decisions in situations like Bob’s should in general bet Kelly instead of expected value maximizing. I think an understanding of what ergodicity is, and whether a statistic is ergodic, helps to explain why. Given this, I also think that it makes sense to ask whether you should be looking for bets that are more ergodic in their ensemble average (like index funds rather than poker).
In general, I find expectation maximization unsatisfying because I don’t think it deals well with wipeout risk. Reading Ole Peters helped me understand why people were so excited about Kelly, and reading this article by Gwern helped me understand that I had been interpreting expectation maximization in a very limited way in the first place.
In the limit of infinite bets like Bob’s with no cap, myopic expectation maximization at each step means that most runs will go bankrupt. I don’t find the extremely high returns in the infinitesimally probable regions to make up for that. I’d like a principled way of expressing that which doesn’t rely on having a specific type of utility function, and I think Peters’ ergodicity economics gets most but not all the way there.
Other than that, I don’t disagree with anything you’ve said.
I don’t find the extremely high returns in the infinitesimally probable regions to make up for that. I’d like a principled way of expressing that which doesn’t rely on having a specific type of utility function
This sounds impossible to me? Like, if we’re talking about agents with a utility function, then either that function is such that extremely high returns make up for extremely low probabilities, or it’s such that they don’t. If they do, there’s no argument you can make that this agent is mistaken, they simply value things differently than you. If you want to argue that the high returns aren’t worth the low probability, you’re going to need to make assumptions about their utility function.
I admit that I don’t know what ergodicity is (and I bounce off the wiki page). But if I put myself in the shoes of Bob whose utility function is linear in money… my anticipation is that he just doesn’t care. Like, you explain what ergodicity is to him, and point out that the process he’s following is non-ergodic. And he replies that yes, that’s true; but on the other hand, the process he’s following does optimize his expected money, which is the only thing he cares about. And there’s no ergodic process that maximizes his expected money. So he’s just going to keep on optimizing for the thing he cares about, thanks, and if you want to give up some expected money in exchange for ergodicity, that’s your right.
It’s not clear to me that it’s impossible, and I think it’s worth exploring the idea further before giving up on it. In particular, I think that saying “optimizing expected money is the thing that Bob cares about” assumes the conclusion. Bob cares about having the most money he can actually get, so I don’t see why he should do the thing that almost-surely leads to bankruptcy. In the limit as the number of bets goes to infinity, the probability of not being bankrupt will converge to 0. It’s weird to me that something of measure 0 probability can swamp the entirety of the rest of the probability.
I’d say that “optimizing expected money is the only thing Bob cares about” is an example, not an assumption or conclusion. If you want to argue that agents should care about ergodicity regardless of their utility function, then you need to argue that to the agent whose utility function is linear in money (and has no other terms, which I assumed but didn’t state in the previous comment).
Such an agent is indifferent between a certainty of 10^25 dollars, and a near-certainty of 0 dollars with a 10^-67 chance of 10^92 dollars. That’s simply what it means to have that utility function. If you think this agent, in the current hypothetical scenario, should bet Kelly to get ergodicity, then I think you just aren’t taking seriously what it means to have a utility function that’s linear in money.
In the limit as the number of bets goes to infinity
I spoke about limits and infinity in my conversation with Ben, my guess is it’s not worth me rehashing what I said there. Though I will add that I could make someone whose utility is log in money—i.e. someone who’d normally bet Kelly—behave similarly.
Not with quite the same setup. But I can offer them a sequence of bets such that with near-certainty (p→1 as t→∞), they’d eventually end up with $0.01 and then stop betting because they’ll under no circumstances risk going down to $0.
These bets can’t be of the form “payout is some fixed multiple of your stake and you get to choose your stake”, but I think it would work if I do “payout is exponential in your stake”. Or I could just say “minimum stake is your entire bankroll minus $0.01”—if I offer high enough payouts each time, they’ll take these bets, over and over, until they’re down to their last cent. Each time they’d prefer a smaller bet for less money, but if I’m not offering that they’d rather take the bet I am offering than not bet at all.
Also,
It’s weird to me that something of measure 0 probability can swamp the entirety of the rest of the probability.
The Dirac delta has this property too, and IIUC it’s a fairly standard tool.
Here we’re talking about something that’s weird in a different way, and perhaps weird in a way that’s harder to deal with. But again I think that’s more because of infinity than because of utility functions that are linear in money.
If instead the cap goes to infinity then the optimal strategy is to bet everything on every round.
This isn’t right unless I’m missing something—Kelly provides the fastest growth, while betting everything on every round is almost certain to bankrupt you.
If you’re trying to maximize expected money at the end of a fixed number of rounds, you do that by betting everything on every round (and, yes, almost certainly going bankrupt).
If that’s not what you’re trying to do, the optimal strategy is probably something else. But “how do we maximize expected money?” seems to be the question Gwern’s post is exploring. It’s just that with the $250 cap, maximizing expected money seems like a good idea (because you can almost always get close to $250), and with no cap, maximizing expected money seems like a terrible idea (because it gives you a 10^-67 chance of $10^92).
You don’t do Kelly because it’s good at maximizing expected money. You do it (when you do it) because you’re trying to do something other than maximize expected money.
Oh, I see. Yes, I agree. The idea to maximize the expected money would never occur to me (since that’s not how my utility function works), but I get it now.
It bankrupts you with probability 1 − 0.6^300, but in the other 0.6^300 of cases you get a sweet sweet $25 × 2^300. This nets you an expected $1.42 × 10^25.
Whereas Kelly betting only has an expected value of $25 × (0.6×1.2 + 0.4×0.8)^300 = $3220637.15.
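For anyone who wants to check these figures, here is a small sketch of the arithmetic, assuming a $25 start, 300 even-money rounds at a 60% win chance, and no cap. The “typical” Kelly outcome is an addition of mine, taken at the most likely win count of 180 out of 300.

```python
START, P_WIN, ROUNDS = 25.0, 0.6, 300
KELLY = 0.2  # Kelly fraction 2p - 1 for an even-money bet

# All-in: you survive only by winning every single round.
p_survive = P_WIN ** ROUNDS
all_in_mean = START * (2 * P_WIN) ** ROUNDS  # = p_survive * START * 2**ROUNDS

# Kelly: each round multiplies wealth by 1.2 on a win or 0.8 on a loss.
kelly_mean = START * (P_WIN * (1 + KELLY) + (1 - P_WIN) * (1 - KELLY)) ** ROUNDS

# Typical Kelly outcome: wealth after the most likely number of wins.
wins = round(P_WIN * ROUNDS)
kelly_typical = START * (1 + KELLY) ** wins * (1 - KELLY) ** (ROUNDS - wins)

print(f"all-in:  P(survive) = {p_survive:.2e}, mean = ${all_in_mean:.3g}")
print(f"Kelly:   mean = ${kelly_mean:,.2f}, typical = ${kelly_typical:,.0f}")
```

The all-in mean comes almost entirely from a survival probability of a few parts in 10^67, while the Kelly bettor’s typical result (around ten thousand dollars here) sits well below even the Kelly mean.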
Obviously humans don’t have linear utility functions, but my point is that the Kelly criterion still isn’t the right answer when you make the assumptions more realistic. You actually have to do the calculation with the actual utility function.
So, by optimal, you mean “almost certainly bankrupt you.” Then yes.
My definition of optimal is very different.
Obviously humans don’t have linear utility functions
I don’t think that’s the only reason—if I value something linearly, I still don’t want to play a game that almost certainly bankrupts me.
Obviously humans don’t have linear utility functions, but my point is that the Kelly criterion still isn’t the right answer when you make the assumptions more realistic.
I mean, that’s not obvious—the Kelly criterion gives you, in the example with the game, E(money) = $240, compared to $246.61 with the optimal strategy. That’s really close.
I don’t think that’s the only reason—if I value something linearly, I still don’t want to play a game that almost certainly bankrupts me.
I still think that’s because you intuitively know that bankruptcy is worse-than-linearly bad for you. If your utility function were truly linear then it’s true by definition that you would trade an arbitrary chance of going bankrupt for a tiny chance of a sufficiently large reward.
I mean, that’s not obvious—the Kelly criterion gives you, in the example with the game, E(money) = $240, compared to $246.61 with the optimal strategy. That’s really close.
Yes, but the game is very easy, so a lot of different strategies get you close to the cap.
Yes, but the game is very easy, so a lot of different strategies get you close to the cap.
I’ve been thinking about it, and I’m not sure if this is the case in the sense you mean it—expected money maximization doesn’t reflect human values at all, while the Kelly criterion mostly does, so if we make our assumptions more realistic, it should move us away from expected money maximization and towards the Kelly criterion, as opposed to moving us the other way.
Not maximising expected utility means that you expect to get less utility.
This isn’t actually right though—the concept of maximizing utility doesn’t quite overlap with expecting to have more or less utility at the end.
There are many examples where maximizing your expected utility means expecting to go broke, and not maximizing it means expecting to end up with more money.
(Even though, in this particular one-turn example, Bob should, in fact, expect to end up with more money if he bets everything.)
I wrote this with the assumption that Bob would care about maximizing his money at the end, and that there would be a high but not infinite number of rounds.
On my view, your questions mostly don’t change the analysis much. The only difference I can see is that if he literally only cares about beating Alice, he should go all in. In that case, having $1 less than Alice is equivalent to having $0. That’s not really how people use money though, and seems pretty artificial.
How are you expecting these answers to change things?
If Bob wants to maximise his money at the end, then he really should bet it all every round. I don’t see why you would want to use Kelly rather than maximising expected utility. Not maximising expected utility means that you expect to get less utility.
Well put. I agree that we should try to maximize the value that we expect to have after playing the game.
My claim here is that just because a statistic is named “expected value” doesn’t mean it’s accurately representing what we expect to happen in all types of situations. In Alice’s game, which is ergodic, traditional ensemble-averaging based expected value is highly accurate. The more tickets Alice buys, the more her actual value converges to the expected value.
In Bob’s game, which is non-ergodic, ensemble-based expected value is a poor statistic. It doesn’t actually predict the value that he would have. There’s no convergence between Bob’s value and “expected value”, so it seems strange to say that Bob “expects to get” the result of the ensemble average here.
You can certainly calculate Bob’s ensemble average, and it will have a higher result than the temporal average (as I state in my post). My claim is that this doesn’t help you, because it’s not representative of Bob’s game at all. In those situations, maximizing temporal average is the best you can do in reality, and the Kelly criterion maximizes that. Trying to maximize ensemble-based expected value here will wipe you out.
The problem with maximising expected utility is that Bob will sit there playing 1 more round, then another 1 more round, again and again until he eventually loses everything. Each step maximised the expected utility, but the policy overall guarantees zero utility with certainty, assuming Bob never runs out of time.
But, even as utility-maximising-Bob is saved from self-destruction by the clock, he shall think to himself “damn it! Out of time. That is really annoying, I want to keep doing this bet”.
At least to me Kelly betting fits in the same kind of space as the Newcomb paradox and (possibly) the prisoner’s dilemma. They all demonstrate that the optimal policy is not necessarily given by a sequence of optimal actions at every step.
Ignoring infinities, do you have the same objection to a game with a limit of 100 rounds? Utility-maximizing Bob will bet all his money 100 times, and lose all of it with probability around 1 − 10^-24, and he’ll endorse that because one time in 10^24 he is raking it in to the tune of 10^32 dollars or something. If you try to stop him he’ll be justly annoyed because you’re not letting him maximize his utility function.
Do you think that’s a problem for expected utility maximization? If so, it seems to me that your objection isn’t “optimal policy doesn’t come from optimal actions”. (At any rate I think that would be a bad objection, because optimal policy for this utility function does come from optimal actions at each step.) Rather, it seems to me that your objection is you don’t really believe Bob has that utility function.
Which, of course he doesn’t! No one has a utility function like that (or, indeed, at all). And I think that’s important to realize. But it’s a different objection, and I think that’s important to realize too.
Yes, I completely agree that the main reason in real life we would recommend against that strategy is that we instinctively (and usually correctly) feel that the person’s utility function is sub-linear in money. So that the 10^32 dollars with probability 10^-24 is bad. Obviously if 10^32 dollars is needed to cure some disease that will otherwise kill them immediately that changes things.
But, there is an objection that I think runs somewhat separately to that, which is the round limit. If we are operating under an optimal, reasonable policy, then (outside commitment tactic negotiations) I think it shouldn’t really be possible for a new outside constraint to improve our performance. Because if the constraint does improve performance then we could have adopted that constraint voluntarily and our policy was therefore not optimal. And the N-round limit is doing a fairly important job at improving Bob’s performance in this hypothetical. Otherwise Bob’s strategy is equivalent to “I bet everything, every time, until I lose it all.” Perhaps this second objection is just the old one in a new disguise (any agent with a finitely-bounded utility function would eventually reach a round number where they decide “actually I have enough now”, and thus restore my sense of what should be), but I am not sure that it is exactly the same.
Oh, I don’t think the round limit is fundamental here, I just don’t like infinities :p
At time zero, you can show Bob a bunch of probability distributions for his money at some finite time t, corresponding to betting strategies, and ask which he’d prefer. And his answer will always be that his favorite distribution is the one corresponding to “bet everything every time”. And when it gets to time t, Bob is almost certainly broke, but not actually regretting his decisions in the sense of “knowing what I knew then I could have done better”.
If we take the limit as t→∞… I’m not really sure this is a meaningful thing to do. I guess we could take the pointwise limit and see that the resulting function is 1 at 0 and 0 everywhere else, which is indeed a probability distribution we don’t like. But if we take the pointwise limit of the Kelly strategy, it’s 0 everywhere, which isn’t even a probability distribution. I don’t think we should use that as a reason to prefer the Kelly strategy. Maybe there are other limits we can take? (I’ve forgotten a lot of what I used to know.) But mostly I think this is a weird thing to try to do.
If we’re not taking the limit, if we just say Bob can play as long as he wants, then yes, he just keeps playing until he goes broke. But he endorses that behavior. There’s no point where he looks back and goes “I was an idiot”.
One thing I’d say here is that we don’t sum up or compare utilities at different times. Like, it would be tempting to say “with probability 1, Bob will go broke. And however much money he had at the time, with probability 1, his alter ego Kelly-Betting Bob will eventually have more money than that. So Bob would prefer to be Kelly-Betting Bob”. But that last sentence doesn’t hold; Bob knows that in the event he’d managed to stick it out that long, his wealth would so vastly dwarf Kelly-Betting Bob’s that it was worth the risks he took.
I understand your point, and I think I am sort of convinced. But it’s the sort of thing where minor details in the model can change things quite a lot. For example, I am sort of assuming that Bob gets no utility at all from his money until he walks out of the casino with his winnings, i.e. having the money and still being in the casino is worth nothing to him, because he can’t buy stuff with it. Whereas you seem to be comparing Bob with his counterfactual at each round number, while I am only interested in Bob at the very end of the process, when he walks away with his winnings to get all that utility. But your proposed Bob never walks away from the table with any winnings (assuming no round limit). If he still has winnings he doesn’t walk away.
Let’s put details on the scenario in two slightly different ways. (1) The “casino” is just a computer script where Bob can program in a strategy (bet it all every time), and then just type in the number of rounds (N). (Or, for your version of Bob, put the whole thing in a “while my_money > 0:” loop.) We could alternatively (2) imagine that Bob is in the casino playing each round one at a time, and that the time taken doing 1 round is a fixed utility cost of some small number (say 0.1). This doesn’t change anything for utility-maximising-Bob, and in fact the time costs for 1 more round relative to his expected gains shrink over time as his money doubles up. (Later rounds are a better deal in expectation.)
With these models I just see a system where Bob deterministically loses all his money. The longer he goes before going bust, the more of his time he wastes as well (in (2)).
Kelly betting doesn’t actually fix my complaint. A Kelly betting Bob with no point at which they say “Yes, that is enough money, time to leave.” actually gets minus infinity utility in model (2) where doing a round costs a small but finite amount of utility in terms of the time spent. Because the money acquired doesn’t pay off till they leave, which they never do.
I think maybe you are right that it comes down to the utility function. Any agent (even the Kelly one) will behave in a way that comes across as obviously insane if we allow their utility function to go to infinity. Although I still don’t quite see how that infinity actually ever enters in this specific case. If we answer the infinite utility function with an infinite number of possible rounds then we can say with certainty that Bob never walks away with any winnings.
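As a rough sketch of the "casino script" from model (1), with the “while my_money > 0” variant as an option: the 60/40 even-money coin, the $25 starting bankroll, the `max_rounds` safety valve and the function name are all assumptions added here so the example runs and terminates; none of them come from the comment above.

```python
import random

def play(stake_fraction, rounds=None, bankroll=25.0, p_win=0.6,
         max_rounds=10_000, seed=0):
    """Run the hypothetical casino script.

    rounds=N     -> model (1): play at most N rounds, stopping early if broke.
    rounds=None  -> the 'while my_money > 0' variant; max_rounds is a safety
                    valve added for this sketch so the loop always terminates.
    Returns (final_bankroll, rounds_played).
    """
    rng = random.Random(seed)
    limit = max_rounds if rounds is None else rounds
    played = 0
    while bankroll > 0 and played < limit:
        stake = bankroll * stake_fraction
        # Even-money payout: win the stake with probability p_win, else lose it.
        bankroll += stake if rng.random() < p_win else -stake
        played += 1
    return bankroll, played

all_in_money, all_in_rounds = play(1.0)   # bet everything, no round limit
kelly_money, kelly_rounds = play(0.2)     # assumed Kelly fraction, no round limit
print(all_in_money, all_in_rounds)
print(kelly_money, kelly_rounds)
```

In this sketch the all-in loop exits only through the bankruptcy condition, while the Kelly loop exits only because of the artificial `max_rounds` cap, which is the sense in which neither Bob ever chooses to leave. The 0.1 per-round time cost from model (2) could be layered on by subtracting `0.1 * rounds_played` from whatever utility is assigned to the final bankroll.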
I agree infinity is what makes things go weird here, but as you say, not particularly weirder for Bob than for Kelly-Betting Bob (who also never leaves the casino, and also wraps in a “while my_money > 0” loop).
But what you say here seems to undermine your original comment:
But KBB also sits there playing one more round, then another round. He doesn’t eventually lose everything, but he doesn’t leave either. This isn’t a problem with maximizing expected utility, it’s a problem with infinity.
But with this setup, it only demonstrates that if we wave our hands and talk about what happens after playing infinitely many rounds of a game we never want to stop playing.
If we aren’t talking about something like that, then optimal policy for the expected-money maximizer is given by taking the optimal action at every step.
Yes, my position did indeed shift, as you changed my mind and I thought about it in more depth. My original position was very much pro-Kelly. On thinking about your points I now think it is the “while my_money > 0” aspect where the problem really lies. I still stand by the distinction between optimal global policy and optimal action at each step, because at each step the optimal policy (for Kelly or not) is to shake the dice another time. But, if this is taken as a policy we arrive at the “while my_money > 0” break condition being the only escape, which is clearly a bad policy. (It guarantees that in any world we walk away, we walk away with nothing.)
Nod. I think we basically agree at this point. Certainly I don’t intend to claim that optimal policy and optimal actions always coincide (I have more thoughts on that but don’t want to get into them).
Since writing the original post, I’ve found Gwern’s post about a solution to something almost identical to Bob’s problem. In this post, he creates a decision tree for every possible move starting from the first one, determining final value at the leaf nodes. He then uses the Bellman equation and traditional expected value to back out what you should do in the earliest moves. The answer is that you bet approximately Kelly.
Gwern’s takeaway here is (I think) that expected value always works, but you have to make sure you’re solving the right problem. Using expected value naively at each step, discounting the temporal nature of the problem, leads to ruin.
I think many of the more philosophical points in my original post still stand, as doing backwards induction even on this toy problem is pretty difficult (it took his software 16 hours to find the solution). Collapsing a time series expected value problem to a one-shot Kelly problem saves a lot of effort, but to do that you need an ergodic statistic. Even once you’ve done that, you should still make sure the game is worth playing before you actually start betting.
That’s just the maximally-inefficient-but-convenient interpreted version in R. For the Kelly Coin Flip Game, the fastest exact brute-force was 0.002h, not 16.000h, and it’d probably be less than half that if I ran it on my current 16-core machine instead of my laptop from 9 years ago. (For comparison, Feep & others got a similar speedup on another dynamic programming problem: taking it from the naive interpreted version of erroring out at problem sizes much past 300 due to memory usage problems to being able to solve problem sizes up to 133,787,000 in just 9 wallclock days. Quite something. And probably some of the tricks in the second problem could’ve been applied to speed up the first one even more.) And the real answer is that it takes 0.000h because Arthur found an exact formula which uses so few operations that I wasn’t sure how to benchmark it meaningfully beyond “seems to run in milliseconds” & so fast it looked like memoizing was slowing it down. (The original problem being too fast to compute is why I started making it harder by generalizing the problem.)
As usual, the convenient way to implement something is very rarely anywhere near the fastest, often by multiple orders of magnitude, and we must choose our poison: “fast, easy, general—pick two”.
I have no problem with the argument that ergodic formulas may be the limit of or provably identical to straightforward decision theory/reinforcement learning utility maximization over the actual decision problems rather than simplified strawmen, and may be convenient computational shortcuts. I just don’t find that very useful when relevant problems are finite enough that you lose a lot (eg in the coin-flip problem, KC loses a pretty substantial amount of money because even 300 rounds/years is still not enough for the convergence & often you need to act wildly different from KC), and often break the assumptions, and the ergodic stuff obscures all of this, completely ignoring what it’s a special-case of, and comes with a whole heap of puffery and PR.
Oh, cool. I’ll have to read your post again more carefully.
Myopic expectation maximization may be a bad argument, but I don’t think it’s a strawman. People do believe that you should expectation maximize on each step of a coin-flipping game, instead of over the full history of the game. They act on that belief and go bust, like 30% of the players in Haghani & Dewey. Those people would actually do better adopting an ergodic statistic.
I now understand that Bellman based RL learns a value function that ends up maximizing expected value over a history instead of myopically. That doesn’t mean that any AI agent using expectation maximization will do this. In particular, I worry that people will wrap a world model in naive expectation maximization and end up with an agent that goes bust in resources. This seems like something people are actually trying to do with LLMs.
Yeah, it’s one of those ‘kitchen sink’-type posts. The point is less any individual result than creating a zoo of ‘here are some of the many ways to tackle the problem, and what exotic flora & fauna we observe along the way’. You don’t get the effect if you just look at one or two points.
Well, they go bust, yes, and would do better with almost any other strategy (since you can’t do worse than winning $0). But I don’t recall Haghani & Dewey saying that the 30%-busters were all doing greedy EV maximization and betting their entire bankroll at each timestep...? (There are many ways to overbet which are not greedy EV maximization.)
Inasmuch as they are imitation-learning from humans and planning, that seems like less of a concern in the long run. However, to the extent that there is any fundamental tendency towards myopia, that might be a good thing for safety. Inducing various kinds of ‘myopia’ has been a perennial proposal for AI safety: if the AI isn’t planning out sufficiently long-term because eg it has a very high discount rate, then that reduces a lot of instrumental convergence pressure or reward-hacking potential—because all of that misbehavior is outside the planning window. (An ‘oracle AI’ can be seen as an extreme version where it cares about only the next time-step, in which it returns an answer.)
If we’re already sacrificing max utility to create a myopic agent that’s lower risk, why would we not also want it to maximize temporal average rather than ensemble average to reduce wipeout risk?
No, it isn’t. Gwern never says that anywhere, and it’s not true. This is a good example of what I’m saying.
For clarity the game is this. You start with $25 and you can bet any multiple of $0.01 up to the amount you have. A coin is flipped with a 60⁄40 bias in your favour. If you win you double the amount you bet, otherwise you lose it. There is a cap of $250, so after each bet you lose any money over this amount (so in fact you should never make a bet that could take you over). This continues for 300 rounds.
Bob’s edge is 20%, so the Kelly criterion would recommend that he bets $5. If he continues to use the Kelly criterion in every round (except if this would take him over the cap, in which case he bets to take him to the cap) he ends with an average of $238.04.
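For readers who want the step behind the $5 figure, here is the standard Kelly derivation for an even-money bet, written out as a sketch. The symbols g, f, p and q are introduced here for illustration: f is the fraction of the bankroll staked, p = 0.6 the win probability, and q = 0.4 the loss probability.

```latex
\begin{align*}
g(f)  &= p \ln(1+f) + q \ln(1-f)
      && \text{expected log growth per round} \\
g'(f) &= \frac{p}{1+f} - \frac{q}{1-f} = 0
      && \text{first-order condition} \\
p(1-f) &= q(1+f) \;\Longrightarrow\; f^{*} = p - q = 0.6 - 0.4 = 0.2 \\
\text{bet} &= f^{*} \times \$25 = \$5
\end{align*}
```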
As explained on the page you link to, the optimal strategy and expected value can be calculated inductively based on the number of bets remaining. The optimal starting bet is $1.99, and if you continue to bet optimally your average amount of money is $246.61.
So in this game the optimal starting bet is only about 40% of the Kelly bet ($1.99 against $5). The Kelly strategy bets too riskily, and leaves $8.57 on the table compared to the optimal strategy.
Kelly isn’t optimal in any limit either. As the number of rounds goes to infinity, the optimal strategy is to bet just $0.01, since this maximises the likelihood of never going bankrupt. If instead the cap goes to infinity then the optimal strategy is to bet everything on every round. Of course you could tune the cap and the number of rounds together so that Kelly was optimal on the first bet, but then it still wouldn’t be optimal for subsequent bets.
(EDIT: It’s actually not certain that the optimal strategy in the first round is $1.99, since floating point accuracy in the computations becomes relevant and many starting bets give the same result. But $5 is so far from optimum that it genuinely did give a lower expected value, so we can say for certain that Kelly is not optimal.)
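The inductive calculation described above can be sketched on a coarser version of the game. This is not Gwern's code: it assumes whole-dollar bets instead of a $0.01 grid and only 50 rounds instead of 300 so that plain Python finishes quickly, and the function names are made up for the example. It therefore will not reproduce the $246.61 and $238.04 figures, only the qualitative comparison.

```python
def solve(start=25, cap=250, rounds=50, p=0.6, policy=None):
    """Expected final wealth by backward induction over rounds remaining.

    policy=None -> choose the best bet at every (wealth, rounds-left) state.
    policy=f    -> evaluate the fixed rule bet = f(wealth, cap) instead.
    Wealth and bets are whole dollars (a simplification of the real game).
    """
    value = [float(w) for w in range(cap + 1)]  # 0 rounds left: value = wealth
    for _ in range(rounds):
        new_value = []
        for w in range(cap + 1):
            max_bet = min(w, cap - w)  # never make a bet that could pass the cap
            if policy is None:
                bets = range(max_bet + 1)
            else:
                bets = [min(policy(w, cap), max_bet)]
            # Win with probability p (wealth + b), lose otherwise (wealth - b).
            new_value.append(max(p * value[w + b] + (1 - p) * value[w - b]
                                 for b in bets))
        value = new_value
    return value[start]

def kelly_bet(wealth, cap):
    """Kelly stake for a 60/40 even-money coin: 20% of wealth, rounded down."""
    return wealth // 5

optimal = solve()
kelly = solve(policy=kelly_bet)
print(f"optimal: {optimal:.2f}  capped Kelly: {kelly:.2f}")
```

Because the capped Kelly rule is one of the policies the optimiser is allowed to pick, the optimal value can never come out below the Kelly value; how large the gap is depends on the grid, the cap and the number of rounds chosen.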
Hmm. I think we might be misunderstanding each other here.
When I say Gwern’s post leads to “approximately Kelly”, I’m not trying to say it’s exactly Kelly. I’m not even trying to say that it converges to Kelly. I’m trying to say that it’s much closer to Kelly than it is to myopic expectation maximization.
Similarly, I’m not trying to say that Kelly maximizes expected value. I am trying to say that expected value doesn’t summarize wipeout risk in a way that is intuitive for humans, and that those who expect myopic expected values to persist across a time series of games in situations like this will be very surprised.
I do think that people making myopic decisions in situations like Bob’s should in general bet Kelly instead of expected value maximizing. I think an understanding of what ergodicity is, and whether a statistic is ergodic, helps to explain why. Given this, I also think that it makes sense to ask whether you should be looking for bets that are more ergodic in their ensemble average (like index funds rather than poker).
In general, I find expectation maximization unsatisfying because I don’t think it deals well with wipeout risk. Reading Ole Peters helped me understand why people were so excited about Kelly, and reading this article by Gwern helped me understand that I had been interpreting expectation maximization in a very limited way in the first place.
In the limit of infinite bets like Bob’s with no cap, myopic expectation maximization at each step means that most runs will go bankrupt. I don’t find the extremely high returns in the infinitesimally probable regions to make up for that. I’d like a principled way of expressing that which doesn’t rely on having a specific type of utility function, and I think Peters’ ergodicity economics gets most but not all the way there.
Other than that, I don’t disagree with anything you’ve said.
This sounds impossible to me? Like, if we’re talking about agents with a utility function, then either that function is such that extremely high returns make up for extremely low probabilities, or it’s such that they don’t. If they do, there’s no argument you can make that this agent is mistaken, they simply value things differently than you. If you want to argue that the high returns aren’t worth the low probability, you’re going to need to make assumptions about their utility function.
I admit that I don’t know what ergodicity is (and I bounce off the wiki page). But if I put myself in the shoes of Bob whose utility function is linear in money… my anticipation is that he just doesn’t care. Like, you explain what ergodicity is to him, and point out that the process he’s following is non-ergodic. And he replies that yes, that’s true; but on the other hand, the process he’s following does optimize his expected money, which is the only thing he cares about. And there’s no ergodic process that maximizes his expected money. So he’s just going to keep on optimizing for the thing he cares about, thanks, and if you want to give up some expected money in exchange for ergodicity, that’s your right.
It’s not clear to me that it’s impossible, and I think it’s worth exploring the idea further before giving up on it. In particular, I think that saying “optimizing expected money is the thing that Bob cares about” assumes the conclusion. Bob cares about having the most money he can actually get, so I don’t see why he should do the thing that almost-surely leads to bankruptcy. In the limit as the number of bets goes to infinity, the probability of not being bankrupt will converge to 0. It’s weird to me that something of measure 0 probability can swamp the entirety of the rest of the probability.
I’d say that “optimizing expected money is the only thing Bob cares about” is an example, not an assumption or conclusion. If you want to argue that agents should care about ergodicity regardless of their utility function, then you need to argue that to the agent whose utility function is linear in money (and has no other terms, which I assumed but didn’t state in the previous comment).
Such an agent is indifferent between a certainty of 10^25 dollars, and a near-certainty of 0 dollars with a 10^-67 chance of 10^92 dollars. That’s simply what it means to have that utility function. If you think this agent, in the current hypothetical scenario, should bet Kelly to get ergodicity, then I think you just aren’t taking seriously what it means to have a utility function that’s linear in money.
I spoke about limits and infinity in my conversation with Ben, my guess is it’s not worth me rehashing what I said there. Though I will add that I could make someone whose utility is log in money—i.e. someone who’d normally bet Kelly—behave similarly.
Not with quite the same setup. But I can offer them a sequence of bets such that with near-certainty (p→1 as t→∞), they’d eventually end up with $0.01 and then stop betting because they’ll under no circumstances risk going down to $0.
These bets can’t be of the form “payout is some fixed multiple of your stake and you get to choose your stake”, but I think it would work if I do “payout is exponential in your stake”. Or I could just say “minimum stake is your entire bankroll minus $0.01”—if I offer high enough payouts each time, they’ll take these bets, over and over, until they’re down to their last cent. Each time they’d prefer a smaller bet for less money, but if I’m not offering that they’d rather take the bet I am offering than not bet at all.
Also,
The Dirac delta has this property too, and IIUC it’s a fairly standard tool.
Here we’re talking about something that’s weird in a different way, and perhaps weird in a way that’s harder to deal with. But again I think that’s more because of infinity than because of utility functions that are linear in money.
This isn’t right unless I’m missing something—Kelly provides the fastest growth, while betting everything on every round is almost certain to bankrupt you.
(e: posted overlapping with Oscar_Cunningham)
If you’re trying to maximize expected money at the end of a fixed number of rounds, you do that by betting everything on every round (and, yes, almost certainly going bankrupt).
If that’s not what you’re trying to do, the optimal strategy is probably something else. But “how do we maximize expected money?” seems to be the question Gwern’s post is exploring. It’s just that with the $250 cap, maximizing expected money seems like a good idea (because you can almost always get close to $250), and with no cap, maximizing expected money seems like a terrible idea (because it gives you a 10^-67 chance of $10^92).
You don’t do Kelly because it’s good at maximizing expected money. You do it (when you do it) because you’re trying to do something other than maximize expected money.
Oh, I see. Yes, I agree. The idea to maximize the expected money would never occur to me (since that’s not how my utility function works), but I get it now.
It bankrupts you with probability 1 − 0.6^300, but in the other 0.6^300 of cases you get a sweet sweet $25 × 2^300. This nets you an expected $1.42 × 10^25.
Whereas Kelly betting only has an expected value of $25 × (0.6×1.2 + 0.4×0.8)^300 = $3220637.15.
Obviously humans don’t have linear utility functions, but my point is that the Kelly criterion still isn’t the right answer when you make the assumptions more realistic. You actually have to do the calculation with the actual utility function.
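The figures in this comment can be checked in a few lines. This sketch assumes the uncapped version of the game ($25 start, 300 rounds, 60/40 even-money coin) and a Kelly fraction of 0.2; the variable names are illustrative.

```python
start, rounds, p = 25, 300, 0.6

# All-in: survive only by winning every flip, doubling each time.
p_survive_all_in = p ** rounds
payoff_if_survive = start * 2 ** rounds
ev_all_in = start * (2 * p) ** rounds  # = payoff_if_survive * p_survive_all_in

# Kelly: stake 20% each round; expected multiplier is 0.6*1.2 + 0.4*0.8 = 1.04.
kelly_f = 2 * p - 1
ev_kelly = start * (p * (1 + kelly_f) + (1 - p) * (1 - kelly_f)) ** rounds

print(f"all-in survival probability: {p_survive_all_in:.3g}")
print(f"all-in payoff if it survives: {payoff_if_survive:.3g}")
print(f"all-in expected money: {ev_all_in:.3g}")
print(f"Kelly expected money:  {ev_kelly:.2f}")
```

Note that both numbers are ensemble averages; the all-in figure is carried entirely by the single all-wins path.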
So, by optimal, you mean “almost certainly bankrupt you.” Then yes.
My definition of optimal is very different.
I don’t think that’s the only reason: if I value something linearly, I still don’t want to play a game that almost certainly bankrupts me.
I mean, that’s not obvious—the Kelly criterion gives you, in the example with the game, E(money) = $240, compared to $246.61 with the optimal strategy. That’s really close.
I still think that’s because you intuitively know that bankruptcy is worse-than-linearly bad for you. If your utility function were truly linear then it’s true by definition that you would trade an arbitrary chance of going bankrupt for a tiny chance of a sufficiently large reward.
Yes, but the game is very easy, so a lot of different strategies get you close to the cap.
I’ve been thinking about it, and I’m not sure if this is the case in the sense you mean it. Expected money maximization doesn’t reflect human values at all, while the Kelly criterion mostly does, so if we make our assumptions more realistic, it should move us away from expected money maximization and towards the Kelly criterion, as opposed to moving us the other way.
This isn’t actually right though—the concept of maximizing utility doesn’t quite overlap with expecting to have more or less utility at the end.
There are many examples where maximizing your expected utility means expecting to go broke, and not maximizing it means expecting to end up with more money.
(Even though, in this particular one-turn example, Bob should, in fact, expect to end up with more money if he bets everything.)