it took his software 16 hours to find the solution
That’s just the maximally-inefficient-but-convenient interpreted version in R. For the Kelly Coin Flip Game, the fastest exact brute-force was 0.002h, not 16.000h, and it’d probably take less than half that if I ran it on my current 16-core machine instead of my laptop from 9 years ago. (For comparison, Feep & others got a similar speedup on another dynamic programming problem, taking it from a naive interpreted version, which errored out at problem sizes much past 300 due to memory usage, to solving problem sizes up to 133,787,000 in just 9 wallclock days. Quite something. And probably some of the tricks in the second problem could’ve been applied to speed up the first one even more.) And the real answer is that it takes 0.000h, because Arthur found an exact formula which uses so few operations that I wasn’t sure how to benchmark it meaningfully beyond “seems to run in milliseconds”; it was so fast that memoizing looked like it was slowing it down. (The original problem being so fast to solve is why I started making it harder by generalizing it.)
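For concreteness, here is a minimal sketch of the kind of brute-force backward induction being benchmarked (in Python rather than the original R, and not the actual implementation): the state is (rounds left, wealth), the value of a state is the expected final capped wealth under the best bet, and it is computed by memoized recursion. The parameters (60% heads, $25 start, $250 cap) follow the Haghani & Dewey setup; the $1 bet grid and the short horizon are simplifications for illustration.

```python
from functools import lru_cache

# Assumed Haghani & Dewey-style parameters: 60%-heads coin, $25 start, $250 cap.
P_WIN, START, CAP = 0.6, 25, 250
STEP = 1  # coarse $1 bet grid for illustration; finer grids blow up the state space

@lru_cache(maxsize=None)
def value(rounds_left: int, wealth: int) -> float:
    """Expected final (capped) wealth from this state under optimal play."""
    if rounds_left == 0 or wealth <= 0 or wealth >= CAP:
        return float(min(max(wealth, 0), CAP))
    best = 0.0
    for bet in range(0, wealth + 1, STEP):  # bet = 0 means sitting out this flip
        ev = (P_WIN * value(rounds_left - 1, min(wealth + bet, CAP))
              + (1 - P_WIN) * value(rounds_left - 1, wealth - bet))
        best = max(best, ev)
    return best

# A short horizon runs instantly; the full 300-round game with fine-grained
# bets is the regime where the runtime differences discussed above appear.
print(value(20, START))
```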
As usual, the convenient way to implement something is very rarely anywhere near the fastest, often by multiple orders of magnitude, and we must choose our poison: “fast, easy, general—pick two”.
I have no problem with the argument that ergodic formulas may be the limit of, or provably identical to, straightforward decision theory/reinforcement learning utility maximization over the actual decision problems rather than simplified strawmen, and may be convenient computational shortcuts. I just don’t find that very useful when relevant problems are finite enough that you lose a lot (eg in the coin-flip problem, KC loses a pretty substantial amount of money because even 300 rounds/years is still not enough for convergence, & often you need to act wildly differently from KC), and often break the assumptions; the ergodic stuff obscures all of this, completely ignores what it’s a special case of, and comes with a whole heap of puffery and PR.
Arthur found an exact formula which uses so few operations that I wasn’t sure how to benchmark it meaningfully
Oh, cool. I’ll have to read your post again more carefully.
rather than simplified strawmen
Myopic expectation maximization may be a bad argument, but I don’t think it’s a strawman. People do believe that you should expectation-maximize on each step of a coin-flipping game, instead of over the full history of the game. They act on that belief and go bust, like 30% of the players in Haghani & Dewey. Those people would actually do better adopting an ergodic statistic.
I now understand that Bellman-based RL learns a value function that ends up maximizing expected value over a whole history rather than myopically. That doesn’t mean that any AI agent using expectation maximization will do this. In particular, I worry that people will wrap a world model in naive expectation maximization and end up with an agent that goes bust in resources. This seems like something people are actually trying to do with LLMs.
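A toy illustration of that difference (a sketch only, using exact dynamic programming as a stand-in for a learned value function, with made-up numbers: a 60% coin, even-money integer bets, wealth capped at 4, three rounds): the one-step-greedy agent and the value-function agent choose different bets from the same state.

```python
# Toy contrast: myopic one-step expectation maximization vs a Bellman-style
# value function, in a tiny capped coin-flip game.
P, CAP, ROUNDS = 0.6, 4, 3

def immediate_ev(wealth, bet):
    """Expected wealth right after a single bet (the myopic objective)."""
    return P * min(wealth + bet, CAP) + (1 - P) * (wealth - bet)

def V(rounds_left, wealth):
    """Bellman value: expected final wealth under optimal play from here on."""
    if rounds_left == 0 or wealth == 0 or wealth >= CAP:
        return min(wealth, CAP)
    return max(P * V(rounds_left - 1, min(wealth + b, CAP))
               + (1 - P) * V(rounds_left - 1, wealth - b)
               for b in range(wealth + 1))

wealth = 2
myopic_bet = max(range(wealth + 1), key=lambda b: immediate_ev(wealth, b))
bellman_bet = max(range(wealth + 1),
                  key=lambda b: P * V(ROUNDS - 1, min(wealth + b, CAP))
                              + (1 - P) * V(ROUNDS - 1, wealth - b))
print(myopic_bet, bellman_bet)   # the greedy agent stakes everything; the
                                 # lookahead agent bets less, because going
                                 # bust forfeits the remaining favorable flips
```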
Oh, cool. I’ll have to read your post again more carefully.
Yeah, it’s one of those ‘kitchen sink’-type posts. The point is less any individual result than creating a zoo of ‘here are some of the many ways to tackle the problem, and what exotic flora & fauna we observe along the way’. You don’t get the effect if you just look at one or two points.
They act on that belief and go bust, like 30% of the players in Haghani & Dewey. Those people would actually do better adopting an ergodic statistic.
Well, they go bust, yes, and would do better with almost any other strategy (since you can’t do worse than winning $0). But I don’t recall Haghani & Dewey saying that the 30%-busters were all doing greedy EV maximization and betting their entire bankroll at each timestep...? (There are many ways to overbet which are not greedy EV maximization.)
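A small Monte Carlo sketch of that overbetting spectrum (simulated, not the Haghani & Dewey data; the bust threshold and the stop-at-the-cap rule are modeling assumptions): in this toy model, betting the whole bankroll each flip busts most of the time, milder overbetting fractions still bust at substantial rates, and the Kelly fraction rarely does.

```python
import random

# 60%-heads coin, 300 flips, $25 start, $250 cap, betting a fixed fraction of
# bankroll each flip. "Bust" = bankroll below one cent (an assumption).
P_WIN, ROUNDS, START, CAP, BUST_AT = 0.6, 300, 25.0, 250.0, 0.01

def bust_rate(fraction: float, trials: int = 10_000, seed: int = 0) -> float:
    rng = random.Random(seed)
    busts = 0
    for _ in range(trials):
        w = START
        for _ in range(ROUNDS):
            bet = fraction * w
            w = min(w + bet, CAP) if rng.random() < P_WIN else w - bet
            if w < BUST_AT:          # effectively unable to keep betting
                busts += 1
                break
            if w >= CAP:             # treat hitting the cap as ending the game
                break
    return busts / trials

for f in (0.2, 0.5, 0.8, 1.0):       # 0.2 is the Kelly fraction for this coin
    print(f"betting {f:.0%} of bankroll each flip: bust rate ≈ {bust_rate(f):.1%}")
```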
In particular, I worry that people will wrap a world model in naive expectation maximization and end up with an agent that goes bust in resources. This seems like something people are actually trying to do with LLMs.
Inasmuch as they are imitation-learning from humans and planning, that seems like less of a concern in the long run. However, to the extent that there is any fundamental tendency towards myopia, that might be a good thing for safety. Inducing various kinds of ‘myopia’ has been a perennial proposal for AI safety: if the AI isn’t planning out sufficiently long-term because eg it has a very high discount rate, then that reduces a lot of instrumental convergence pressure or reward-hacking potential—because all of that misbehavior is outside the planning window. (An ‘oracle AI’ can be seen as an extreme version where it cares about only the next time-step, in which it returns an answer.)
If we’re already sacrificing max utility to create a myopic agent that’s lower risk, why would we not also want it to maximize temporal average rather than ensemble average to reduce wipeout risk?
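(For readers unfamiliar with the temporal-vs-ensemble distinction, the standard ergodicity-economics toy example makes it concrete; this is an illustrative aside, not a claim about the proposal above: a bet that multiplies wealth by 1.5 or 0.6 with equal probability has a positive ensemble-average return but a negative time-average growth rate, so almost every individual trajectory decays.)

```python
import math, random, statistics

# Each flip multiplies wealth by 1.5 on heads or 0.6 on tails, p = 0.5.
UP, DOWN, FLIPS, PLAYERS = 1.5, 0.6, 100, 10_000
rng = random.Random(0)

# Ensemble average: 0.5*1.5 + 0.5*0.6 = 1.05, i.e. +5% expected per flip.
print("ensemble-average factor per flip:", 0.5 * UP + 0.5 * DOWN)

# Time average: exp(E[log factor]) = sqrt(1.5 * 0.6) ≈ 0.949, i.e. a typical
# single trajectory shrinks about 5% per flip despite the positive expectation.
print("time-average factor per flip:", math.sqrt(UP * DOWN))

# The median player's final wealth tracks the time average, not the ensemble one.
finals = []
for _ in range(PLAYERS):
    w = 1.0
    for _ in range(FLIPS):
        w *= UP if rng.random() < 0.5 else DOWN
    finals.append(w)
print("median final wealth after", FLIPS, "flips:", statistics.median(finals))
```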