Arthur found an exact formula which uses so few operations that I wasn’t sure how to benchmark it meaningfully
Oh, cool. I’ll have to read your post again more carefully.
rather than simplified strawmen
Myopic expectation maximization may be a bad argument, but I don’t think it’s a strawman. People do believe that you should expectation maximize on each step of a coin-flipping game, instead of over the full history of the game. They act on that belief and go bust, like 30% of the players in Haghani & Dewey. Those people would actually do better adopting an ergodic statistic.
I now understand that Bellman-based RL learns a value function that ends up maximizing expected value over a history rather than myopically. That doesn’t mean that every AI agent using expectation maximization will do this. In particular, I worry that people will wrap a world model in naive expectation maximization and end up with an agent that goes bust in resources. This seems like something people are actually trying to do with LLMs.
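A minimal sketch of the failure mode I have in mind, under stated assumptions: a 60% coin at even odds and a $25 starting bankroll (meant to echo the Haghani & Dewey setup as I remember it), with 300 flips, a one-cent "bust" threshold, and the names `play` and `bust_rate` all made up for illustration. Maximizing the expected bankroll after each single flip means staking everything, since the per-flip expectation is linear in the stake.

```python
import random

def play(fraction, p_win=0.6, bankroll=25.0, flips=300, rng=None):
    """Bet a fixed fraction of the current bankroll on every flip (even odds)."""
    if rng is None:
        rng = random.Random()
    for _ in range(flips):
        stake = fraction * bankroll
        bankroll += stake if rng.random() < p_win else -stake
        if bankroll < 0.01:  # assumption: under one cent counts as bust
            return 0.0
    return bankroll

def bust_rate(fraction, games=2000, seed=0):
    """Share of simulated games that end bust for a given bet fraction."""
    rng = random.Random(seed)
    busts = sum(play(fraction, rng=rng) == 0.0 for _ in range(games))
    return busts / games

if __name__ == "__main__":
    # fraction=1.0 is the per-flip EV maximizer; 0.2 = 2*0.6 - 1 is the Kelly fraction.
    for fraction in (1.0, 0.2):
        print(fraction, bust_rate(fraction))
```

The all-in player survives only by winning every flip, so over 300 flips it goes bust for all practical purposes, while the Kelly bettor almost never does. This says nothing about what the real players in the study were doing; it only shows what literal per-step EV maximization does in this toy game.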
Oh, cool. I’ll have to read your post again more carefully.
Yeah, it’s one of those ‘kitchen sink’-type posts. The point is less any individual result than creating a zoo of ‘here are some of the many ways to tackle the problem, and what exotic flora & fauna we observe along the way’. You don’t get the effect if you just look at one or two points.
They act on that belief and go bust, like 30% of the players in Haghani & Dewey. Those people would actually do better adopting an ergodic statistic.
Well, they go bust, yes, and would do better with almost any other strategy (since you can’t do worse than winning $0). But I don’t recall Haghani & Dewey saying that the 30%-busters were all doing greedy EV maximization and betting their entire bankroll at each timestep...? (There are many ways to overbet which are not greedy EV maximization.)
In particular, I worry that people will wrap a world model in naive expectation maximization and end up with an agent that goes bust in resources. This seems like something people are actually trying to do with LLMs.
Inasmuch as they are imitation-learning from humans and planning, that seems like less of a concern in the long run. However, to the extent that there is any fundamental tendency towards myopia, that might be a good thing for safety. Inducing various kinds of ‘myopia’ has been a perennial proposal for AI safety: if the AI isn’t planning sufficiently far ahead because, e.g., it has a very high discount rate, then that reduces a lot of instrumental convergence pressure or reward-hacking potential, because all of that misbehavior is outside the planning window. (An ‘oracle AI’ can be seen as an extreme version where it cares about only the next time-step, in which it returns an answer.)
If we’re already sacrificing max utility to create a myopic agent that’s lower risk, why would we not also want it to maximize the time average rather than the ensemble average to reduce wipeout risk?
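To make the two objectives concrete, here is a rough sketch for a single even-odds bet of a fraction f of the bankroll. The 60% win probability, the grid of candidate fractions, and the function names are assumptions for illustration. The ensemble average is the expected one-step multiplier of the bankroll; the time average is the expected log of that multiplier, which is what governs long-run growth along a single trajectory.

```python
import math

def ensemble_growth(f, p=0.6):
    """Expected one-step wealth multiplier: p*(1 + f) + (1 - p)*(1 - f)."""
    return p * (1 + f) + (1 - p) * (1 - f)

def time_growth(f, p=0.6):
    """Expected one-step log-growth: p*log(1 + f) + (1 - p)*log(1 - f)."""
    if f >= 1:
        return -math.inf  # a single loss wipes out the bankroll
    return p * math.log(1 + f) + (1 - p) * math.log(1 - f)

# Candidate bet fractions 0.00, 0.01, ..., 1.00 (an arbitrary grid).
fractions = [i / 100 for i in range(101)]
best_ensemble = max(fractions, key=ensemble_growth)
best_time = max(fractions, key=time_growth)

if __name__ == "__main__":
    print("ensemble-average maximizer:", best_ensemble)
    print("time-average maximizer:", best_time)
```

On this grid the ensemble-average objective picks the all-in bet, while the time-average objective picks the Kelly fraction 2p - 1 = 0.2 and assigns the all-in bet a growth rate of minus infinity. That is the sense in which a one-step agent maximizing the time average would avoid wipeout.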