Temporal Scaling Laws
How do agents learn with horizon length, or time of tasks?
Another missing scaling law is scaling with episode length, in steps or in time.
This has not been examined at all AFAIK (aside from the somewhat-related Transformer language model analyses looking at how well they make use of the context window—very poorly towards the end).
The RL-related scaling work tends to assume a fixed problem with a ‘length’ either hardwired by the environment (Ms. Pacman, most ALE environments) or driven by the other player’s strength (Hex, Go, chess), and asks how performance scales with number of episodes.
This doesn’t answer the question of how hard it is to learn longer versions of those games, which is either constant in all datapoints or is conflated with other things (a stronger opponent may lead to longer games, and make learning more expensive to maintain the power/log-curve, but is that due to the longer game or other aspects of a stronger opponent?).
The simple answer is that the longer the horizon, the harder credit assignment and the RL problem must be, because there are more states and possibilities spread over the same reward, and bounds on agents like a tabular learning agent often include some nasty term like T^2 or L^2 (to pick the first one I found in my notes, O’Donoghue 2018’s regret bound has an L^{3/2} with only a √T for total sample-size/elapsed-timesteps).
So we might naively argue that longer tasks are strictly harder than shorter tasks, and any scaling on a task implicitly bounds the longer versions too and any ‘broader’ task.
This would have a lot of implications for AI capabilities and safety: we could focus on measuring short-term task performance, and we would know that any merely human-level AGI (as measured on short-term tasks) would ‘fall apart’ over long horizons, its long-term performance being much worse as its errors accumulated and it flailed around, and that whatever level of long-term performance is necessary to have major global impact would be preceded by much better short-term performance as a warning sign.
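To put the naive version in numbers, here is a back-of-the-envelope sketch in Python. It takes a generic tabular bound of the shape Regret(T) ≈ c·L^{3/2}·√(S·A·T) at face value (a stand-in with the right shape, not any particular paper’s exact bound; the values of S, A, c, and the target regret below are arbitrary) and asks how many episodes are needed before average per-episode regret drops below a threshold:

```python
# Back-of-the-envelope only: assume a generic tabular regret bound of the shape
#   Regret(T) ~ c * L**1.5 * sqrt(S * A * T),  with T = K * L total timesteps,
# and ask how the episode count K needed to reach a fixed per-episode regret
# grows with the horizon L. (Not any specific paper's bound; purely illustrative.)
import math

def episodes_needed(L, S=10, A=4, c=1.0, target_per_episode_regret=1.0):
    """Smallest (power-of-2) K such that average per-episode regret <= target."""
    K = 1
    while True:
        T = K * L
        per_episode_regret = c * L**1.5 * math.sqrt(S * A * T) / K
        if per_episode_regret <= target_per_episode_regret:
            return K
        K *= 2  # geometric search is plenty for a rough scaling estimate

for L in (10, 100, 1000):
    print(L, episodes_needed(L))
# Under this reading, K grows roughly like L^4: the 'longer horizons are
# brutally more expensive to learn' intuition in quantitative form.
```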
But ‘regret’ is not necessarily the right quantity: regret is basically the loss of your strategy compared to that of the ideal strategy, and says little about how well the ideal strategy copes with different timescales. Since the target regret is 0 (you match the ideal strategy), you can only do worse with longer durations since those are more time-steps on which to fall short of the ideal strategy and incur regret; you can never make up your past regrets, only cry about them over wine. As far as capabilities go, if even the ideal strategy scales poorly with increasing temporal durations, losing lots of reward, then obviously learning it won’t do any better.
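For concreteness, the usual episodic formalization (nothing specific to any one paper) makes this immediate: every term in the sum is nonnegative by definition of the optimal policy, so regret can only accumulate as episodes and time-steps pile up.

```latex
% Cumulative regret after K episodes of horizon L, starting from s_1:
% each summand is >= 0, so Regret(K) is monotonically non-decreasing.
\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K} \left( V^{\pi^{*}}_{1}(s_{1}) \;-\; V^{\pi_{k}}_{1}(s_{1}) \right)
```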
And we can immediately think of reasons why reward might not worsen monotonically with increasing duration. Moravec’s paradox immediately comes to mind with its inversions where the broader task can be easier than a subtask: a chess AI can kick our ass at planning out scores of plies deep to the end of the game (and endgame databases are omniscient), but while it can foresee your inevitable checkmate an hour from now, a robotics AI would struggle to move the pawn reliably within 1s without knocking other pieces over.
We can also easily imagine an ‘AI CEO’ which is an overgrown language model receiving and sending emails all day, skillfully imitating the rhetoric and strategy of great CEOs of the past and maneuvering its company through the shoals of years to come, even as it would be unable to fetch coffee for a visitor in the next minute or update a spreadsheet—something it must delegate to its personal assistant or its MBA intern.
(Many people already believe CEOs often do little to justify their salary and could be replaced by something rather simpler than language models...)
This is not necessarily because ‘being a CEO’ is intrinsically easier, in some sense, than ‘fetching coffee’, but may simply reflect that the data is much more abundant, and much more compact and preprocessed in linguistic form, for one task and not the other, and so in practice (but not theory) we got good AIs for one before the other.
(There are probably billions of words dissecting CEO decisions, running case studies on past CEOs, CEOs giving speeches or investor calls etc, while any words used to describe fetching coffee omit the important low-level physics because we just don’t know what our bodies do or compute when they fetch coffee.)
It may also reflect the illusory nature of the ‘state’ or ‘actions’. Is ‘being a CEO’ an extremely high-dimensional problem with billions of angles and variables to consider, all important, ticking along millisecond by millisecond, or does it boil down to some more abstract problem much more analogous to Rock-Paper-Scissors played once a year (except that if you lose a throw, you lose a billion dollars)?
There’s also an observation worth making that sometimes long-term tasks can be easier than short-term versions of the same task because they offer more scope for control, and correction and recovery from errors.
Consider games like Go or chess: a chess grandmaster (over-the-board or correspondence) might lose a bullet or blitz game that they would win at a slower time control, and a Go master might be able to anaconda-style squeeze you to death given unlimited moves, whereas if there were an arbitrary time limit like 70 moves, they would have to engage in tactically reckless invasions that can backfire on them with huge loss of territory; these are games where the ‘long’ form would be much easier to maximize reward on than the short form, because there is more time to think and to accumulate small advantages.
Or consider inner monologue in a GPT-like model for solving math or programming problems: an inner-monologue solution will be much longer than the default model behavior of ‘just print out the (wrong) answer’, so if there really were a strict temporal decrease in capability, one would have proven a priori that the inner-monologue solutions must be worse. This is not the case, however: the longer show-your-work/think-aloud completions can be much better than the shortest completion, in part because they give the model scope for breaking hard problems down into sub-problems, or even going back and correcting itself in some examples.
Other monologues can take the form of guess-and-check, and of course there are instances like AlphaCode brute-forcing solutions by generating lots of candidate solutions to test; all of these use up more time, and involve more temporal steps in some sense, and get you capability gains.
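As a minimal sketch of that generate-and-test pattern (the `sample_candidate` sampler and `passes_tests` harness below are hypothetical stand-ins, not AlphaCode’s actual pipeline):

```python
# Generate-and-test: spend more 'temporal steps' sampling candidate solutions
# and keep one that passes the available checks, instead of betting everything
# on a single shortest completion.
from typing import Callable, Optional

def solve_by_sampling(sample_candidate: Callable[[], str],
                      passes_tests: Callable[[str], bool],
                      budget: int = 1000) -> Optional[str]:
    for _ in range(budget):
        candidate = sample_candidate()   # one more step of time/compute spent
        if passes_tests(candidate):      # check-and-correct rather than one-shot
            return candidate
    return None                          # budget exhausted without a solution
```

If each sample succeeds with probability p, the chance of at least one success rises to 1 − (1 − p)^budget, so the extra temporal steps buy capability directly.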
More broadly, an AI CEO can solve its long-term problems like ‘hiring’ even if it is not good at solving every individual short-term instance of ‘should I hire this candidate or not’, because it can observe a failed hire and fire them later on, and hire a new better candidate to fix the errors.
Or it could just spend more upfront to make hires more likely to work out, paying for it compared to a human CEO out of all its manifold AI advantages (like working for potentially as little as the cost of its electricity and never needing to sleep).
If an expert chess player is plopped down in a random game position which doesn’t suit his style, he may not do very well (he is ‘off-policy’, one might say); if he can play from the start, then he can take his favorite move and bring to bear his full skill while ‘on-policy’.
Or a race-car driver might not have the mechanical reflexes to recover from driving straight into the wall at 200kph, but he doesn’t need to because he just doesn’t do that in the first place.
There are many forking paths indeed.
(AIs not being good enough to control errors and them compounding to send an episode into unobserved states is a major failing of imitation-learning approaches.)
And then there’s the exploration problem: how much of the difficulty of long-term tasks for AIs is not that they are ‘intrinsically’ hard, but that they are hard to explore, and we are pretty bad at exploration right now?
A classic RL toy problem is the n-chain, where one must hop along a line to a state n steps away before getting any reward; for the standard ε-greedy exploration approach, observing even a single reward requires randomly moving toward the reward n times in a row, each exploratory step happening with probability on the order of ε, so the first reward takes a number of episodes exponential in n; but given offline data from an expert, an agent could probably solve it without a single episode, because the chain is not at all hard to model or plan over.
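Here is a minimal simulation of that failure mode, with a uniformly random policy standing in for an untrained ε-greedy agent:

```python
# Toy n-chain: start at state 0; reward 1 only upon reaching state n.
# An untrained agent must move 'right' n times in a row within one episode,
# which a random policy does with probability (1/2)**n here, so the very first
# reward signal takes exponentially many episodes -- even though the task is
# trivial to model or plan over given expert data.
import random

def run_episode(n, policy=lambda state: random.choice([-1, +1])):
    state = 0
    for _ in range(n):                    # episode length = n steps
        state = max(0, state + policy(state))
        if state == n:
            return 1.0                    # reached the end of the chain
    return 0.0

def episodes_until_first_reward(n, max_episodes=10**6):
    for k in range(1, max_episodes + 1):
        if run_episode(n) > 0:
            return k
    return max_episodes

print([episodes_until_first_reward(n) for n in (5, 10, 15)])
```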
Finally, we may just be surprised by how agent competencies evolve regardless of how true any of these considerations are. Consider OpenAI Five (or AlphaStar). Who would have expected that a simple LSTM RNN, whose time horizon for training i something like 1 minute, could learn what looks like remarkable planning and foresight? (Such as learning the Roshan gambit, which requires the entire team to fight the most powerful monster in the game, for a while, taking heavy damage while neglecting the enemy team & wild monsters, to get a single item, an 1UP, which will be used once on one team member maybe 20 or 30 minutes later in the middle of an all-out fight and only then pay back the early investment in fighting Roshan.)
Far from temporal length being a tight constraint in which one can place great forecasting faith, or a useful safety mechanism, I’m unsure there’s even a meaningful correlation between how “long-term” a task is and how badly we can expect AIs to do. So, it’s not an easy question to answer even qualitatively. (And obviously, the possibility of very different capabilities at different time-scales is important to characterize for AI safety for precisely the same reasons that showing that short-term capabilities >> long-term capabilities would be important.)
Quantitatively, we’re not sure where to begin. First, what variables should our ‘temporal scaling laws’ even be expressed in? (We have average reward, episode length, agent horizon length, number of episodes, number of steps per episode, compute, parameter count, agent family of model-free (value vs policy?) vs model-based… to name just the obvious variables.) Do we expect the exponents to be the same and simply offset by a constant (which seems to be how most of the scaling laws wind up looking: different arch? different constant. added noise? additional constant. cross-modality transfer? lower constant, etc.)?
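One purely speculative way to frame that question, just to fix notation (R_max(T) is the known optimal reward at horizon T, N the number of episodes, and a, α fit constants; which form the data prefers is exactly what the experiments would have to tell us):

```latex
% (1) horizon shifts only the prefactor of the usual power law in episodes:
R_{\max}(T) - R(N, T) \;=\; a(T)\, N^{-\alpha}
% (2) horizon changes the exponent itself:
R_{\max}(T) - R(N, T) \;=\; a\, N^{-\alpha(T)}
```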
Setting up an empirical testbed is also hard. You want an environment which cleanly varies the timescale of its dynamics, stressing long-term credit assignment while changing nothing else.
Most ALE games are obviously unsuitable. An arcade game like Ms. Pacman or Breakout typically has ‘resets’ which make it equivalent to a lot of short games. (If you rewrote Tic-Tac-Toe as playing 1,000 games in a row and only then receiving a reward in the range 0--1,000, I would not expect this to be much harder than a single game of Tic-Tac-Toe with 0/1 reward.) Feedforward agents with only some frame-stacking for ‘history’ do well on them.
Most of the MuJoCo environments like walking robots have a similar problem: little or no long-term dynamics, and certainly nothing with a convenient T knob on it. Similarly, feedforward works well, and lots of history adds little or is even a drawback.
DMLab-30 I am less familiar with, but it seems similar (stuff like finding an apple in a maze).
2-player games like Go or chess typically do not have any natural time limits: with chess, counting material after n moves is probably stupid (up to t=30 moves, all that agents would learn is to huddle in a fortress and achieve a 0--0 material tie), although with Go, counting territory after n moves might work and would be easy to get started with, given all the existing Go agents. (Go agents like KataGo can be trained with very little compute at this point, particularly with curriculum learning on board size—although that too is an extraneous factor which must be carefully considered.)
Hidden-information 2-player games like poker might work, but the adversarial part obscures the use of history. Poker agents live or die based on how well they adapt to the immediate Nash equilibrium with the current opponent, not how well they reason about the deck based on cards revealed thus far.
Nethack: some FB researchers have been pushing Nethack as a testbed, and it has some potential here. DRL agents do very badly on it, so there is a lot of headroom.
Nethack has many levels, which are not independent; levels are only partially observable and have a lot of persistent state (which does not reset), rewarding long-term memory. Nethack players typically need to cache items on levels they have cleared and occasionally revisit to stock up; to ‘ascend’ and formally beat the game requires a lot of planning to kill the relevant enemies spread throughout the game and obtain the magical items which give one access to the final level. There is also the vital matter of building a strong character with appropriate curses/blessings/powers like being able to see in the dark or having complementary weapons/armor, and different character archetypes have different strength profiles over the game (in typical D&D fashion, wizards are weak and warriors strong early on, and then wizards gradually become OP while warriors stagnate). So there is a sense in which descent through the dungeon = progress, and longer games = better players, and actions early on can gimp one much later.
But Nethack has drawbacks: it is unnecessarily complex and more of an exploration problem than a temporal-reasoning problem (how is a DRL agent supposed to ever learn about tricks like ‘writing the word “Elbereth” on the ground’? even humans rarely discover that one by themselves!), especially when we look at the hand-scripted ‘symbolic’ agents which do things like plug in Sokoban solvers solely for that mini-game; the ‘score’ appears to be a poor proxy for solving the game and leads agents to greedily grind the early levels; it is probably more compute-expensive than a streamlined roguelike designed for GPU/TPU acceleration would be; its tooling/UI is rather complicated (it was never intended as a research tool); and it has no clear T knob. The MiniHack games might be better than full Nethack.
Minecraft is a hard testbed for long-term planning, hierarchical actions, and imitation learning, but most of the hierarchies are relatively shallow and don’t come with any built-in way to turn a T knob; perhaps one could bolt on a simple generative system to generate random DAGs and associate some voxel art with them, creating extended versions of Minecraft.
This would be similar to Alchemy. Alchemy, as presented, seems to fix the hardness of the problem in terms of potions/stones, but if one varied that and increased the number of permitted trials to keep each instance soluble, perhaps that would be equivalent? Or would the inferential problem overwhelm any effect from reasoning over the history of experiments (similar to poker)?
None of these jump out.
In supervised learning, the obvious thing to do would be something like setting up randomized n-grams as a Markov-chain model to generate synthetic data and examining how predictive performance degrades as a function of n.
(“Learning Curve Theory”, Hutter 2021, does something a bit like this, I think.)
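A minimal sketch of that synthetic setup (random order-n transition tables over a small alphabet; nothing here is specific to Hutter’s construction):

```python
# Sample a random order-n Markov source over a small alphabet and generate
# synthetic training sequences from it; sweep n and compare how well models of
# a fixed size/compute budget predict the next symbol.
import random
from collections import defaultdict

def random_markov_source(n, vocab=8, seed=0):
    rng = random.Random(seed)
    # Each previously-unseen n-gram context gets its own random transition row.
    table = defaultdict(lambda: [rng.random() for _ in range(vocab)])
    def sample_sequence(length):
        seq = [rng.randrange(vocab) for _ in range(n)]   # random initial context
        for _ in range(length):
            weights = table[tuple(seq[-n:])]             # row for the current n-gram
            seq.append(rng.choices(range(vocab), weights=weights)[0])
        return seq
    return sample_sequence

gen = random_markov_source(n=3)
corpus = gen(10_000)   # training data; repeat for n = 1, 2, 4, 8, ... and compare losses
```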
In an RL setting, could we set up an n-chain or Block World environment with simple programs explicitly allocating reward over time? Such as a reward function expressed as a big matrix of Time vs. States, or possibly a (very) small randomly-initialized NN (the way randomized NNs are used as ground-truth generative models in analyses like Rosenfeld 2021 to measure the performance of NN classifiers)?
For example, perhaps the shortest reward-function program gives 1 reward for moving left (or right) at the current time-step; the next length does so only if one moves left twice over 2 time-steps; the next length might require 2 lefts then 1 right; and so on.
(In a Block World environment, the various walls and blocks would obstruct movement, requiring some degree of planning and environment manipulation. Or perhaps these can be plugged into MiniHack?)
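A minimal instantiation of the simplest version of that idea (a sparse random Time × State reward table over an n-chain; a tiny randomly-initialized NN over (t, state) would be the drop-in replacement):

```python
# Simplest 'reward program': a Time x State matrix which is zero everywhere
# except one randomly chosen (t, state) entry. The delay before that entry is
# the intrinsic horizon of the task, and is the knob to sweep.
import numpy as np

def make_reward_program(T, n_states, seed=0):
    rng = np.random.default_rng(seed)
    R = np.zeros((T, n_states))
    R[rng.integers(T), rng.integers(n_states)] = 1.0   # single delayed reward
    return R

def episode_return(R, actions):
    """Walk the chain under a sequence of moves in {-1, +1} and sum the reward."""
    T, n_states = R.shape
    state, total = 0, 0.0
    for t, a in zip(range(T), actions):
        state = min(max(state + a, 0), n_states - 1)
        total += R[t, state]
    return total
```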
For temporal scaling, one can then use the intrinsic time length of these reward-function programs as the T knob to tweak: run hyperparameter sweeps to find agents which hit the known ceiling as fast as possible (we can easily figure out the optimal sequence for each reward-function, and estimate optimal reward as just doing it in cycles until the end of the episode), average across a sample of randomized rewards & Block Worlds for each T (increasing T by an OOM or so each time, since we expect some sort of log/power-law), and derive a relationship of ‘Time’ → ‘# of episodes (for an optimally-scaled agent)’.
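The final analysis step is then just a log-log fit of episodes-to-ceiling against T; a sketch (the sweep numbers below are made up purely to show the shape of the fit):

```python
# Fit episodes ~ a * T**b from sweep results (hypothetical numbers below,
# purely to illustrate the fitting step).
import numpy as np

def fit_power_law(T_values, episodes):
    log_T, log_E = np.log(T_values), np.log(episodes)
    b, log_a = np.polyfit(log_T, log_E, 1)   # slope = exponent, intercept = log prefactor
    return np.exp(log_a), b

a, b = fit_power_law([10, 100, 1000], [2e3, 9e4, 4e6])
print(f"episodes ≈ {a:.1f} * T^{b:.2f}")
```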
To tie it back to the diversity proposal: in this case, our little TMs are sampled over the time dimension and are ‘deep’, whereas before they were TMs over the other dimensions of the environment, space, and were ‘wide’. In the time-case, we would leave the environment alone as much as possible and pack all the added difficulty into the time dimension/complexity of action sequences.
This is a lot of runs, but it would all fit on-GPU and can be done with very small FC NNs conditioned on the history, so, like Jones’s Hex work, it should be quite feasible.
You might find this work interesting, which takes some small steps in this direction. It studies the effect of horizon length inasmuch as it makes credit assignment harder, showing that the number of samples required is an affine function of horizon length in a toy context.