I agree with a lot of the points you bring up, and ultimately am very uncertain about what we will see in practice.
One point I didn’t see you address is that in longer-term planning (e.g. CEO-bot), one of the key features is dealing with an increased likelihood and magnitude of encountering tail risk events, just because there is a longer window within which they may occur (e.g. recessions, market shifts up or down the value chain, your delegated sub-bots Goodharting in unanticipatable ways for a while before you detect the problem). Your success becomes a function of your ability to design a plan that is resilient to progressively larger unknown unknowns (where “use local control to account for these perturbations” can be part of such a plan).
Maybe “make resilient plans” is needed even in shorter environments; certainly it is to some extent, though given the rarity with which a short episode will fail due to such an unknown unknown, it does seem possible that agents trained on shorter-horizon problems would need to be trained extremely hard in order to generalize.
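To make that intuition concrete, here is a minimal back-of-the-envelope sketch (my own illustration with a made-up per-step rate, not anything from the original post): if rare shocks arrive roughly independently at some small per-step probability, the chance that an episode contains at least one of them grows quickly with horizon length.

```python
# Chance of hitting at least one tail event in an episode, assuming (purely for
# illustration) independent shocks at a fixed per-step probability p.
def p_any_tail_event(p_per_step: float, horizon: int) -> float:
    return 1.0 - (1.0 - p_per_step) ** horizon

p = 1e-4  # hypothetical per-step tail-event probability
for horizon in (10, 1_000, 100_000):
    print(f"horizon={horizon:>7,}: P(at least one tail event) = {p_any_tail_event(p, horizon):.4f}")
```

At these made-up numbers, a 10-step episode almost never contains a shock (~0.1%), a 1,000-step episode does about 10% of the time, and a 100,000-step episode essentially always does, which is the sense in which “plan around unknown unknowns” barely shows up as a training signal at short horizons.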
Though I agree with the original statement that I wish we did not need to lean on this thin reed.
I’m not sure if that matters. By definition, it probably won’t happen, and so any kind of argument or defense based on tail-risks-crippling-AIs-but-not-humans will also, by definition, usually fail (unless the tail risk can be manufactured on demand and there’s somehow no better approach), and it’s unclear that’s really any worse than humans (we’re supposedly pretty bad at tail risks). Tail risks also become a convergent drive for empowerment: the easiest way to deal with tail risks is to become wealthy so quickly that they’re irrelevant, which is what an agent may be trying to do anyway. Tail stuff, drawing on vast amounts of declarative knowledge, can also be a strength of artificial intelligence compared to humans: an AI trained on a large corpus can observe and ‘remember’ tail risks in a way that individual humans never will; a stock market AI trained on centuries of data will remember Black Friday vividly in a way that I can’t. (By analogy, an ImageNet CNN is much better at recognizing dog breeds than almost any human, even if that human still has superior image skills in other ways. Preparing for a Black Friday crash may be more analogous to knowing every kind of terrier than to being able to few-shot a new kind of terrier.)
These are all fair points. I originally thought this discussion was about the likelihood of poor near-term RL generalization when varying horizon length (i.e. affecting timelines) rather than about what type of human-level RL agent will FOOM (i.e. takeoff speeds). Rereading the original post, I see I was mistaken, and I see how my phrasing left that ambiguous. If we’re at the point where the agent is capable of using forecasting techniques to synthesize historical events described in internet text into probabilities, then we’re well past the point where I think “horizon length” might really matter for RL scaling laws. In general, you can find and replace my mentions of “tail risk” with “events whose frequency in the training distribution is too low for the agent to be well-calibrated about them.”
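To put rough numbers on “too low a frequency to be well-calibrated” (again, my own sketch with made-up figures, not something from the thread): at a fixed training budget, the agent’s empirical estimate of a rare event’s rate, and anything conditioned on that event, rests on very few samples, so its relative error blows up as the event gets rarer.

```python
import math

# Relative standard error of an empirical frequency estimate from n i.i.d. steps,
# using the binomial approximation SE(p_hat)/p = sqrt((1 - p) / (n * p)).
def relative_se(p: float, n_steps: int) -> float:
    return math.sqrt((1.0 - p) / (n_steps * p))

n_steps = 1_000_000  # hypothetical training budget in environment steps
for p in (1e-1, 1e-3, 1e-5):
    print(f"event rate {p:.0e}: ~{n_steps * p:,.0f} occurrences in training, "
          f"relative error of the estimated rate ≈ {relative_se(p, n_steps):.1%}")
```

A 1-in-100,000 event shows up about ten times in a million steps, so whatever the agent has learned about responding to it is fit to roughly ten examples.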
I think it’s worth noting that some important agenty decisions are like this! Military history is all about people who studied everything that came before them but were operating in such a high-dimensional and adversarial context that they still got it wrong in new ways every time.
To address your actual comment: I definitely don’t think humans are good at tail risks. (E.g. there are very few people with successful track records across multiple paradigm shifts.) I would expect a reasonably good AGI to do better, for the reasons you describe. That said, I do think that FOOM involves taking on more weird unknown unknowns than average. (There aren’t great reference classes for inner-aligning your successor given that humans failed to align you.) Maybe not that many! Maybe there is a robust, characterizable path to FOOM where everything you need to encounter has a well-documented reference class. I’m not sure.