Right, I specifically think that someone would be best served by trying to think of ways to get a SOTA result on an Atari benchmark, not simply reading up on past results (although you’d want to do that as part of your attempt). There’s a huge difference between reading about what’s worked in the past and trying to think of new things that could work and then trying them out to see if they do.
As I’ve learned more about deep learning and tried to understand the material, I’ve constantly had ideas that I think could improve things. Then I’ve tried them out, and usually learned that they didn’t, or they did but they’d already been done, or that it was more complicated than that, etc. But I learned a ton in the process. On the other hand, suppose I were wary of doing AI capability work, and each time I had one of these ideas I shied away from it out of fear of advancing AGI timelines. The result would be threefold: I’d have a much worse understanding of AI, I’d be a lot more concerned about imminent AGI (after all, I had tons of ideas for how things could be done better!), and I wouldn’t have actually delayed AGI timelines at all.
I think a lot of people who get into AI from the alignment side are in danger of falling into this trap. As an example, in an ACX thread I saw someone thinking about doing their PhD in ML, and they were concerned that they might have to do capability research in order to get their PhD. Someone replied that if they had to they should at least try to make sure it is nothing particularly important, in order to avoid advancing AGI timelines. I don’t think this is a good idea. Spending years working on research while actively holding yourself back from really thinking deeply about AI will harm your development significantly, and early in your career is right when you benefit the most from developing your understanding and are least likely to actually move up AGI timelines.
Suppose we have a current expected AGI arrival date of 20XX. This is the result of DeepMind, Google Brain, OpenAI, FAIR, Nvidia, universities all over the world, the Chinese government, and more all developing the state of the art. On top of that there’s computational progress happening at the same time, which may well turn out to be a major bottleneck. How much would OpenAI removing themselves from this race affect the date? A small but real amount. How about a bright PhD candidate removing themselves from this race? About zero. I don’t think people properly internalize both how insignificant the timeline difference is and how big the skill gains are from actually trying your hardest at something as opposed to handicapping yourself. And if you come up with something you’re genuinely worried about, you can just not publish.
I do agree that people should try their ideas out, even if the ideas are “capabilities” flavored. However, I do think (if you buy the serial vs parallel distinction in the OP) that you should try to not do capabilities research.
Right, I specifically think that someone would be best served by trying to think of ways to get a SOTA result on an Atari benchmark, not simply reading up on past results (although you’d want to do that as part of your attempt). There’s a huge difference between reading about what’s worked in the past and trying to think of new things that could work and then trying them out to see if they do.
As you say, most ML ideas people come up with at first are pretty doomed to failure, and the main way people learn is via experience. This is in part due to the overconfidence of newbies in any field, but also in part due to how counterintuitive many ML results are to most people. [1]
The key thing people should know is, if you stumble on an actual capabilities insight… you can just… not publish it or talk about it. I think I’d emphasize this point over the other points. Do the research most helpful for learning, and then in the unlikely event it ends up being impressive capabilities work, you can always just put it into your filing cabinet and walk away. [2]
As for the ML PhD example:
Someone replied that if they had to they should at least try to make sure it is nothing particularly important, in order to avoid advancing AGI timelines. I don’t think this is a good idea. Spending years working on research while actively holding yourself back from really thinking deeply about AI [..]
I think you can think very deeply about AIs without going out and working on critical-path capabilities work! You should think very deeply about AIs in general, if you’re working in the field, regardless of what you’re doing! But if you’re in a job where you have to publish to advance (and assuming you buy the assumptions in the OP), it seems pretty bad to actively seek out and work on critical-path capabilities work, as opposed to skill-building work or safety work.
Finally, while I agree with your overall takeaway, I strongly disagree with this style of argument:
How about a bright PhD candidate removing themselves from this race? About zero.
Because the expected effect of most people on moving anything significant forward by a couple of months is probably going to be zero, including solutions to alignment; there are just a lot of people out there and the problems people want to work on are really hard. What matters isn’t whether we can measure the impact of single people in terms of full percentage points of various outcomes or full weeks of time, but whether the expected gains are larger from doing the PhD vs not doing so, or what kind of PhD maximizes the expected gains or minimizes the expected harms.
As the post says:
And, again, small individual “don’t burn the timeline” actions all contribute to incrementally increasing the time humanity has to get its act together and figure this stuff out.
Even if your net impact is small, you can still choose exactly how small it is and in what direction it points.
That being said, I think you’re painting a false dichotomy between trying really hard to “get a SOTA on an Atari benchmark” and “simply reading up on past results”. For example, you could also gain experience by reimplementing existing results, exploring their robustness, etc.
As a side note, I don’t think most ways of getting SOTA on Atari benchmarks are particularly relevant to cutting-edge capabilities work nor what I’d recommend people spend a lot of their time on. It’s possible we’re imagining completely different things here. That being said, this is not a crux for my belief that people should lean more toward trying things out.
Similarly, I also disagree with this take:
As an example, in an ACX thread I saw someone thinking about doing their PhD in ML, and they were concerned that they might have to do capability research in order to get their PhD.
I think that the majority of ML PhDs actually are not meaningfully contributing to capabilities, because they’re just not working on things likely to be relevant after a few more improvements on general capabilities. See for example a lot of the pre-GPT NLP work finetuning small neural networks harder on specific tasks. I’d also bet that a lot of video understanding work in the past 5 years will also be obsoleted when we get better video/multi-modal foundational models.
I’m very unconfident in the following but, to sketch my intuition:
I don’t really agree with the idea of serial alignment progress that is independent from capability progress. This is what I was trying to get at with:
“AI capabilities” and “AI alignment” are highly related to each other, and “AI capabilities” has to come first in that alignment assumes that there is a system to align.
By analogy, nuclear fusion safety research is inextricable from nuclear fusion capability research.
When I try to think of ways to align AI my mind points towards questions like “how do we get an AI to extrapolate concepts? How will it be learning? What will its architecture be?” etc. In other words it just points towards capabilities questions. Since alignment turns on capability questions that we don’t yet have an answer to, it doesn’t surprise me when many alignment researchers seem to spin their wheels and turn to doom and gloom—that’s more or less what I had thought would happen.
As an example of the blurred lines between capability and alignment: while I think it’s useful to have specific terms for inner and outer alignment, I also think that really anyone who worked with RL in a situation where they were manually setting the reward function was aware of these ideas already on some level. “Sometimes I mess up the reward function” and “sometimes the agent isn’t optimizing properly” are both issues encountered frequently. Basically while many people in the alignment community seem to think of alignment as something that is cooked up entirely separately from capability research I tend to think that a lot of it will develop naturally as part of day-to-day AI research with no specific focus on alignment.
As a thought experiment, let’s say that about 20% of current AI capability researchers are very concerned about AI alignment and get together to decide what to do for the next five years. They’re deciding between taking the stance “Capability work is fine right now! Go for it! Worry about alignment when we’re farther along!” or “Let’s get out of capability and go into alignment instead. Capability research is dangerous and burning precious time.” What’s the impact of adopting these two positions?
The first is roughly the default position, and I’d expect that basically what we’ll see is AGI in the year 20XX and that in the run-up to this we’ll see vastly increased interest in alignment work and also a significant blurring between “alignment” and “regular AI research”, since people want their home robots to not roll over their cat. We’ll also see all major AI research orgs and the AI community as a whole take existential risk from self-improving AGI a lot more seriously once modern SOTA AI systems start looking more and more like the kind of thing that could do that. Because of this there’ll be a concerted effort to handle the situation appropriately, which has a good chance of success.
Option two involves slowing down the timeline by about 5-10%. Cutting the size of a field by 20% doesn’t slow progress that much since there are diminishing returns to adding more researchers, and on top of that AI capability research is only half of what drives progress (the other half being compute). In return for this small slowdown, the AI researchers who are now going into alignment will initially spin their wheels due to the lack of anything concrete to focus on or any concrete knowledge of what the future systems will look like. When AGI does start approaching, the remaining AI capability community will take it much less seriously due to having been selected specifically for that trait. Three years before the arrival of transformative AGI, alignment research is further along than it otherwise would have been, but AI capability researchers have gotten used to tuning alignment researchers out and there aren’t alignment-sympathetic colleagues around to say “hey, given how things are progressing I think it’s time we start taking all that AI risk stuff seriously”. Prospects are worse than option one.
So right now my intuition is that alignment will be very doable as long as it’s something that the AI community is taking seriously in the few years leading up to transformative AGI. The biggest risk seems to me to be some AI researchers at one of the leading research groups thinking “man, it sure would be cool if we could use the latest coding LLM combined with RL to make an AI that could improve itself in order to accomplish a goal” and setting it running without it ever occurring to them that this could go wrong. Given this, the suggestion that everyone concerned about alignment basically cedes the whole field of AI research (outside of this specific community, “AI capability research” is just called “AI research”) to people who aren’t worried about it seems like a bad idea.
Yeah, that might be a big idea. If you’re right that AI capabilities work and AI Alignment work are the same thing, the problem is solved by definition. So if I’m getting at things correctly, capabilities and safety are highly correlated, and there can’t be situations where capabilities and alignment decouple.
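The 5-10% figure can be sanity-checked with a toy model. This is only a sketch under assumptions that are not in the comment itself: research and compute contribute additively and equally to the progress rate, compute is unchanged, and research output scales as headcount raised to some exponent (1.0 for no diminishing returns, lower for stronger diminishing returns). The function name and the exponent values are hypothetical.

```python
def relative_progress_rate(researcher_fraction, returns_exponent, research_share=0.5):
    """Progress rate relative to a baseline of 1.0 under a toy additive model.

    Assumptions (illustrative, not from the original comment):
    - research and compute contribute additively to the progress rate,
    - compute's contribution is unchanged,
    - research output scales as headcount ** returns_exponent.
    """
    research_term = research_share * researcher_fraction ** returns_exponent
    compute_term = 1.0 - research_share
    return research_term + compute_term


# 20% of researchers leave, so 80% remain.
for exponent in (1.0, 0.5):
    rate = relative_progress_rate(0.8, exponent)
    print(f"exponent={exponent}: rate={rate:.3f}, slowdown={1 - rate:.1%}")
```

With linear returns the rate drops by 10%, and with a square-root exponent it drops by about 5.3%, which brackets the 5-10% range quoted above. A different functional form (for example, multiplicative rather than additive) would give different numbers.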
So if I’m getting at things correctly, capabilities and safety are highly correlated, and there can’t be situations where capabilities and alignment decouple.
Not that far, more like it doesn’t decouple until more progress has been made. Pure alignment is an advanced subtopic of AI research that requires more progress to have been made before it’s a viable field.
I’m not super confident in the above and wouldn’t discourage people from doing alignment work now (plus the obvious nuance that it’s not one big lump, there are some things that can be done later and some that can be done earlier) but the idea of alignment work that requires a whole bunch of work in serial, independent of AI capability work, doesn’t seem plausible to me. From Nate Soares’ post:
The most blatant case of alignment work that seems serial to me is work that requires having a theoretical understanding of minds/optimization/whatever, or work that requires having just the right concepts for thinking about minds.
This is the kind of thing that seems inextricably bound up with capability work to me. My impression is that MIRI tends to think that whatever route we take to get to AGI, as it moves from subhuman to human-level intelligence it will transform to be like the minds that they theorize about (and they think this will happen before it goes foom) no matter how different it was when it started. So even if they don’t know what a state of the art RL agent will look like five years from now, they feel confident they can theorize about what it will look like ten years from now. Whereas my view is that if you can’t get the former right you won’t get the latter right either.
To the extent that intelligences will converge towards a certain optimal way of thinking as they get smarter, being able to predict what that looks like will involve a lot of capability work (“Hmm, maybe it will learn like this; let’s code up an agent that learns that way and see how it does”). If you’re not grounding your work in concrete experiments you will end up with mistakes in your view of what an optimal agent looks like and no way to fix them.
A big part of my view is that we seem to still be a long way from AGI. This hinges on how “real” the intelligence behind LLMs is. If we have to take the RL route then we are a long way away. I wrote a piece on this, “What Happened to AIs Learning Games from Pixels?”, which points out how slow the progress has been and covers the areas where the field is stuck. On the other hand, if we can get most of the way to AGI just with massive self-supervised training, then it starts seeming more likely that we’ll walk into AGI without having a good understanding of what’s going on. I think that the failure of VPT for Minecraft compared to GPT for language, and the difficulty LLMs have with extrapolation and innovation, mean that self-supervised learning won’t be enough without more insight. I’ll be paying close attention to how GPT-4 and other LLMs do over the next few years to see if they’re making progress faster than I thought, but I talked to ChatGPT and it was way worse than I thought it’d be.
I like your comments, 307th, and your linked post on RL SotA. I don’t agree with everything you say, but I think some of what you say is quite on point. In particular I agree that ‘RL is currently being rather unimpressive in achieving complicated goals in complex wide-possible-action-space simulation worlds’. I agree that some fundamental breakthroughs are needed to change this, not just scaling existing methods. I disagree that such breakthroughs will necessarily require many calendar years of research. I think the eyes of the big research labs will probably soon turn to focus more fully on tackling complex-world RL, and that it won’t be long at all before significant breakthroughs start being made.
I think rather than thinking about research progress in terms of years, or even ‘researcher hours’, it’s more helpful to think of progress in terms of ‘research points’ devoted to the specific topic. An hour of a highly effective researcher at a well-funded lab, with a well-set-up research environment that makes new experiments easy to run, is worth vastly more ‘research points’ towards a topic than an hour of a compute-limited grad student without polished experiment-running code patterns, without access to huge compute resources, and without much experience running large experiments over many variables.
Thanks for making things clearer! I’ll have to think about this one—some very interesting points from a side I had perhaps unfairly dismissed before.