How my views on AI(S) have changed over the last 5.5 years
This was directly inspired by (read: shamelessly copied from) Erik Jenner’s “How my views on AI have changed over the last 1.5 years”. I think my views when I started my PhD in Fall 2018 look a lot worse than Erik’s when he started his PhD, though in large part due to starting my PhD in 2018 and not 2022.
Apologies for the disorganized bullet points. If I had more time I would’ve written a shorter shortform.
AI Capabilities/Development
Summary: I used to believe in a 2018-era MIRI worldview for AGI, and I have since updated toward slower take-off, fewer required insights, and shorter timelines.
In Fall 2018, my model of how AGI might happen was substantially influenced by AlphaGo/Zero, both of which feature explicit internal search. I expected future AIs to also feature explicit internal search over world models, and to be trained mainly via reinforcement learning or IDA (iterated distillation and amplification). I became more uncertain after OpenAI 5 (~May 2018), which used no clever techniques and just featured backpropagation through time (BPTT) being run on large LSTMs.
That being said, I did not believe in the scaling hypothesis—that is, that simply training larger models on more inputs would continually improve performance until we see “intelligent behavior”—until GPT-2 (2019), despite encountering it significantly earlier (e.g. with OpenAI 5, or speaking to OAI people).
In particular, I believed that we needed many “key insights” about intelligence before we could make AGI. This both gave me longer timelines and made me believe more strongly in fast take-off.
I used to believe pretty strongly in MIRI-style fast take-off (e.g. I would’ve assigned <30% credence to seeing a 4-year period in which the economy doubles), as opposed to (what was called at the time) Paul-style slow take-off. Given the way the world has turned out, I have updated substantially. While I don’t think that AI development will be particularly smooth, I do expect it to be somewhat incremental, and I also expect earlier AIs to provide significantly more value even before truly transformative AI.
-- Some beliefs about AI Scaling Labs that I’m redacting on LW --
My timelines are significantly shorter—I would’ve probably said median 2050-60 in 2018, but now I think we will probably reach human-level AI by 2035.
AI X-Risk
Summary: I have become more optimistic that we will avoid AI x-risk, but my understanding of the risks has become more nuanced.
My P(Doom) has substantially decreased, especially P(Doom) attributable to an AI directly killing all of humanity. This is somewhat due to having more faith that many people will be reasonable (in 2018, there were maybe ~20 FTE AIS researchers, now there are probably something like 300-1000 depending on how you count), somewhat due to believing that governance efforts may successfully slow down AGI substantially, and somewhat due to an increased belief that “winging it”-style, “unprincipled” solutions can scale to powerful AIs.
That being said, I’m less sure about what P(Doom) means. In 2018, I imagined the main outcomes were either “unaligned AGI instantly defeats all of humanity” or “a pure post-scarcity utopia”. I now believe in a much wider variety of outcomes.
For example, I’ve become more convinced both that misuse risk is larger than I thought, and that even weirder outcomes are possible (e.g. the AI keeps humans, or their brain scans, around for trade reasons). The former is in large part because world events have somewhat contradicted my belief in fast take-off: there is now more time for powerful AIs to be misused.
I used to think that solving the technical problem of AI alignment would be both necessary and sufficient to prevent AI x-risk. I now think that we’re unlikely to “solve alignment” in a way that lets us deploy a powerful Sovereign AI (at least without AI assistance), and also that governance solutions are both helpful and required.
AI Safety Research
Summary: I’ve updated slightly downwards on the value of conceptual work and significantly upwards on the value of fast empirical feedback cycles. I’ve become more bullish on (mech) interp, automated alignment research, and behavioral capability evaluations.
In Fall 2018, I thought that inverse reinforcement learning (IRL) for ambitious value learning was one of the most important problems to work on. I no longer think so, and think that most of my work on this topic was basically useless.
In terms of more prosaic IRL problems, I very much lived in a frame of “the reward models are too dumb to understand” (a standard academic take). I didn’t think much about issues of ontology identification or (malign) partial observability.
I thought that academic ML theory had a decent chance of being useful for alignment. I think it’s basically been pretty useless over the past 5.5 years, and I no longer think the chances of it being helpful “in time” are high enough. It’s not clear how much of this is because the AIS community did not really know about the academic ML theory work, but man, the bounds turned out to be pretty vacuous, and empirical work turned out to be far more informative than pure theory work.
I still think that conceptual work is undervalued in ML, but my prototypical good conceptual work looks less like “prove really hard theorems” or “think about philosophy” and a lot more like “do lots of cheap and quick experiments/proof sketches to get grounding”.
Relatedly, I used to dismiss simple techniques for AI Alignment that try “the obvious thing”. While I don’t think these techniques will scale (or even necessarily work well on current AIs), this strategy has turned out to be significantly better in practice than I thought.
My error bars around the value of reading academic literature have shrunk significantly (in large part due to reading a lot of it). I’ve updated significantly upwards on “the academic literature will probably contain some relevant insights” and downwards on “the missing component of all of AGI safety can be found in a paper from 1983”.
I used to think that interpretability of deep neural networks was probably infeasible to achieve “in time” if not “actually impossible” (especially mechanistic interpretability). Now I’m pretty uncertain about its feasibility.
Similarly, I used to think that having AIs automate substantial amounts of alignment research was not possible. Now I think that most plans with a shot at successfully preventing AGI x-risk will feature substantial amounts of AI assistance.
I used to think that behavioral evaluations in general would be basically useless for AGIs. I now think that dangerous capability evaluations can serve as an important governance tool.
Bonus: Some Personal Updates
Summary: I’ve better identified my comparative advantages, and have a healthier way of relating to AIS research.
I used to think that my comparative advantage was clearly going to be in doing the actual technical thinking or theorem proving. In fact, I used to believe that I was unsuited for both technical writing and pushing projects over the finish line. Now I think that most of my value in the past ~2 years has come from technical writing or from helping push projects over the finish line.
I used to think that pure engineering or mathematical skill was what mattered, and felt sad that my comparative advantage seemed to be something akin to long-term memory.[1] I now see more value in having good long-term memory.
I used to be uncertain about whether academia was a good place for me to do research. Now I’m pretty confident it’s not.
Embarrassingly enough, in 2018 I used to implicitly believe quite strongly in a binary model of “you’re good enough to do research” vs “you’re not good enough to do research”. In addition, I had an implicit model that the only people “good enough” were those who never failed at any evaluation. I no longer think this is true.
I am more of a fan of trying obvious approaches or “just doing the thing”.
[1] I think, compared to the people around me, I don’t actually have that much “raw compute” or even short-term memory (e.g. I do pretty poorly on IQ tests or novel math puzzles), and am able to perform at a much higher level by pattern matching and amortizing thinking using good long-term memory (if not outsourcing it entirely by quoting other people’s writing).
I don’t mean this as a criticism—you can both be right—but this is extremely correlated to the updates made by the average Bay Area x-risk reduction-enjoyer over the past 5-10 years, to the extent that it almost could serve as a summary.
That seems correct, at least directionally, yes.
Is the reason for the redaction also private?
I don’t want to say things that have any chance of annoying METR without checking with METR comm people, and I don’t think it’s worth their time to check the things I wanted to say.