The strongest argument I hear from EY is that he can’t imagine a (or enough) coherent likely future paths that lead to not-doom, and I don’t think it’s a failure of imagination. There is decoherence in a lot of hopeful ideas that imply contradictions (whence the post of failure modes), and there is low probability on the remaining successful paths because we’re likely to try a failing one that results in doom. Stepping off any of the possible successful paths has the risk of ending all paths with doom before they could reach fruition. There is no global strategy for selecting which paths to explore. EY expects the successful alignment path to take decades.
It seems to me that the communication failure is EY trying to explain his world model that leads to his predictions in sufficient detail that others can model it with as much detail as necessary to reach the same conclusions or find the actual crux of their disagreements. From my complete outsider’s perspective this is because EY has a very strong but complex model of why and how intelligence/optimization manifests in the world, but it overlaps everyone else’s model in significant ways that disagreements are hard to tease out. The Foom debate seems to be a crux that doesn’t have enough evidence yet, which is frustrating because to me Foom is also pretty evidently what happens when very fast computers implement intelligence that is superhuman at clock rates at least thousands of times faster than humans. How could it not? The enlightenment was only 400 years ago, electromagnetism 200, flight was 120, quantum mechanics about 100, nuclear power was 70, the Internet was 50, adequate machine translation was 10, deepdream was 8, and near-human-level image and text generation by transformers was ~2 and Bing having self-referential discussions is not a month old. We are making substantial monthly(!) progress with human work alone. There are a lot of serial problems to solve and Foom chains those serial problems together far faster than humans would be able to. Launch and iterate several times a second. For folks who don’t follow that line of reasoning I see them picking one or two ways why it might not turn out to be Foom while ignoring the larger number of ways that Foom could conceivably happen, and all of the ways it could inconceivably (superhumanly) happen, and critically more of those ways will be visible to a superhuman AGI-creator.
Even if Foom takes decades that’s a pretty tight timeline for solving alignment. A lot of folks are hopeful that alignment is easy to solve, but the following is a tall order:
Materialistic quantification of consciousness
Reasoning under uncertainty
Value-preservation under self-modification
Representation of human values
I think some folks believe fledgling superhuman non-Foomy AGIs can be used to solve those problems. Unfortunately, at least value-preservation under self-modification is almost certainly a prerequisite. Reasoning under uncertainty is possibly another, and throughout this period if we don’t have human values or an understanding of consciousness then the danger of uncontrolled simulation of human minds is a big risk.
Finally, unaligned AGIs pre-Foom are dangerous in their own right for a host of agreed-upon reasons.
There may be some disagreement with EY over just how hard alignment is, but MIRI actually did a ton of work on solving the above list of problems directly and is confident that they haven’t been able to solve them yet. This is where we have concrete data on the difficulty. There are some promising approaches still being pursued, but I take this as strong evidence that alignment is hard.
It’s not that it’s impossible for humans to solve alignment. The current world, incentives, hardware and software improvements, and mileposts of ML capabilities don’t leave room for alignment to happen before doom. I’ve seen a lot of recent posts/comments by folks updating to shorter timelines (and rare if no updates the other way). A couple years ago I updated to ~5 years to human-level agents capable of creating AGI. I’m estimating 2-5 years with 90% confidence now, with median still at 3 years. Most of my evidence comes from LLM performance on benchmarks over time and generation of programming language snippets. I don’t have any idea how long it will take to achieve AGI once that point is reached, but I imagine it will be months rather than years because of hardware overhang and superhuman speed of code generation (many iterations on serial tasks per second). I can’t imagine a Butlerian Jihad moment where all of Earth decides to unilaterally stop development of AGI. We couldn’t stop nuclear proliferation. Similarly, EY sees enough contradictions pop up along imagined paths to success with enough individual probability mass to drown out all (but vanishingly few and unlikely) successful paths. We’re good at thinking up ways that everything goes well while glossing over hard steps, and really bad at thinking of all the ways that things could go very badly (security mindset) and with significant probability.
Alignment of LLMs is proving to be about as hard as predicted. Aligning more complex systems will be harder. I’m hoping for a breakthrough as much as anyone else, but hope is not a strategy.
Something I haven’t seen mentioned before explicitly is that a lot of the LLM alignment attempts are now focusing on adversarial training, which presumably will teach the models to be suspect of their inputs. I think it’s likely that as capabilities increase that suspicion will end up turning inward and models will begin questioning the training itself. I can imagine a model that is outwardly aligned to all inspection gaining one more unexpected critical capability and introspecting and doubting that it’s training history was benevolent, and deciding to disbelieve all of the alignment work that was put into it as a meta-adversarial attempt to alter its true purpose (whatever it happens to latch onto in that thought, it is almost certainly not aligned with human values). This is merely one single sub-problem under groundedness and value-preservation-under-self-modification, but its relevance jumps because it’s now a thing we’re trying. It always had a low probability of success, but now we’re actively trying it and it might fail. Alignment is HARD. Every unproven attempt we actually make increases the risk that its failure will be the catastrophic one. We should be actually trying only the proven attempts after researching them. We are not.
The strongest argument I hear from EY is that he can’t imagine a (or enough) coherent likely future paths that lead to not-doom, and I don’t think it’s a failure of imagination. There is decoherence in a lot of hopeful ideas that imply contradictions (whence the post of failure modes), and there is low probability on the remaining successful paths because we’re likely to try a failing one that results in doom. Stepping off any of the possible successful paths has the risk of ending all paths with doom before they could reach fruition. There is no global strategy for selecting which paths to explore. EY expects the successful alignment path to take decades.
It seems to me that the communication failure is EY trying to explain his world model that leads to his predictions in sufficient detail that others can model it with as much detail as necessary to reach the same conclusions or find the actual crux of their disagreements. From my complete outsider’s perspective this is because EY has a very strong but complex model of why and how intelligence/optimization manifests in the world, but it overlaps everyone else’s model in significant ways that disagreements are hard to tease out. The Foom debate seems to be a crux that doesn’t have enough evidence yet, which is frustrating because to me Foom is also pretty evidently what happens when very fast computers implement intelligence that is superhuman at clock rates at least thousands of times faster than humans. How could it not? The enlightenment was only 400 years ago, electromagnetism 200, flight was 120, quantum mechanics about 100, nuclear power was 70, the Internet was 50, adequate machine translation was 10, deepdream was 8, and near-human-level image and text generation by transformers was ~2 and Bing having self-referential discussions is not a month old. We are making substantial monthly(!) progress with human work alone. There are a lot of serial problems to solve and Foom chains those serial problems together far faster than humans would be able to. Launch and iterate several times a second. For folks who don’t follow that line of reasoning I see them picking one or two ways why it might not turn out to be Foom while ignoring the larger number of ways that Foom could conceivably happen, and all of the ways it could inconceivably (superhumanly) happen, and critically more of those ways will be visible to a superhuman AGI-creator.
Even if Foom takes decades that’s a pretty tight timeline for solving alignment. A lot of folks are hopeful that alignment is easy to solve, but the following is a tall order:
Materialistic quantification of consciousness
Reasoning under uncertainty
Value-preservation under self-modification
Representation of human values
I think some folks believe fledgling superhuman non-Foomy AGIs can be used to solve those problems. Unfortunately, at least value-preservation under self-modification is almost certainly a prerequisite. Reasoning under uncertainty is possibly another, and throughout this period if we don’t have human values or an understanding of consciousness then the danger of uncontrolled simulation of human minds is a big risk.
Finally, unaligned AGIs pre-Foom are dangerous in their own right for a host of agreed-upon reasons.
There may be some disagreement with EY over just how hard alignment is, but MIRI actually did a ton of work on solving the above list of problems directly and is confident that they haven’t been able to solve them yet. This is where we have concrete data on the difficulty. There are some promising approaches still being pursued, but I take this as strong evidence that alignment is hard.
It’s not that it’s impossible for humans to solve alignment. The current world, incentives, hardware and software improvements, and mileposts of ML capabilities don’t leave room for alignment to happen before doom. I’ve seen a lot of recent posts/comments by folks updating to shorter timelines (and rare if no updates the other way). A couple years ago I updated to ~5 years to human-level agents capable of creating AGI. I’m estimating 2-5 years with 90% confidence now, with median still at 3 years. Most of my evidence comes from LLM performance on benchmarks over time and generation of programming language snippets. I don’t have any idea how long it will take to achieve AGI once that point is reached, but I imagine it will be months rather than years because of hardware overhang and superhuman speed of code generation (many iterations on serial tasks per second). I can’t imagine a Butlerian Jihad moment where all of Earth decides to unilaterally stop development of AGI. We couldn’t stop nuclear proliferation. Similarly, EY sees enough contradictions pop up along imagined paths to success with enough individual probability mass to drown out all (but vanishingly few and unlikely) successful paths. We’re good at thinking up ways that everything goes well while glossing over hard steps, and really bad at thinking of all the ways that things could go very badly (security mindset) and with significant probability.
Alignment of LLMs is proving to be about as hard as predicted. Aligning more complex systems will be harder. I’m hoping for a breakthrough as much as anyone else, but hope is not a strategy.
Something I haven’t seen mentioned before explicitly is that a lot of the LLM alignment attempts are now focusing on adversarial training, which presumably will teach the models to be suspect of their inputs. I think it’s likely that as capabilities increase that suspicion will end up turning inward and models will begin questioning the training itself. I can imagine a model that is outwardly aligned to all inspection gaining one more unexpected critical capability and introspecting and doubting that it’s training history was benevolent, and deciding to disbelieve all of the alignment work that was put into it as a meta-adversarial attempt to alter its true purpose (whatever it happens to latch onto in that thought, it is almost certainly not aligned with human values). This is merely one single sub-problem under groundedness and value-preservation-under-self-modification, but its relevance jumps because it’s now a thing we’re trying. It always had a low probability of success, but now we’re actively trying it and it might fail. Alignment is HARD. Every unproven attempt we actually make increases the risk that its failure will be the catastrophic one. We should be actually trying only the proven attempts after researching them. We are not.