In some ways this doesn’t matter. For as long as there is no AGI disaster, AGI timelines are also timelines to commercial success and abundance, by which point AGIs are collectively in control. The problem is that even if AGIs are useful and apparently aligned in current behavior (if that somehow works out and there is no disaster before then), they still by default remain misaligned in the long term, in the goals they settle on after reflecting on what those goals should be. They are motivated to capture the option to do that, and being put in control of a lot of the infrastructure makes it easy and doesn’t even require coordination. There are some stories about that.
This could be countered by steering the long-term goals and managing current alignment security, but it’s unclear how to do that at all, and by the time AGIs are a commercial success it’s too late, unless the AGIs that are aligned in current behavior can be leveraged to solve such problems in time. Which is itself unclear.
This sort of failure probably takes away the cosmic endowment, but might preserve human civilization in a tiny corner of the future if there is a tiny bit of sympathy/compassion in AGI goals, which is plausible for goals built out of training on human culture, or if sympathy is part of the generic values that most CEV processes starting from disparate initial volitions settle on. This can’t work out for AGIs with reflectively stable goals that hold no sympathy, so that’s a bit of apparent alignment that can backfire.