That all makes sense. This also sounds like you’re thinking of aligning an AGI, while I’m thinking of aligning the ASI that AGI will self-improve to become. In particular, I expect a level of reflective consistency from ASI that humans don’t have. I think that’s a central crux of alignment difficulty—can we just craft datasets and behaviors that would produce ethical behavior in something like an LLM, or do we need to grapple with how a superintelligent mind might understand the world and its goals after superhuman reflection and autonomous learning? I tend to think it’s the latter. I don’t think that rules out the approach you describe as one helpful component, but it does make the question harder.
Agreed: and if this proceeds on the timelines I’m currently expecting, I’m looking forward to discussing all this with AGIs smarter than me, perhaps later this decade.
Quite possibly, some small number of groups will separately create semi-aligned AGIs with different alignment approaches and somewhat different definitions of alignment. I’m hoping the resulting conflict is a vigorous intellectual debate informed by experimental results, not a war.
I share that hope, but I want to do as much as I can now to ensure that outcome. Highly convincing arguments that an approach leads with high likelihood to catastrophic war might actually make people take a different approach. If such arguments exist, I want to find them and spread them ASAP. I see no reason to believe such arguments don’t exist. Even decent arguments about the risks might steer people away from those approaches, or prompt solutions faster.
More specifics on the other thread.