Yup — I deliberately left the process for producing the synthetic training set vague, other than that it involved less-well-aligned AGI models, but I suspect part of it looks like the sort of thing you outline. And obviously we’ll be using alignment-by-prompt-engineering here, along the lines you discuss in Alignment by prompting. I do think that the only criterion for the AI mode should be the fundamental motivation the model is operating from: so I would propose that the training set contain occasional examples of an aligned AI considering, or even carrying out, things like the use of force against an individual human or small group, in situations where that is actually justified by the collective well-being of all humanity. Situations like that do, unfortunately, arise in the real world; the morally correct thing to do in them is often fairly clear; and for our AI’s aligned behavior to be effective and reflectively stable when it constructs successors, I think our training set should cover them.
There is a general observation that training sets work better if they are enriched in things like edge, corner, and boundary cases. I suspect this may not be completely unrelated to the way humans enjoy reading stories about high-stakes situations, murder mysteries, moral conundrums and so forth — much more than they actually enjoy being in those situations: it’s a low-stakes way to prepare ourselves to know how to act if a high-stakes situation ever arises.
This is one of the reasons I think it might be helpful to get people like writers, authors, and journalists involved in creating such a training set (or perhaps a smaller golden set for it): by their training they tend to look for and locate the interesting boundary, edge, and corner cases in ethics and morality.
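For concreteness, here is a minimal sketch of what “enriching the set in edge cases” could look like mechanically, assuming a simple JSONL file of scenario/response records with an is_edge_case flag; the field names, the helper functions, and the 4x up-weighting are illustrative assumptions on my part, not anything specified in this thread.

```python
# Minimal sketch: up-weight edge/corner/boundary cases when sampling a
# synthetic fine-tuning set. Record fields and weights are illustrative.
import json
import random

def load_examples(path):
    """Read a JSONL file of {"scenario": ..., "response": ..., "is_edge_case": bool} records."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def build_training_set(examples, size, edge_case_weight=4.0, seed=0):
    """Sample `size` records with replacement, drawing edge cases
    `edge_case_weight` times more often than routine ones."""
    rng = random.Random(seed)
    weights = [edge_case_weight if ex["is_edge_case"] else 1.0 for ex in examples]
    return rng.choices(examples, weights=weights, k=size)

# Usage: enriched = build_training_set(load_examples("synthetic.jsonl"), size=100_000)
```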
That all makes sense. This also sounds like you’re thinking of aligning an AGI, while I’m thinking of aligning the ASI that AGI will self-improve to become. In particular, I expect a level of reflective consistency from ASI that humans don’t have. I think that’s a central crux of alignment difficulty: can we just craft datasets and behaviors that would produce ethical behavior in something like an LLM, or do we need to grapple with how a superintelligent mind might understand the world and its goals after superhuman reflection and autonomous learning? I tend to think it’s the latter. I don’t think that rules out the approach you describe as one helpful component, but it does make the question harder.
Agreed: and if this proceeds on the timelines I’m currently expecting, I’m looking forward to discussing all this with AGIs smarter than me, perhaps later this decade.
Quite possibly, some small number of groups will separately create semi-aligned AGIs with different alignment approaches and somewhat different definitions of alignment. I’m hoping the resulting conflict is a vigorous intellectual debate informed by experimental results, not a war.
I share that hope, but I want to do as much as I can now to ensure that outcome. Highly convincing arguments that an approach leads with high likelihood to catastrophic war might actually make people take a different approach. If such arguments exist, I want to find them and spread them ASAP, and I see no reason to believe they don’t exist. Even decent arguments about the risks might steer people away from risky approaches, or generate solutions faster.
More specifics on the other thread.