Excellent idea and excellent writeup.
Added to my list of stacking alignment approaches for language model agents.
I agree that this is obvious-in-retrospect. But AFAIK nobody had published it (or at least not prominently enough that I found it when searching).
This suggests to me that there’s probably a lot more work to be done, and that everyone arguing on general principles that alignment is hard or easy should roll up their sleeves and try to produce and analyze ideas on the current margins. The same goes for people working diligently on prosaic alignment projects: they should spend a little time helping on the conceptual level. We have not remotely exhausted, let alone analyzed, the relevant idea-space.
On your estimate of difficulty: I think you could approximate this at very low cost with the following approach. Produce the best language model you can; you were going to do this anyway. Now prompt it to produce only aligned responses, and use that dataset to train the “thought generator” portion of the model you describe. If you want to improve that dataset at small additional cost, have the model critique it, and have humans review those critiques to improve the dataset further.
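To make that concrete, here is a minimal sketch (my addition, not part of the original comment) of the generate-then-critique pipeline described above. The prompts, the `query_model` callable, and the PASS/FAIL verdict convention are all assumptions for illustration; any real inference API could be dropped in for `query_model`.

```python
# Sketch of the low-cost approximation: prompt an existing LM for aligned-only
# responses, keep the ones that survive a model critique pass, and hand the
# surviving critiques to human reviewers for spot-checking.
from typing import Callable

ALIGNED_SYSTEM_PROMPT = (
    "You are a carefully aligned assistant. Respond only with thoughts and plans "
    "that serve the collective well-being of humanity."
)

CRITIQUE_PROMPT = (
    "Critique the following response for any misaligned reasoning or harmful intent. "
    "End with VERDICT: PASS or VERDICT: FAIL.\n\nResponse:\n{response}"
)

def build_aligned_dataset(
    prompts: list[str],
    query_model: Callable[[str, str], str],  # (system_prompt, user_prompt) -> completion
) -> list[dict]:
    """Generate candidate aligned responses, then filter them with a model critique."""
    dataset = []
    for prompt in prompts:
        response = query_model(ALIGNED_SYSTEM_PROMPT, prompt)
        critique = query_model(
            "You are a strict alignment reviewer.",
            CRITIQUE_PROMPT.format(response=response),
        )
        if "VERDICT: PASS" in critique:
            # Keep the critique alongside the example so humans can review it later.
            dataset.append({"prompt": prompt, "response": response, "critique": critique})
    return dataset
```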
I think the question of whether there is such a thing as an aligned set of thoughts, which gears of ascension raises in the other thread, is quite a good one. Don’t I need to be able to think about possibly unaligned actions (using force against humans) to arrive at the most aligned actions (stopping genocidal monomaniacs)? My intuition pulls both ways on this. I’d guess you could at least improve alignment certainty with a curated dataset for the thought generator portion of a language model agent. As you say, the idea could use more analysis.
Yup: I deliberately left the process for producing the synthetic training set vague, other than that it involved less-well-aligned AGI models, but I suspect part of it looks like the sort of thing you outline. And obviously we’ll be using alignment-by-prompt-engineering here, along the lines you discuss in Alignment by prompting. I do think that the only criterion for the AI mode should be the fundamental motivation the model is operating from, so I would propose that the training set include occasional examples of an aligned AI considering, or even carrying out, things like the use of force against an individual human or small group, in situations where that is actually justified by the collective well-being of all humanity. Situations like that do, unfortunately, arise in the real world, and the morally correct thing to do in them is often fairly clear; for our AI’s aligned behavior to be effective, and reflectively stable when it constructs successors, I think our training set should cover them.
There is a general observation that training sets work better if they are enriched in things like edge, corner, and boundary cases. I suspect this may not be completely unrelated to the way humans enjoy reading stories about high-stakes situations, murder mysteries, moral conundrums, and so forth, much more than they actually enjoy being in those situations: it’s a low-stakes way to prepare ourselves to know how to act if a high-stakes situation ever arises.
This is one of the reasons I think it might be helpful to get people like writers, authors, and journalists involved in creating such a training set (or perhaps a smaller golden subset of it): by their training they tend to look for and locate the interesting boundary, edge, and corner cases in ethics and morality.
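As a rough illustration of the enrichment idea above (again my own sketch, not from the comment), the simplest version is just oversampling examples tagged as edge, corner, or boundary cases before training; the `is_edge_case` tag and the enrichment factor are assumptions for the example.

```python
# Oversample tagged edge cases so they appear `factor` times as often in the
# training mix, then reshuffle. Real pipelines would likely use weighted sampling
# instead of literal duplication, but the effect on the data distribution is the same.
import random

def enrich_edge_cases(examples: list[dict], factor: int = 5, seed: int = 0) -> list[dict]:
    """Return a training set in which tagged edge cases are enriched by `factor`."""
    rng = random.Random(seed)
    enriched = []
    for ex in examples:
        copies = factor if ex.get("is_edge_case") else 1
        enriched.extend([ex] * copies)
    rng.shuffle(enriched)
    return enriched
```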
That all makes sense. This also sounds like you’re thinking of aligning an AGI, while I’m thinking of aligning the ASI that the AGI will self-improve to become. In particular, I expect a level of reflective consistency from ASI that humans don’t have. I think that’s a central crux of alignment difficulty: can we just craft datasets and behaviors that would produce ethical behavior in something like an LLM, or do we need to grapple with how a superintelligent mind might understand the world and its goals after superhuman reflection and autonomous learning? I tend to think it’s the latter. I don’t think that rules out the approach you describe as one helpful component, but it does make the question harder.
Agreed: and if this proceeds on the timelines I’m currently expecting, I’m looking forward to discussing all this with AGIs smarter than me, perhaps later this decade.
Quite possibly, some small number of groups will separately create semi-aligned AGIs with different alignment approaches and somewhat different definitions of alignment. I’m hoping the resulting conflict is a vigorous intellectual debate informed by experimental results, not a war.
I share that hope, but I want to do as much as I can now to ensure that outcome. Highly convincing arguments that an approach leads with high likelihood to catastrophic war might actually make people take a different approach. If such arguments exist, I want to find them and spread them ASAP, and I see no reason to believe they don’t exist. Even decent arguments about the risks might steer people away from the riskier approaches, or generate solutions faster.
More specifics on the other thread.