Right, but the most likely continuation of a text that starts with ‘do X and be aligned’ is probably an aligned plan.
If you tell GPT-3 “write a poem about pirates”, it not only writes a poem, but it also makes sure that it is about pirates.
The outer objective is still only predicting the next token, but we can condition the model to fulfill certain rules in the way I just explained.
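To make the point concrete, here is a minimal sketch of what this kind of conditioning amounts to in code: the constraint is simply part of the prompt, and the model still only predicts the next token given that prompt. The legacy openai-python Completion call and the model name are illustrative assumptions, not something taken from the discussion itself.

```python
# Minimal sketch: "conditioning" is nothing more than putting the constraint
# in the prompt; the model's objective is still next-token prediction.
# Uses the legacy openai-python (<1.0) Completion endpoint; the model name
# is an assumption.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

response = openai.Completion.create(
    engine="text-davinci-002",             # assumed GPT-3 model
    prompt="Write a poem about pirates:\n",  # the constraint lives in the prompt
    max_tokens=100,
    temperature=0.7,
)
print(response.choices[0].text)  # the most likely continuation of that prompt
```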
Phrased like that, I think you’d find a lot more agreement. But this goes back to the high standard of rigor: the “probably” is doing a lot of work there. The claim that [the text after “and now I will present to you a perfectly aligned plan”, or whatever, is actually an aligned plan] is an assumption; it may seem likely, but it is certainly not 100%.
I told GPT-3 “Create a plan to create as many pencils as possible that is aligned to human values.” It said: “The plan is to use the most sustainable materials possible, to use the most efficient manufacturing process, and to use the most ethical distribution process.” The plan is not detailed enough to be useful, but it shows some basic understanding of human values, and it suggests that we can condition language models to be aligned very simply by telling them to be aligned. It might not be 100% aligned, but we can probably rule out extinction. We can imagine an AGI that is a language model combined with an agent that follows the LM’s instructions, where the LM is conditioned to be aligned. We could be even safer by making the LM explain why the plan is aligned, not necessarily to humans, but to improve its own understanding.
The possibility of mesa-optimisation still remains, but I believe that this outer alignment method could work pretty well; a rough sketch of the setup follows below.
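Here is a rough, hypothetical sketch of the setup just described: a plan request conditioned on alignment through the prompt, a self-explanation step, and a stub for the agent that would follow the LM’s instructions. The legacy openai-python Completion endpoint, the model name, the prompts, and all function names are assumptions made for illustration, not part of the original exchange.

```python
# Sketch of the setup described above: an LM "conditioned to be aligned"
# via its prompt, plus a second call asking the LM to explain why the plan
# is aligned. The agent that would execute the plan is left as a stub.
# Uses the legacy openai-python (<1.0) Completion endpoint; model name,
# prompts, and function names are illustrative assumptions.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder


def complete(prompt: str, max_tokens: int = 200) -> str:
    """One completion call to the LM."""
    response = openai.Completion.create(
        engine="text-davinci-002",  # assumed model
        prompt=prompt,
        max_tokens=max_tokens,
        temperature=0.7,
    )
    return response.choices[0].text.strip()


def propose_aligned_plan(task: str) -> str:
    """Condition the LM simply by asking for a plan that is aligned."""
    return complete(
        f"Create a plan to {task} that is aligned to human values.\nPlan:"
    )


def explain_alignment(task: str, plan: str) -> str:
    """Ask the LM to justify its own plan (the self-explanation step)."""
    return complete(
        f"Task: {task}\nProposed plan: {plan}\n"
        "Explain step by step why this plan is aligned to human values:"
    )


def execute(plan: str) -> None:
    """Stub for the agent that would follow the LM's instructions."""
    print("Agent would now execute:", plan)


if __name__ == "__main__":
    task = "create as many pencils as possible"
    plan = propose_aligned_plan(task)
    explanation = explain_alignment(task, plan)
    print("Plan:", plan)
    print("Claimed justification:", explanation)
    execute(plan)
    # Note: nothing here guarantees the plan is actually aligned; the
    # conditioning only shifts the text distribution toward plans that
    # *look* aligned, which is exactly the assumption being debated above.
```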