Big upvote for summarizing the main arguments in your own words. Getting clarity on the arguments for alignment difficulty is critical to picking a metastrategy for alignment.
I think most of the difficulties in understanding and codifying human values are irrelevant to the alignment problem as most of us understand it: getting good results from AGI.
I’m glad to see you recognize this in your section “Alignment might not be required for real-world performance compatible with human values, but this is still hard and impacts performance”.
I think it’s highly unlikely that the first AGI will be launched with anything like CEV or human flourishing as its alignment target. The sensible and practical thing to do is make an AGI that wants to do what I want it to do (where “I” is the team making the AGI).
That value target postpones all of the hard issues in understanding and codifying human values, and most of the difficulties in resolving conflicts among different humans.
I recently wrote “Corrigibility or DWIM is an attractive primary goal for AGI”. The more I think about it, the more I think it’s overwhelmingly attractive.
It is still hard to convey “do what this guy says, and check with him if it’s high impact or you’re not sure what he meant” if you only have reinforcement signals to convey those concepts. But if you have natural language (as in aligning a language model agent), it’s pretty easy and straightforward.
You don’t need a precise specification of “DWIM and check”, because you can keep tinkering with your instructions when your AGI asks you for clarification.
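To make that concrete, here’s a minimal sketch of a “DWIM and check” loop for a language model agent, in Python. Everything in it is illustrative rather than any real API: `llm` stands in for whatever model call the agent uses, `Proposal` for its structured output, and `ask_principal` for however the human actually answers. The point is just that the target is stated in plain language, and that ambiguity or high impact routes back to the principal.

```python
# Illustrative sketch only: llm(), Proposal, and ask_principal are hypothetical
# stand-ins, not any particular model API.
from dataclasses import dataclass

INSTRUCTIONS = (
    "Do what the principal asks. Before acting, ask the principal for "
    "clarification if the action is high impact or you are unsure what they meant."
)

@dataclass
class Proposal:
    action: str
    needs_check: bool      # the model flags high-impact or ambiguous requests
    question: str = ""

def llm(system_prompt: str, user_msg: str) -> Proposal:
    """Placeholder for a real language-model call returning a structured proposal."""
    raise NotImplementedError

def dwim_and_check(task: str, ask_principal) -> str:
    """Run one task, routing back to the principal whenever the model is unsure."""
    proposal = llm(INSTRUCTIONS, task)
    while proposal.needs_check:
        answer = ask_principal(proposal.question)   # e.g. input() in a toy setup
        proposal = llm(INSTRUCTIONS, f"{task}\nPrincipal clarified: {answer}")
    return proposal.action
```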
So I expect actual alignment attempts to follow that path, and thereby duck the vast majority of the difficulties you describe. This isn’t to say the project will be easy, just that the challenges will fall elsewhere, in the more technical aspects of aligning a specific design of AGI.