I think the natural language alignment approach is promising and under-explored. Most of my recent work is about this, directly or indirectly.
I think something like ISS would be a good addition to the goals given to a language-using agent. It would have to be a secondary goal, since ISS doesn’t seem to include core ethics or goals, so it only works if you can give an agent multiple goals. I think we may be able to do that with language model agents, but there’s some uncertainty.
One common move is to hope that all of this is implied by making an agent that wants to do what its humans want. This sometimes goes by the name of corrigibility (in the broader Christiano sense, not the original Yudkowsky sense). If an agent wants to do what you want, and you’ve somehow defined that properly, it will want to communicate clearly and without coercion to figure out what you want. But defining such a thing properly and making it an AGI’s goal is tricky.
I think the piece the alignment community will want to see you address is what the AGI’s actual goals are, and how we make sure those really are its goals.
This piece is my best attempt to summarize the knowing/wanting distinction. All of my recent posts address these issues.