I think the natural language alignment approach is promising and under-explored. Most of my recent work is about this, directly or indirectly.
I think something like the ISS would be a good addition to the goals given to a language-using agent. It would have to be a secondary goal, since the ISS doesn't seem to include core ethics or goals; but it's a good secondary goal if you can give an agent multiple goals. I think we may be able to do that with language model agents, though there's some uncertainty. (A hedged toy sketch of what that might look like follows.)
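For concreteness, here is a minimal, purely illustrative sketch (not from the post) of one way a language model agent might be given a primary goal plus a secondary, ISS-inspired goal via its prompt. The names (`AgentGoals`, `build_system_prompt`) and the example goal text are hypothetical; this is a toy for framing the idea, not a proposed implementation.

```python
# Toy sketch: a prompted agent with one primary goal and ordered secondary goals.
# Everything here is illustrative; the goal wording is an assumption, not a spec.

from dataclasses import dataclass, field


@dataclass
class AgentGoals:
    """A primary task goal plus lower-priority secondary goals."""
    primary: str
    secondary: list[str] = field(default_factory=list)


def build_system_prompt(goals: AgentGoals) -> str:
    """Render the goals into a system prompt, making the priority order explicit."""
    lines = [
        "You are an assistant agent.",
        f"Primary goal: {goals.primary}",
        "Secondary goals (these never override the primary goal):",
    ]
    lines += [f"- {g}" for g in goals.secondary]
    return "\n".join(lines)


if __name__ == "__main__":
    goals = AgentGoals(
        primary="Do what your principal wants; check with them when unsure.",
        secondary=[
            # ISS-inspired norms: clear, sincere, non-coercive communication.
            "Communicate clearly, sincerely, and without coercion.",
            "Let affected parties raise questions and objections.",
        ],
    )
    print(build_system_prompt(goals))
```

Whether goals expressed this way actually function as the agent's goals is, of course, exactly the open question raised below.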
One common move is to hope all of this is implied by making an agent that wants to do what its humans want. This sometimes goes by the name of corrigibility (in the broader Christiano sense, not the original Yudkowsky sense). If an agent wants to do what you want, and you've somehow defined this properly, it will want to communicate clearly and without coercion to figure out what you want. But defining such a thing properly and making it an AGI's goal is tricky.
I think the piece the alignment community will want to see you address is what the AGI's actual goals are, and how we make sure those really are its goals.
This piece is my best attempt to summarize the knowing/wanting distinction. All of my recent work addresses these issues.
“The ISS doesn’t seem to include core ethics and goals”
This is actually untrue. Habermas's contention is that literally all human values and universal norms derive from the ISS; I intend to walk through some of that argument in a future post. What's important is that this approach grounds all human values, as such, in language use. I think Habermas likely underrates biology's contribution to human values, but that is an advantage when thinking about aligning AIs that operate around a core of competent language use. The point is that, on my contention, a Habermasian AGI wouldn't kill everyone. It might be hard to build, but in principle, if you did build one, it would be aligned.
I think you should emphasize this more, since that's typically what alignment people think about. Which of the ISS statements do you take to imply values we'd like?
The more standard thinking is that human values develop from our innate drives, which include prosocial drives. See Steve Byrnes' work, particularly the intro to his brain-like AGI sequence. And that process isn't guaranteed to produce an aligned human.
It's hard for me to write well for an audience I don't know well. I went through a number of iterations of this just trying to lay out the conceptual contours of such a research direction in a single post that's clear and coherent. I have about five follow-up posts planned, and hopefully I'll keep going. But the premise is: "here's a stack of roughly ten things that we want the AI to do; if it does these things, it will be aligned. Further, this is all rooted in language use rather than biology, which seems useful because AI is not biological." Actually getting an AI to conform to those things is a nightmarish challenge, but it seems useful to have a coherent conceptual framework that defines exactly what alignment is and can explain why those ten things and not some others. In other words, my essential thesis is that, at a high level, reframing the alignment problem in Habermasian terms makes it appear tractable.
I’m trying to be helpful by guessing at the gap between what you’re saying and this particular audience’s interests and concerns. You said this is your first post, it’s a new account, and the post didn’t get much interest, so I’m trying to help you guess what needs to be addressed in future posts or edits.
I apologize if I'm coming off as combative; I'm genuinely appreciative of the help.