Seth. I just spoke about this work at ICML yesterday. Some other similar works:
Eliezer’s work from way back in 2004: https://intelligence.org/files/CEV.pdf. I haven’t read it in full, but it’s about AIs that interact with human volition, which is what I’m also worried about.
Christiano’s: https://www.lesswrong.com/posts/HBxe6wdjxK239zajf/what-failure-looks-like. This is a lot about slow takeoffs and AIs that slowly become unstoppable or unchangeable because they become part of our economic world.
My paper on arXiv is a bit of a long read (GPT-it): https://arxiv.org/abs/2305.19223. But it tries to show where some of the weak points in human volition and intention generation are, and why we (i.e. “most developers and humanity in general”) still think of human reasoning in a mind-body dualistic framework: i.e. that there is a core to human thought, goal selection and decision making that can never be corrupted or manipulated. We’ve already discovered loads of failure modes, and we weren’t even faced with omnipotent-like opponents (https://www.sog.unc.edu/sites/www.sog.unc.edu/files/course_materials/Cognitive%20Biases%20Codex.pdf). The other main point my work makes is that when you apply enough pressure on an aligned AI/AGI to find an optimal solution for an “intent” you have for a problem that is too hard to solve, the solution it will eventually find is to change the “intent” of the human.
Thank you!
The link to your paper is broken. I’ve read the Christiano piece. And some/most of the CEV paper, I think.
Any working intent alignment solution needs to prevent the AGI from changing the human’s intent on purpose. That is a solvable problem with an AGI that understands the concept.
Sorry, fixed the broken link now.
The problem with “understanding the concept of intent” is that intent and goal formation are some of the most complex notions in the universe, involving genetics, development, psychology, culture and everything in between. We have been arguing about what intent, and correlates like “well-being”, mean for the entire history of our civilization. It looks like we have a good set of no-nos (e.g. read the UN’s Universal Declaration of Human Rights), but in terms of positive descriptions of good long-term outcomes it gets fuzzy. There we have less guidance, though I guess trans- and post-humanism seem to be desirable goals to many.
I intended to refer to understanding the concept of manipulation well enough to avoid it if the AGI “wanted” to.
As for understanding the concept of intent, I agree that “true” intent is very difficult to understand, particularly if it’s projected far into the future. That’s a huge problem for approaches like CEV. The virtue of the approach I’m suggesting is that it entirely bypasses that complexity (while introducing new problems). Instead of inferring “true” intent, the AGI just “wants” to do what the human principal tells it to do. The human gets to decide what their intent is. The machine just has to understand what the human meant by what they said, and the human can clarify that in a conversation. I’m thinking of this as do-what-I-mean-and-check (DWIMAC) alignment. More on this in Instruction-following AGI is easier and more likely than value aligned AGI.
I’ll read your article.
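To make the DWIMAC loop concrete, here is a minimal sketch of the check-before-acting pattern, purely as an illustration: the helpers ask_model, ask_principal and execute are hypothetical placeholders supplied by the caller, not part of any existing system.

```python
# Toy sketch of a "do what I mean and check" (DWIMAC) step.
# ask_model, ask_principal and execute are hypothetical callables
# supplied by the caller; nothing here refers to a real system.

def dwimac_step(instruction, ask_model, ask_principal, execute):
    """Interpret an instruction, confirm the interpretation with the
    human principal, and only act once they approve it."""
    # 1. Propose the system's best reading of what the human meant.
    interpretation = ask_model(
        "Restate, in plain terms, what the user wants done: " + instruction
    )
    # 2. Check: the principal confirms or corrects that reading.
    answer = ask_principal(
        "I understood your request as: " + interpretation + " Is that right? "
    )
    if answer.strip().lower().startswith("y"):
        # 3. Act only on the confirmed interpretation.
        return execute(interpretation)
    # Otherwise fold the correction back in and check again.
    return dwimac_step(
        instruction + "\nClarification from principal: " + answer,
        ask_model, ask_principal, execute,
    )

# Example wiring with stand-ins: dwimac_step("tidy my desktop", my_llm, input, print)
```

The only point of the sketch is that the “check” sits between interpretation and action; everything else is a stand-in.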
There are also grounded arguments for why alignment is unworkable, i.e. that an AGI could not control its effects well enough to remain safe to humans.
I’ve written about this, and Anders Sandberg is currently working on mathematically formalising an elegant model of AGI uncontainability.
What’s a good overview of those grounded arguments? I looked at your writings and it wasn’t clear where to start.
Thanks, I appreciate the question. The best overview I managed to write was the control problem post. It still takes quite a bit of reading to put the different parts of the argument together, though.