I feel like you’re saying “safety research” when the examples suggest that what corporations centrally want is “reliable control over their slaves”… that is to say, they want “alignment” and “corrigibility” research.
This has been my central beef for a long time.
Eliezer’s old Friendliness proposals were at least AIMED at the right thing (a morally praiseworthy vision of humanistic flourishing) and CEV is more explicitly trying for something like this, again, in a way that mostly just tweaks the specification (because Eliezer stopped believing that his earliest plans would “do what they said on the tin they were aimed at” and started over).
If an academic is working on AI, and they aren’t working on Friendliness, and aren’t working on CEV, and it isn’t “alignment to benevolence” or making “corrigibly seeking humanistic flourishing for all”… I don’t understand why it deserves applause lights.
(EDITED TO ADD: exploring the links more, I see “benevolent game theory, algorithmic foundations of human rights” as topics you raise. This stuff seems good! Maybe this is the stuff you’re trying to sneak in, to get more eyeballs on it via some rhetorical strategy that makes sense for your target audience?)
“The alignment problem” (without extra qualifications) is an academic framing that could easily fit in a grant proposal by an academic researcher to get funding from a slave company to make better slaves. “Alignment IS capabilities research”.
Similarly, there’s a very easy way to be “safe” from Skynet: don’t build Skynet!
I wouldn’t call a gymnastics curriculum that focused on doing flips while you pick up pennies in front of a bulldozer “learning to be safe”. Similarly, here, it seems like there’s some insane culture somewhere that you’re speaking to whose words are just systematically confused (or intentionally confusing).
Can you explain why you’re even bothering to use the euphemism of “Safety” Research? How does it ever get off the ground of “the words being used denote what naive people would think those words mean” in any way that ever gets past “research on how to put an end to all AI capabilities research in general, by all state actors, and all corporations, and everyone (until such time as non-safety research, aimed at actually good outcomes (instead of just marginally less bad outcomes from current AI), has clearly succeeded as a more important, better, and more funding-worthy target)”? What does “Safety Research” even mean if it isn’t inclusive of safety from the largest potential risks?
I think this is a good steelman of the original post. I find it more compelling.
Your “easy way to be safe,” just not building AGI, is commonly considered near-impossible. Can you point me to plans or arguments for how we can convince people not to build AGI? The arguments I’m aware of (that alignment is very, very hard, that AGIs will have the moral status of slaves, or that they’re likely to lock in a bad future) are not complete enough to be compelling even to me, let alone to technologists or politicians with their own agendas and limited attention for the arguments.
I suspect we’d be wiser not to build AGI, and definitely wiser to go slower, but I see no route to convincing enough of the world to do that.
What does “Safety Research” even mean if it isn’t inclusive of safety from the largest potential risks?
I very much agree. I don’t call my work safety research, to differentiate it from all of the stuff that may-or-may-not actually help with AGI alignment. To be fair, steering and interpretability work might contribute to building safe AGI; there’s just not a very clear plan for how it would be applied to LLM-based AGI, rather than tool LLMs. So much of it probably contributes approximately nothing (depending on how you factor in the capabilities applications) to mitigating the largest risk: misaligned AGI.
There are also grounded arguments for why alignment is unworkable, i.e. that an AGI could not control its own effects well enough to remain safe for humans.
I’ve written about this, and Anders Sandberg is currently working on mathematically formalising an elegant model of AGI uncontainability.
What’s a good overview of those grounded arguments? I looked at your writings and it wasn’t clear where to start.
Seth, I just spoke about this work at ICML yesterday. Some other similar works:
Eliezer’s work from way back in 2004: https://intelligence.org/files/CEV.pdf. I haven’t read it in full, but it’s about AIs that interact with human volition, which is what I’m also worried about.
Christiano’s: https://www.lesswrong.com/posts/HBxe6wdjxK239zajf/what-failure-looks-like. This is a lot about slow takeoffs and AIs that slowly become unstoppable or unchangeable because they become part of our economic world.
My paper on arXiv is a bit of a long read (GPT-it): https://arxiv.org/abs/2305.19223. But it tries to show where some of the weak points in human volition and intention generation are, and why we (i.e. “most developers and humanity in general”) still think of human reasoning in a mind-body dualistic framework: i.e. as if there’s a core to human thought, goal selection, and decision making that can never be corrupted or manipulated. We’ve already discovered loads of failure modes, and we weren’t even faced with omnipotent-like opponents (https://www.sog.unc.edu/sites/www.sog.unc.edu/files/course_materials/Cognitive%20Biases%20Codex.pdf). The other main point my work makes is that when you apply enough pressure on an aligned AI/AGI to find an optimal solution to an “intent” you have for a problem that is too hard to solve, the solution it will eventually find is to change the “intent” of the human.
Thank you!
The link to your paper is broken. I’ve read the Christiano piece, and some/most of the CEV paper, I think.
Any working intent alignment solution needs to prevent the AGI from deliberately changing the intent of the human. That is a solvable problem with an AGI that understands the concept.
Sorry, fixed the broken link now.
The problem with “understanding the concept of intent” is that intent and goal formation are some of the most complex notions in the universe, involving genetics, development, psychology, culture, and everything in between. We have been arguing about what intent, and correlates like “well-being,” mean for the entire history of our civilization. It looks like we have a good set of no-nos (e.g. read the UN declaration on human rights), but in terms of positive descriptions of good long-term outcomes it gets fuzzy. There we have less guidance, though I guess trans- and post-humanism seem to be desirable goals to many.
I intended to refer to understanding the concept of manipulation well enough to avoid it, if the AGI “wanted” to.
As for understanding the concept of intent, I agree that “true” intent is very difficult to understand, particularly if it’s projected far into the future. That’s a huge problem for approaches like CEV. The virtue of the approach I’m suggesting is that it entirely bypasses that complexity (while introducing new problems). Instead of inferring “true” intent, the AGI just “wants” to do what the human principal tells it to do. The human gets to decide what their intent is. The machine just has to understand what the human meant by what they said, and the human can clarify that in a conversation. I’m thinking of this as “do what I mean and check” (DWIMAC) alignment. More on this in Instruction-following AGI is easier and more likely than value aligned AGI.
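To make that check-back loop concrete, here is a minimal, purely illustrative sketch of what a DWIMAC-style instruction handler could look like. All of the names, fields, and the confidence threshold are my own assumptions for illustration, not anything specified in the linked post; the only point is that the agent acts on its best guess of what was meant, and asks for clarification instead of acting when it is unsure or the action is high-impact.

```python
# Illustrative sketch only: a "do what I mean and check" (DWIMAC) step.
# Names and thresholds are assumptions, not from the original post.

from dataclasses import dataclass


@dataclass
class Interpretation:
    action_plan: str      # what the agent thinks it was asked to do
    confidence: float     # how sure it is that this matches the instruction
    high_impact: bool     # whether the plan is costly or hard to reverse


def dwimac_step(instruction, interpret, ask_human, execute,
                confidence_threshold=0.9):
    """Act only when the inferred meaning is confident and low-impact;
    otherwise restate the plan and let the principal correct it first.
    (A fuller version would loop until confident rather than ask once.)"""
    guess = interpret(instruction)
    if guess.confidence < confidence_threshold or guess.high_impact:
        clarified = ask_human(
            f"I plan to: {guess.action_plan}. Is that what you meant?")
        guess = interpret(clarified)
    execute(guess.action_plan)
```

The design choice this is meant to show is just that the “check” is part of the policy itself: clarification is preferred over acting on an uncertain or consequential guess, so the human stays the arbiter of their own intent.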
I’ll read your article.
Thanks, I appreciate the question. The best overview I managed to write is the control problem post. It still takes quite some reading to put the different parts of the argument together, though.