One reason to be pessimistic about the “goals” and/or “values” that future ASIs will have is that “we” have a very poor understanding of “goals” and “values” right now. Like, there isn’t even widespread agreement that “goals” are a meaningful abstraction to use. Let’s put aside the object-level question of whether this would even buy us anything in terms of safety, if it were true. The mere fact of such intractable disagreements about core philosophical questions, on which substantial parts of various cases for and against doom hinge, with no obvious way to resolve them, is not something that makes me feel good about superintelligent optimization power being directed at any particular thing, whether or not some underlying “goal” is driving it.
Separately, I continue to think that most such disagreements are not True Rejections; the real crux is more often something like disbelief that we will create meaningful superintelligences at all, or that superintelligences would be able to execute a takeover or human-extinction event if their cognition were aimed at that. I would change my mind about this if I saw a story of a “good ending” in which we create a superintelligence without having confidence in its, uh… “goals”… and the story stood up to even minimal scrutiny, like “now play events forward a year; why hasn’t someone paperclipped the planet yet?”.
We do have a poor understanding of human values. That’s one more reason we shouldn’t and probably won’t try to build them into AGI.
You’re expressing a common view in the alignment community. I think we should update from that view to the more likely scenario, in which we don’t even try to align AGI to human values.
What we’re actually doing is training LLMs to answer questions as they were intended, and to follow instructions as they were intended. The AI needs to understand human values to some degree to do that, but the training is really focused on those two things, not on values themselves. There’s an interesting bit on this distinction between the theory and practice of training LLMs in this interview with Tan Zhi Xuan, and to a lesser degree in their paper.
Not only is that what we are doing for current AI; I think it’s also what we should do for future AGI, and what we probably will do. Instruction-following AGI is easier and more likely than value-aligned AGI.
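To make that concrete, here’s a minimal, purely illustrative sketch of the shape of an instruction-tuning example; the field names and text are hypothetical, not taken from the interview or the paper. The point is that the training target is “the response the instruction-giver intended,” with values entering only implicitly through what counts as an intended answer.

```python
# Hypothetical sketch of a supervised instruction-tuning example (made-up data).
# The label is the response the instruction-giver intended, not any explicit
# encoding of "human values".
instruction_tuning_example = {
    "instruction": "Summarize this email in two sentences for a busy reader.",
    "context": "Hi team, the launch is moving from Tuesday to Thursday because ...",
    "intended_response": (
        "The launch has been postponed from Tuesday to Thursday. "
        "No team action is needed before Wednesday's sync."
    ),
}

# Training then minimizes ordinary next-token cross-entropy on intended_response,
# conditioned on the instruction and context. Values show up only indirectly, via
# what labelers judge the *intended* answer to be.
```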
It’s counterintuitive to think about a highly intelligent agent that wants to do what someone else tells it. But it’s not logically incoherent.
And when the first human decides what goal to put in the system prompt of the first agent they think might ultimately surpass human competence and intelligence, there’s little doubt what they’ll put there: “follow my instructions, favoring the most recent”. Everything else is a subgoal of that non-consequentialist central goal.
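As a minimal sketch, assuming the now-common chat format of role/content messages (no particular vendor API or wrapper is implied, and the user task is made up), that might look like:

```python
# Hypothetical sketch: the agent's top-level goal lives in the system prompt, and
# everything else it does is instrumental to it. Uses the common role/content chat
# message convention; no specific API is implied.
messages = [
    {
        "role": "system",
        "content": "Follow my instructions, favoring the most recent.",
    },
    {"role": "user", "content": "Draft a project plan for the Q3 launch."},  # example task
]

# Planning, tool use, delegation, etc. are then pursued only as subgoals of the
# standing instruction-following goal stated above.
```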
This approach leaves humans in charge, and that’s a problem. Ultimately I think that sort of instruction-following intent alignment can be a stepping-stone to value alignment, once we’ve got a superintelligent instruction-following system to help us with that very difficult problem. But there’s neither a need nor an incentive to aim directly at value alignment with our first AGIs. So alignment will succeed or fail on other issues.
Separately, I fully agree that most people who don’t believe in AGI x-risk aren’t voicing their true rejection. Their real objection is usually that they don’t believe we’ll make autonomous AGI soon enough to worry about it.