We do have a poor understanding of human values. That’s one more reason we shouldn’t and probably won’t try to build them into AGI.
You’re expressing a common view among the alignment community. I think we should update from that view to the more likely scenario in which we don’t even try to align AGI to human values.
What we’re actually doing is training LLMs to answer questions as they were intended, and to follow instructions as they were intended. The AI needs to understand human values to some degree to do that, but training is really focused on those things. There’s an interesting bit on this distinction between the theory and practice of training LLMs in this interview with Tan Zhi Xuan, and, to a lesser degree, in their paper.
Not only is that what we’re doing for current AI; I think it’s also what we should do for future AGI, and what we probably will do. Instruction-following AGI is easier and more likely than value-aligned AGI.
It’s counterintuitive to think about a highly intelligent agent that wants to do what someone else tells it. But it’s not logically incoherent.
And when the first human decides what goal to put in the system prompt of the first agent they think might ultimately surpass human competence and intelligence, there’s little doubt what they’ll put there: “follow my instructions, favoring the most recent”. Everything else is a subgoal of that non-consequentialist central goal.
This approach leaves humans in charge, and that’s a problem. Ultimately I think that sort of instruction-following intent alignment can be a stepping-stone to value alignment, once we’ve got a superintelligent instruction-following system to help us with that very difficult problem. But there’s neither a need nor an incentive to aim directly at that with our first AGIs. So alignment will succeed or fail on other issues.
Separately, I fully agree that most people who don’t believe in AGI x-risk aren’t stating their true rejection. Usually their real objection is that they don’t believe we’ll make autonomous AGI soon enough to worry about it.