Hmm, thinking about it more I guess I’d say that “alignment” is not a binary. Like maybe:
Good alignment: Algorithm is helping the human work faster and better, while not doing anything more dangerous than what the human would have done by themselves without AI assistance.
Even better alignment: Algorithm is trying to maximize the operator’s synthesized preferences / trying to implement CEV / whatever.
One thing is, there’s a bootstrapping thing here: if AI alignment researchers had AIs with “good alignment”, that would help them make AIs with “even better alignment”.
Another thing is, I dunno, I feel like having a path that definitely gets us to “good alignment” would be such fantastic wonderful progress that I would want to pop some champagne and sing the praises of whoever figured that out. That’s not to say that we can all retire and go to the beach, but I think there’s a legitimate sense in which this kind of advance would solve a huge and important class of problems.
Like, maybe we should say that “radically superhuman safety and benevolence” is a different problem than “alignment”? We still want to solve both of course. The pre-AGI status quo has more than its share of safety problems.
Alignment is always contextual with respect to the social norms of the time. We haven’t had AI for that long, so people assume the alignment problem is a solve-it-once-and-for-all type of thing instead of an ever-changing problem.
It’s similar in nature to how new technologies get tested before mass adoption. Technologies have been massively adopted before their safety was thoroughly researched (asbestos and radiation, for example), because you can only do so much testing before demand and people’s impatience push them toward ubiquity. When we fail to find alternatives that meet the new demands, a technology will be massively adopted regardless of its consequences. AI can be thought of as just an extension of the computer, specialized to certain tasks. The underlying technology is fundamentally the same; what has mostly changed is how it’s used, because of the improved efficiency. The computer has already seen mass adoption, but it’s no longer the same computer people were using 30 or even 20 years ago. Most new technologies aren’t even close to as multipurpose as the computer, so we’re dealing with an unprecedented kind of mass-adoption event in human history, one where the technology itself is closely tied to how it’s used and to the ever-changing kinds of computation people at the time decide to run on it.
I’d say alignment should be about values, so only your “even better alignment” qualifies. Non-agentic AI safety concepts like corrigibility, which might pave the way to aligned systems if the controllers manage to keep their values throughout the process, are not themselves examples of alignment.
Like, maybe we should say that “radically superhuman safety and benevolence” is a different problem than “alignment”?
Ah, you mean that “alignment” is a different problem than “subhuman and human-imitating training safety”? :P
So is there a continuum between category 1 and category 2? The transitional fossils could be non-human-imitating AIs that are trying to be a little bit general or have goals that refer to a model of the human a little bit, but the designers still understand the search space better than the AIs.
Ah, you mean that “alignment” is a different problem than “subhuman and human-imitating training safety”? :P
“Quantilizing from the human policy” is human-imitating in a sense, but also superhuman. At least modestly superhuman—depends on how hard you quantilize. (And maybe very superhuman in speed.)
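For concreteness, a q-quantilizer roughly means: draw candidates from the base distribution, here an imitation of the human policy, and sample from the top q fraction by utility instead of taking the argmax. A minimal illustrative sketch, with made-up names, not anyone’s actual implementation:

```python
import numpy as np

def quantilize(candidate_actions, utility, q=0.1, rng=None):
    """Sample an action roughly the way a q-quantilizer would (illustrative only).

    candidate_actions: actions drawn from the base distribution
        (imagined here to be an imitation of the human policy).
    utility: function scoring an action; higher is better.
    q: fraction of the base distribution to keep. q=1 just imitates the
        base policy; smaller q pushes harder toward high-utility actions.
    """
    rng = rng or np.random.default_rng()
    ranked = sorted(candidate_actions, key=utility, reverse=True)
    top = ranked[: max(1, int(np.ceil(q * len(ranked))))]
    # Pick uniformly among the top-q candidates instead of argmaxing, so no
    # single action gets boosted by more than ~1/q over its base probability.
    return top[rng.integers(len(top))]
```

The point is that the output distribution stays anchored to whatever the base distribution was; the “superhuman” part comes only from throwing away the lower-scoring portion of it.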
If you could fork your brain state to create an exact clone, would that clone be “aligned” with you? I think that we should define the word “aligned” such that the answer is “yes”. Common sense, right?
Seems to me that if you say “yes it’s aligned” to that question, then you should also say “yes it’s aligned” to a quantilize-from-the-human-policy agent. It’s kinda in the same category, seems to me.
Hmm, Stuart Armstrong suggested here that “alignment is conditional: an AI is aligned with humans in certain circumstances, at certain levels of power.” So then maybe as you quantilize harder and harder, you get less and less confident in that system’s “alignment”?
(I’m not sure we’re disagreeing about anything substantive, just terminology, right? Also, I don’t actually personally buy into this quantilization picture, to be clear.)
Yup, I more or less agree with all that. The name thing was just a joke about giving things we like better priority in namespace.
I think quantilization is safe when it’s a slightly “lucky” human-imitation (also if it’s a slightly “lucky” version of some simpler base distribution, but then it won’t be as smart). But push too hard (which might not be very hard at all if you’re iterating quantilization steps rather than quantilizing over a long-term policy) and instead you get an unaligned intelligence that happens to interact with the world by picking human-like behaviors that serve its purposes. (Vanessa pointed out to me that timeline-based DRL gets around the iteration problem because it relies on the human as an oracle for expected utility.)
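For a rough sense of why iterating pushes harder than it looks (assuming I have the standard quantilizer bound right): one q-quantilization step never gives any action more than 1/q times its probability under the base human-imitative policy P, but quantilizing independently at each of T steps lets the trajectory distribution Q drift exponentially far from P, whereas quantilizing once over complete long-term policies keeps the overall ratio at 1/q (here a_t is the step-t action and h_t the history):

$$
Q(a_t \mid h_t) \le \frac{1}{q}\, P(a_t \mid h_t) \ \text{for each step} \quad\Longrightarrow\quad Q(a_1,\dots,a_T) \le \frac{1}{q^{T}}\, P(a_1,\dots,a_T).
$$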