I agree that if the criterion is to get us to utopia, it’s a problem (maybe not even that, but whatever), but if we instead say that it has to avoid x-risk, then we do have some options. My favorite research directions are IDA and HCH, with ELK as a second option for alignment. We aren’t fully finished with those ideas, but we do have at least some idea of what we can do.
Also, it’s very unlikely that theoretical work or mathematical work like provable alignment will do much beyond toy problems here.
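(For concreteness, here is a minimal toy sketch of the HCH recursion referenced above, offered purely as an illustration rather than anything from this exchange; the names `Human`, `HumanStep`, `hch`, and the depth budget are all hypothetical stand-ins.)

```python
# Toy, illustrative sketch of HCH ("Humans Consulting HCH"): a (simulated)
# human answers a question, optionally by posing sub-questions to further
# copies of the same procedure, up to a fixed recursion budget.
# All names here (Human, HumanStep, hch) are hypothetical stand-ins.

from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

@dataclass
class HumanStep:
    answer: Optional[str] = None                            # final answer, if the human has one
    subquestions: List[str] = field(default_factory=list)   # otherwise, questions to delegate

# A "human" maps (question, answered sub-questions) -> HumanStep.
Human = Callable[[str, List[Tuple[str, str]]], HumanStep]

def hch(question: str, human: Human, depth: int) -> str:
    """Answer `question` by consulting `human`, who may consult fresh copies
    of this same procedure on sub-questions, up to `depth` levels deep."""
    step = human(question, [])
    if step.answer is not None or depth == 0:
        return step.answer or "(no answer within budget)"
    # Each sub-question goes to its own copy of HCH, one level shallower.
    sub_answers = [(q, hch(q, human, depth - 1)) for q in step.subquestions]
    return human(question, sub_answers).answer or "(no answer within budget)"
```

Roughly, IDA would then train a fast model to imitate this amplified question-answering process and substitute that model for the sub-calls on the next round, while ELK is about getting such a system to honestly report what it actually knows.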
I agree that if the criterion is to get us to utopia, it’s a problem (maybe not even that, but whatever), but if we instead say that it has to avoid x-risk
This seems to misunderstand my view? My goal is to avoid x-risk, not to get us to utopia. (Or rather, my proximate goal is to end the acute risk period; my ultimate goal is utopia, but I want to punt nearly all of the utopia-work to the period after we’ve ended the acute risk period.)
Cf., from Eliezer:

“When I say that alignment is lethally difficult, I am not talking about ideal or perfect goals of ‘provable’ alignment, nor total alignment of superintelligences on exact human values, nor getting AIs to produce satisfactory arguments about moral dilemmas which sorta-reasonable humans disagree about, nor attaining an absolute certainty of an AI not killing everyone. When I say that alignment is difficult, I mean that in practice, using the techniques we actually have, “please don’t disassemble literally everyone with probability roughly 1” is an overly large ask that we are not on course to get. So far as I’m concerned, if you can get a powerful AGI that carries out some pivotal superhuman engineering task, with a less than fifty percent chance of killing more than one billion people, I’ll take it. Even smaller chances of killing even fewer people would be a nice luxury, but if you can get as incredibly far as “less than roughly certain to kill everybody”, then you can probably get down to under a 5% chance with only slightly more effort. Practically all of the difficulty is in getting to “less than certainty of killing literally everyone”. Trolley problems are not an interesting subproblem in all of this; if there are any survivors, you solved alignment. At this point, I no longer care how it works, I don’t care how you got there, I am cause-agnostic about whatever methodology you used, all I am looking at is prospective results, all I want is that we have justifiable cause to believe of a pivotally useful AGI ‘this will not kill literally everyone’. Anybody telling you I’m asking for stricter ‘alignment’ than this has failed at reading comprehension. The big ask from AGI alignment, the basic challenge I am saying is too difficult, is to obtain by any strategy whatsoever a significant chance of there being any survivors.”
mathematical work like provable alignment will do much
I don’t know what you mean by “do much”, but if you think theory/math work is about “provable alignment” then you’re misunderstanding all (or at least the vast majority?) of the theory/math work on alignment. “Is this system aligned?” is not the sort of property that admits of deductive proof, even if the path to understanding “how does alignment work?” better today routes through some amount of theorem-proving on more abstract and fully-formalizable questions.
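(As a purely illustrative toy example of the contrast meant here, mine rather than anything from the thread: a claim like the one below is fully formalizable and admits an actual proof, in a way that “is this deployed system aligned?” does not.)

```latex
% Toy example of a fully-formalizable claim (illustrative only):
% any complete, transitive preference relation $\succeq$ on a finite
% outcome set $X$ admits a utility representation.
\[
\exists\, u : X \to \mathbb{R}\ \text{s.t.}\ \forall x, y \in X:\;
x \succeq y \iff u(x) \ge u(y),
\qquad \text{e.g.}\ u(x) = \bigl|\{\, z \in X : x \succeq z \,\}\bigr|.
\]
```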