Maybe corrigibility / task AGI / etc. is harder than CEV, but it just doesn’t seem realistic to me to try to achieve full, up-and-running CEV with the very first AGI systems you build, within a few months or a few years of humanity figuring out how to build AGI at all.
The way I imagine the win scenario is, we’re going to make a lot of progress in understanding alignment before we know how to build AGI. And, we’re going to do it by prioritizing understanding alignment modulo capability (the two are not really possible to cleanly separate, but it might be possible to separate them enough for this purpose). For example, we can assume the existence of algorithms with certain properties, s.t. these properties arguably imply the algorithms can be used as building-blocks for AGI, and then ask: given such algorithms, how would we build aligned AGI? Or, we can come up with some toy setting where we already know how to build “AGI” in some sense, and ask, how to make it aligned in that setting? And then, once we know how to build AGI in the real world, it would hopefully not be too difficult to translate the alignment method.
One caveat in all this is, if AGI is going to use deep learning, we might not know how to apply the lesson from the “oracle”/toy setting, because we don’t understand what deep learning is actually doing, and because of that, we wouldn’t be sure where to “slot” it in the correspondence/analogy s.t. the alignment method remains sound. But, mainstream researchers have been making progress on understanding what deep learning is actually doing, and IMO it’s plausible we will have a good mathematical handle on it before AGI.
And rushing CEV and getting it only 95% correct poses far larger s-risks than rushing low-impact non-operator-modeling strawberry AGI and getting it only 95% correct.
I’m not sure whether you mean “95% correct CEV has a lot of S-risk” or “95% correct CEV has a little S-risk, and even a tiny amount of S-risk is terrifying”? I think I agree with the latter but not with the former. (How specifically does 95% CEV produce S-risk? I can imagine something like “AI realizes we want a non-zero amount of pain/suffering to exist, somehow miscalibrates the amount and creates a lot of pain/suffering” or “AI realizes we don’t want to die, and focuses on this goal at the expense of everything else, preserving us forever in a state of complete sensory deprivation”. But these scenarios don’t seem very likely?)
I’m not sure whether you mean “95% correct CEV has a lot of S-risk” or “95% correct CEV has a little S-risk, and even a tiny amount of S-risk is terrifying”?
The latter, as I was imagining “95%”.