Pretty sure OpenPhil and OpenAI currently try to fund plans that claim to look like this (e.g. all the ones orthonormal linked in the OP), though I agree that they could try increasing the financial reward by 100x (e.g. a prize) and see what that inspires.
If you want to understand why Eliezer doesn’t find the current proposals feasible, his best writeups critiquing them specifically are this long comment, containing his high-level disagreements with Alex Zhu’s FAQ on Iterated Amplification, and this response post on the details of Iterated Amplification.
As I understand it, the high-level summary (naturally Eliezer can correct me) is that (a) corrigible behaviour is very unnatural and hard to find (most nearby things in mindspace are not in equilibrium and will move away from corrigibility as they reflect/improve), and (b) using complicated recursive setups with gradient descent to do supervised learning is incredibly chaotic and hard to manage, and shouldn’t be counted on to work without major testing and delays (i.e. could not be competitive).
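To make (b) a little more concrete, here is a minimal toy sketch of the kind of recursive setup being referred to. The Model class, human_with_assistants, and sample_questions are hypothetical placeholders rather than anyone’s actual implementation; a real system would run gradient descent where this sketch just stores data.

```python
import random


class Model:
    """Stand-in for an ML model trained by supervised learning."""

    def __init__(self):
        self.dataset = []

    def fit(self, examples):
        # In a real system this would be gradient descent on (question, answer)
        # pairs; here we just store the data to keep the sketch runnable.
        self.dataset.extend(examples)

    def answer(self, question):
        # Placeholder inference.
        return f"model answer to {question!r} (trained on {len(self.dataset)} examples)"


def human_with_assistants(question, assistants):
    """A human answers a question, delegating subquestions to model assistants."""
    subanswers = [m.answer(f"subquestion of {question!r}") for m in assistants]
    return f"human-composed answer built from {len(subanswers)} subanswers"


def sample_questions(n):
    return [f"question #{random.randrange(10_000)}" for _ in range(n)]


def iterated_amplification(rounds=3, assistants_per_round=2, batch=4):
    model = Model()
    for _ in range(rounds):
        # Amplify: a human plus copies of the current model generate labels.
        labels = [
            (q, human_with_assistants(q, [model] * assistants_per_round))
            for q in sample_questions(batch)
        ]
        # Distill: supervised learning on the amplified behaviour; the distilled
        # model becomes next round's assistant, which is where the recursion
        # comes in.
        new_model = Model()
        new_model.fit(labels)
        model = new_model
    return model


if __name__ == "__main__":
    print(iterated_amplification().answer("what should I do next?"))
```

The concern in (b) is roughly that, because each round’s labels depend on the previous round’s distilled model, a loop like this is chaotic and hard to manage, hence the worry about major testing and delays.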
There’s also a more subtle and implicit disagreement that hasn’t been quite worked out but feeds into the above: a lot of the ML-focused alignment strategies contain the idea that we will be able to expose an ML system’s thought processes to humans in a transparent and inspectable way, check whether it has corrigibility, alignment, and intelligence, and then add those up like building blocks. My read is that Eliezer finds this to be an incredible claim that would be a truly dramatic update if there were a workable proposal for it, whereas many of the proposals above take its feasibility more as a starting assumption, use it in a recursive setup, and then alter the details of that setup to patch any subsequent problems.
For more hashed-out detail on that subtle disagreement, see the response post linked above, which has several concrete examples.
(Added: Here’s the state of discussion on site for AI safety via debate, which has a lot of overlap with Iterated Amplification. And here are all the posts on Iterated Amplification. I should make a tag for CIRL...)
Perhaps Eliezer can interject here, but it seems to me like these are not knockdown criticisms that such an approach can’t “possibly, possibly, possibly work”—just reasons that it’s unlikely to and that we shouldn’t rely on it working.
My model is that those two are the well-operationalised disagreements and thus productive to focus on, but that most of the despair is coming from the third and currently more implicit point.
Stepping back, the baseline is that most plans cross dozens of kill-switches without realising it (e.g. Yann LeCun’s “objectives can be changed quickly when issues surface”).
Then there are more interesting proposals that require being able to fully inspect the cognition of an ML system, have it be fully introspectively clear, and then use it as a building block to build stronger, competitive, corrigible, and aligned ML systems. I think this is an accurate description of Iterated Amplification + Debate, as Zhu says in section 1.1.4 of his FAQ, and I think something very similar is what Chris Olah is excited about re: microscopes, i.e. reverse engineering the entire codebase/cognition of an ML system.
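To show the shape of that assumption, here is a heavily hedged sketch of the “inspect the cognition, then use it as a building block” step. Every function here (transparency_report, passes_inspection, distill_stronger_model) is a hypothetical placeholder, and the contentious part is precisely whether anything like the first two could actually be implemented for competitive ML systems.

```python
def transparency_report(model):
    """Hypothetical: a fully inspectable account of the model's cognition."""
    return {"corrigible": True, "aligned": True, "capable": True}  # placeholder


def passes_inspection(report):
    # The load-bearing assumption: that corrigibility, alignment, and capability
    # can be read off a transparent view of the model and verified directly.
    return all(report[k] for k in ("corrigible", "aligned", "capable"))


def distill_stronger_model(trusted_building_block):
    """Hypothetical: train a more capable model using the trusted one as a component."""
    return trusted_building_block  # placeholder


def build_up(model, steps=3):
    for _ in range(steps):
        if not passes_inspection(transparency_report(model)):
            raise RuntimeError("halt: cognition not verifiably corrigible/aligned")
        # Only a verified model gets used as the next building block.
        model = distill_stronger_model(model)
    return model


if __name__ == "__main__":
    print(build_up("weak but fully inspectable model"))
```

The recursive machinery around this gate varies between proposals; the disputed step is the gate itself.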
I don’t deny that there are a lot of substantive and fascinating details to many of these proposals, and that if this is possible we might indeed solve the alignment problem, but I think it is a large step that, from some initial perspectives, sounds kind of magical. And don’t forget that at the same time we have to be able to combine it all in a way that is competitive and corrigible and aligned.
I feel like it’s one reasonable position to call such proposals non-starters until a possibility proof is shown, and instead work on basic theory that will eventually be able to give more plausible basic building blocks for designing an intelligent system. I feel confident that certain sorts of basic theories are definitely there to be discovered, that there are strong intuitions about where to look, that they haven’t been worked on much, and that there is low-hanging fruit to be plucked. I think Jessica Taylor wrote about a similar intuition when explaining why she moved away from ML to do basic theory work.
I agree that deciding to work on basic theory is a pretty reasonable research direction—but that doesn’t imply that other proposals can’t possibly work. Thinking that a research direction is less likely to mitigate existential risk than another is different from thinking that a research direction is entirely a non-starter. The second requires significantly more evidence than the first, and it doesn’t seem to me like the points you referenced cross that bar, though of course that’s a subjective distinction.
Even if available plans do get funded, generating new plan ideas might still be underfunded.