It seems to me that uploading tech would be a solution to AI risk, because a trusted team of uploads running at high speed can stop other AIs from arising and figure out the next steps. The first stage assistants proposed by Paul’s plan already require tech that’s pretty close to uploading tech, and will be very useful for developing uploading tech even without the later recursive stages. So the window of usefulness for the first stage seems small, and the window of usefulness for the later recursive stages seems even smaller. Am I missing something?
I don’t think it requires anything like uploading tech. It just involves training a model using RL (or RL + imitation learning), which is something we can do today.
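(To make that concrete, here is a minimal sketch of what “RL + imitation learning” could look like: pretrain a small policy on demonstrations from an overseer H, then fine-tune it with a simple policy-gradient step on a reward signal. The toy dimensions, the random stand-in data, and the reward function are all illustrative assumptions, not the actual training setup being proposed.)

```python
import torch
import torch.nn as nn

# Toy policy: maps an 8-dim observation to logits over 4 actions.
obs_dim, n_actions = 8, 4
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Stage 1: imitation learning on demonstrations from the overseer H
# (random tensors stand in for real (observation, action) pairs).
demo_obs = torch.randn(256, obs_dim)
demo_act = torch.randint(0, n_actions, (256,))
for _ in range(200):
    loss = nn.functional.cross_entropy(policy(demo_obs), demo_act)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Stage 2: RL fine-tuning with REINFORCE against a reward signal,
# standing in here for H's evaluation of the agent's behavior.
def reward(obs, act):
    # Placeholder reward: 1 if the chosen action matches the largest
    # observation component, 0 otherwise.
    return (act == obs.argmax(dim=-1)).float()

for _ in range(200):
    obs = torch.randn(64, obs_dim)
    dist = torch.distributions.Categorical(logits=policy(obs))
    act = dist.sample()
    loss = -(dist.log_prob(act) * reward(obs, act)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```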
I thought your first stage assistants were supposed to be as good as humans at many tasks, including answering questions about their own thoughts. Is that much easier than imitating a specific human?
There is no requirement that the first stage assistant be human-level. I expect it will be superhuman in some respects and subhuman in others, just like existing AI.
From Ajeya Cotra’s post:

The Distill procedure robustly preserves alignment: Given an aligned agent H we can use narrow safe learning techniques to train a much faster agent A which behaves as H would have behaved, without introducing any misaligned optimization or losing important aspects of what H values.
This seems to say every step of IDA, including the first, requires a Distill procedure that’s at least strong enough to upload a human. Maybe I’m looking at the wrong post?
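(For reference, here is a rough schematic of the iterated distillation and amplification loop as I understand it; this is a paraphrase, not code from the post, and distill/amplify are deliberate placeholders.)

```python
# Schematic sketch of the IDA loop (paraphrase, not code from the post).
# distill trains a fast model to approximate its slower overseer using
# "narrow" techniques; amplify gives the human overseer many copies of
# the current agent to delegate subquestions to.

def distill(overseer):
    """Train a fast agent A to approximate the overseer's behavior
    (imitation learning / narrow RL); assumed to preserve alignment."""
    ...

def amplify(human, agent):
    """Return a slower but more capable overseer: the human answering
    questions with the help of many copies of the current agent."""
    ...

def ida(human, n_rounds):
    agent = distill(human)                # the first step discussed above
    for _ in range(n_rounds - 1):
        overseer = amplify(human, agent)  # H assisted by copies of agent
        agent = distill(overseer)         # compress the amplified overseer
    return agent
```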
I agree that “behaves as H would have behaved” seems wrong/sloppy. It’s referring to the “narrow” end of the spectrum introduced in that post (containing imitation learning, narrow RL, narrow IRL). So H is a rough upper bound on its intelligence.
The assumption is “The Distill procedure robustly preserves alignment.” You may think that’s only possible with an exact imitation, in which case I agree that you will never get off the ground.
In Ajeya’s defense, people do often use shorthand like describing an imitation learner’s behavior as “doing what the expert would do” without meaning to imply that it’s a perfect imitation. I agree in this case it’s unusually confusing.
It seems to me that doing that without “losing important aspects of what H values” would lead to something human-like anyway (though maybe not an exact imitation of H), because of complexity of value. Basically after the first step you get human-like entities running on computers. Then they can prevent AI risk and carefully figure out what to do next, same as a team of uploads. So the first step looks strategically similar to uploading, and solving stability for further steps might be unnecessary.
The resulting agent is supposed to be trying to help H get what it wants, but won’t generally encode most of H’s values directly (it will only encode them indirectly as “what the operator wants”).
I agree that Ajeya’s description in that paragraph is problematic (though I think the descriptions in the body of the post were mostly fine); I’ll probably correct it.
Then I’m not sure I understand how the scheme works. If all questions about values are punted to the single living human at the top, won’t that be a bottleneck for any complex plan?