There is no requirement about the first stage assistant being human-level. I expect they will be superhuman in some respects and subhuman in others, just like existing AI.
From Ajeya Cotra’s post:

The Distill procedure robustly preserves alignment: Given an aligned agent H we can use narrow safe learning techniques to train a much faster agent A which behaves as H would have behaved, without introducing any misaligned optimization or losing important aspects of what H values.
This seems to say every step of IDA, including the first, requires a Distill procedure that’s at least strong enough to upload a human. Maybe I’m looking at the wrong post?
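A minimal, purely illustrative sketch of the amplify/distill loop under discussion (the names `Agent`, `HumanWithAssistants`, `distill`, `decompose`, and `combine` are placeholders, not anything from Paul's or Ajeya's writeups). The point at issue above is the first round, where the only overseer available to distill from is the unaided human H:

```python
class Agent:
    """Placeholder for any policy that maps questions to answers."""
    def answer(self, question):
        raise NotImplementedError


class HumanWithAssistants(Agent):
    """H consulting several copies of the current distilled assistant (amplification)."""
    def __init__(self, human, assistants):
        self.human = human
        self.assistants = assistants

    def answer(self, question):
        # H breaks the question into subquestions, delegates them to the
        # assistants, and combines the subanswers into an overall answer.
        subquestions = self.human.decompose(question)
        subanswers = [a.answer(q) for a, q in zip(self.assistants, subquestions)]
        return self.human.combine(question, subanswers)


def ida(human, distill, n_rounds, n_copies=4):
    """Iterate distillation and amplification for n_rounds."""
    # In the first round the overseer is just the unaided human H, so the
    # first distillation is trained only against H's behavior.
    overseer = human
    for _ in range(n_rounds):
        assistant = distill(overseer)  # narrow training assumed to preserve alignment
        overseer = HumanWithAssistants(human, [assistant] * n_copies)
    return overseer
```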
I agree that “behaves as H would have behaved” seems wrong/sloppy. It’s referring to the “narrow” end of the spectrum introduced in that post (containing imitation learning, narrow RL, and narrow IRL), so H is a rough upper bound on its intelligence.
The assumption is “The Distill procedure robustly preserves alignment.” You may think that’s only possible with an exact imitation, in which case I agree that you will never get off the ground.
In Ajeya’s defense, people do often use shorthand like describing an imitation learner’s behavior as “doing what the expert would do” without meaning to imply that it’s a perfect imitation. I agree in this case it’s unusually confusing.
It seems to me that doing that without “losing important aspects of what H values” would lead to something human-like anyway (though maybe not an exact imitation of H), because of complexity of value. Basically after the first step you get human-like entities running on computers. Then they can prevent AI risk and carefully figure out what to do next, same as a team of uploads. So the first step looks strategically similar to uploading, and solving stability for further steps might be unnecessary.
The resulting agent is supposed to be trying to help H get what it wants, but won’t generally encode most of H’s values directly (it will only encode them indirectly, as “what the operator wants”).
I agree that Ajeya’s description in that paragraph is problematic (though I think the descriptions in the body of the post were mostly fine); I’ll probably correct it.
Then I’m not sure I understand how the scheme works. If all questions about values are punted to the single living human at the top, won’t that be a bottleneck for any complex plan?