Thanks for this post.
There are two senses in which control or alignment could be a number-go-up science:
1: Right now, we have a metric that we can optimize to direct our research.
2: At the point where we have systems in front of us that actively pose misalignment risk, we will have a metric that we can optimize.
I think you’re mostly talking about sense 1 here. The property of control that drew us to it is that it satisfies sense 2 much more than alignment research does; its advantage in sense 1 is smaller.
In order for 2 to go as well as possible, our research in the present should fill some combination of two roles:
1. We try to improve our techniques, using a proxy for the methodology we’ll have in the future.
2. We try to improve our methodology, so that we’ll be able to use it well at the point where we can directly assess risk.