A while ago I made a post proposing a new alignment technique. I didn’t get any responses, but the idea still seems reasonable to me, so I’m interested in hearing what others think of it. I think the basic idea of the post, if correct, could be a useful direction for future study. However, I don’t want to waste time pursuing it if the idea is unworkable for a reason I haven’t thought of.
(If you’re interested, please read the post before reading below.)
Of course, the idea isn’t a complete solution to alignment, and things could still go catastrophically wrong due to other problems, like unreliable reasoning. But it still seems to me that it’s potentially helpful for outer alignment and corrigibility.
If the humans actually do directly answer every query about the desirability of an outcome, then it’s hard for me to see how this system could fail to be outer-aligned.
Now, consulting humans on every query makes for a very slow objective function. Most optimization algorithms I know of rely on huge numbers of calls to the objective function, so using them with humans manually implementing that function would be infeasible. However, I don’t see anything impossible in principle about an optimization algorithm that scores well on its objective function even when that function is extremely slow to evaluate. Even if the technique I described in the post for doing this is wrong, I haven’t seen anyone else looking into the problem, so it doesn’t seem clearly unworkable to me.
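This isn’t the method from my post, but to give a sense of what “optimizing with very few objective calls” can look like, here’s a minimal Bayesian-optimization-style sketch in Python. The scikit-learn surrogate, the toy `slow_objective`, and all of the parameter choices are illustrative assumptions only; the toy objective stands in for a human rating of an outcome:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def slow_objective(x):
    """Stand-in for the expensive query (e.g. asking humans to rate an outcome)."""
    return -np.sum((x - 0.3) ** 2)  # toy: 'humans' prefer outcomes near 0.3

def optimize_with_few_queries(objective, dim=2, n_queries=15, n_candidates=2000, seed=0):
    rng = np.random.default_rng(seed)
    # Start with a few random evaluations of the true (slow) objective.
    X = rng.uniform(0, 1, size=(3, dim))
    y = np.array([objective(x) for x in X])
    for _ in range(n_queries - len(X)):
        # Fit a cheap surrogate to the handful of real evaluations so far.
        surrogate = GaussianProcessRegressor().fit(X, y)
        # Cheaply score many candidates with the surrogate (no slow queries here).
        candidates = rng.uniform(0, 1, size=(n_candidates, dim))
        mean, std = surrogate.predict(candidates, return_std=True)
        best = candidates[np.argmax(mean + 1.0 * std)]  # upper-confidence-bound rule
        # Spend exactly one real (slow) query on the most promising candidate.
        X = np.vstack([X, best])
        y = np.append(y, objective(best))
    return X[np.argmax(y)], np.max(y)

best_x, best_y = optimize_with_few_queries(slow_objective)
print(best_x, best_y)
```

The point is just that each loop iteration spends exactly one call on the real objective, with all the heavy search happening against the cheap surrogate, so the total number of slow queries stays small.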
Even if this does turn out to be intractable, I think the basic motivation of my post still has the potential to be useful. The main motivation is for the AI to have a hard-coded procedure of querying humans before making major strategic decisions and updating its beliefs about what is desirable based on their responses (a toy sketch of what I mean is below). That technique could be used in other AI systems as well. It wouldn’t solve everything, of course, but it could provide an additional layer of safety. I’m not sure whether this idea has been discussed before.
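To make the “hard-coded query” idea concrete, here is a toy sketch of an agent that is forced to consult a human before any sufficiently high-impact action and records the answer as its belief about desirability. Everything here (the class name, the impact threshold, the stdin prompt) is hypothetical and only meant to illustrate the shape of the mechanism, not any particular system:

```python
from dataclasses import dataclass, field

@dataclass
class HumanGatedAgent:
    """Toy sketch: the agent must ask a human before any 'major' action and
    updates a simple desirability table from the answers."""
    desirability: dict = field(default_factory=dict)  # action -> learned score
    impact_threshold: float = 0.5                     # what counts as 'major'

    def ask_human(self, action: str) -> float:
        # In a real system this would be an actual human judgment;
        # here we just read a rating from stdin.
        return float(input(f"Rate desirability of '{action}' in [-1, 1]: "))

    def choose(self, action: str, estimated_impact: float) -> bool:
        if estimated_impact >= self.impact_threshold:
            # Hard-coded rule: major decisions are always checked with a human first.
            rating = self.ask_human(action)
            self.desirability[action] = rating     # update beliefs from the response
            return rating > 0
        # Minor actions fall back on whatever has been learned so far.
        return self.desirability.get(action, 0.0) > 0

agent = HumanGatedAgent()
if agent.choose("deploy new strategy", estimated_impact=0.9):
    print("proceeding")
else:
    print("holding off")
```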
I also have yet to find anything seriously problematic about the method I provided in the post for optimizing with a limited number of calls to the objective function. There could, of course, be problems I haven’t thought of.