It’s unfortunate that Ajeya’s article doesn’t mention Paul’s conception of corrigibility, which is really central to understanding how his scheme is supposed to achieve alignment. In short, instead of having what we normally think of as values or a utility function, each of the A[n] is doing something like “trying to be helpful to the user and keeping the user in control”, and this corrigibility is hopefully learned from the human Overseer and kept intact (and self-correcting) through the iterated Amplify-Distill process. For Paul’s own explanations, see this post and section 5 of this post.
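(To make the shape of that loop concrete, here is a minimal runnable sketch of the iterated Amplify-Distill process as I understand it. All names below are my own illustrative stand-ins rather than Paul’s notation, and the Amplify and Distill steps are stubbed with trivial placeholders; the real versions of those steps are exactly the open research problems.)

```python
# Illustrative sketch only: Amplify and Distill are stubbed with trivial
# stand-ins so the loop structure is runnable; the real steps are the hard part.
from __future__ import annotations
from typing import Callable

Agent = Callable[[str], str]  # an agent maps a question to an answer


def human_overseer(question: str) -> str:
    """H: assumed to be genuinely trying to be helpful and keep the user in control."""
    return f"[H's best corrigible answer to: {question}]"


def amplify(overseer: Agent, assistant: Agent) -> Agent:
    """Amplify(H, A[n]): the overseer answers with help from copies of A[n],
    e.g. by delegating sub-questions to it (decomposition is stubbed here)."""
    def amplified(question: str) -> str:
        sub_answers = [assistant(q) for q in (f"sub-question of: {question}",)]
        return overseer(f"{question}, given {sub_answers}")
    return amplified


def distill(target: Agent) -> Agent:
    """Distill: train a fast learner to imitate `target`. Stubbed as the identity;
    in reality this is an ML training step that must preserve corrigibility,
    not just reproduce the target's outputs."""
    return target


def iterated_amplify_distill(H: Agent, n_rounds: int) -> Agent:
    A = distill(H)                    # A[0]: learned directly from the human overseer
    for _ in range(n_rounds):
        A = distill(amplify(H, A))    # A[n+1] = Distill(Amplify(H, A[n]))
    return A


print(iterated_amplify_distill(human_overseer, 2)("How should I invest my savings?"))
```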
(Of course this depends on H trying to be corrigible in the first place (as opposed to trying to maximize their own values, or trying to infer the user’s values and maximizing those without keeping the user in control). So if H is a religious fanatic then this is not going to work.)
The main motivation here is (as I understand it) that learning corrigibility may be easier and more tolerant of errors than learning values. So for example, whereas an AI that learns slightly wrong values may be motivated to manipulate H into accepting those wrong values or to prevent itself from being turned off, intuitively it seems like it would take bigger errors in learning corrigibility for those things to happen. (This may well be a mirage; when people look more deeply into Paul’s idea of corrigibility maybe we’ll realize that learning it is actually as hard and error-sensitive as learning values, sort of like how AI alignment through value learning perhaps didn’t seem that hard at first glance either.) Again see Paul’s posts linked above for his own views on this.
Ok. I have all sorts of reasons to be doubtful of the strong form of corrigibility (mainly based on my repeated failure to get it to work, for reasons that seemed to be fundamental problems); in particular, it doesn’t solve the inconsistency of human values.
But I’ll look at those posts and write something more formal.
I see many possible concerns with the way I try to invoke corrigibility, but I don’t immediately see how inconsistent values in particular are a problem (I suspect there is some miscommunication here).
The argument would be that the corrigibility problem is not well defined because of that inconsistency. But I’ll look at the other linked posts (do you have any others you’d suggest reading?) and get back to you.
I’m looking at corrigibility here: https://www.lesswrong.com/posts/T5ZyNq3fzN59aQG5y/the-limits-of-corrigibility