Suppose my AI has a conversation with the purpose of convincing me to value X. If you were to ask me “Hey Paul, do you want your AI to choose actions to try to cause you to value X, unbeknownst to you?” I’d say “No.” It doesn’t really matter whether the conversation is pleasant.
What if the AI says “Hey Paul, I think it’s a really good idea to talk about whether you should value X, because I think you currently don’t value X but according to my best understanding of moral philosophy, there’s a high probability you actually should value X, and this will be relevant in the near future. Would you like to schedule some time to let me lay out the most important arguments for and against X so you can decide if you want to change your mind?”
Suppose the AI is fully honest here and really doing what it says it’s doing, but its understanding of moral philosophy is biased in some way (let’s say it over-values the importance of certain types of arguments and under-values other types of arguments), and its attempt to optimize the presentation of arguments for understandability to the user has the side effect of making them very convincing. It seems to me that the user could ask a bunch of questions, which the AI honestly answers, without detecting anything wrong, and end up being convinced of a wrong X. (It seems that similar things could happen if X was a belief about facts or an action/strategy instead of a value.)
Stuart had asked, corrigibility to whom? Do we define corrigibility as being corrigible to the original user (i.e., the user with their beliefs/values at time t_0), or to the current user? If we define it with regard to the original user, it seems that corrigibility does not have a basin of attraction. (If the user is convinced of a wrong X by the AI, there’s no force on the AI pushing back to the original belief.) If we define it with regard to the current user, “basin of attraction” may be true but is not as useful a property as it might intuitively seem, because the basin itself is now a moving target which can be pushed around by the AI.
Do we define corrigibility as being corrigible to the original user (i.e., the user with their beliefs/values at time t_0), or to the current user?
Definitely the current user; as you say, that’s the only way to have a basin of attraction.
If we define it with regard to the current user, “basin of attraction” may be true but is not as useful a property as it might intuitively seem, because the basin itself is now a moving target which can be pushed around by the AI.
Yes, it can be pushed around by the user or the AI or any other process in the world. The goal is to have it push around the user’s values in the way that the user wants, so that we aren’t at a disadvantage relative to normal reflection. (We might separately be at a disadvantage if the AI is relatively better at some kinds of thinking than others. That also seems like it ought to be addressed by separate work.)
It seems to me that the user could ask a bunch of questions, which the AI honestly answers, without detecting anything wrong, and end up being convinced of a wrong X. (It seems that similar things could happen if X was a belief about facts or an action/strategy instead of a value.)
Yes, if we need to answer a moral question in the short term, then we may get the wrong answer, whether we deliberate on our own or the AI helps us. My goal is to have the AI try to help us in the way we want to be helped; I am not currently holding out hope for the kind of AI that eliminates the possibility of moral error.
Of course our AI can also follow along with this kind of reasoning and therefore be conservative about making irreversible commitments or risking value drift, just as we would be. But if you postulate a situation that requires making a hasty moral judgment, I don’t think you can avoid the risk of error.
The goal is to have it push around the user’s values in the way that the user wants, so that we aren’t at a disadvantage relative to normal reflection.
My concern here is that even small errors in this area (i.e., in the AI’s understanding of how the user wants their values to be pushed around) could snowball into large amounts of value drift, and no obvious “basin of attraction” protects against this even if the AI is corrigible.
Another concern is that the user may have only a vague idea (or no idea at all) of how they want their values to be pushed around, so the choice of how to push the user’s values around is largely determined by what the AI “wants” to do (i.e., tends to do in such cases). And this may end up being very different from where the user would end up by using “normal reflection”.
I guess my point is that there are open questions about how to protect against value drift caused by AI, what the AI should do when the user doesn’t have much idea of how they want their values to be pushed around, and how to get the AI to competently help the user with moral questions, which seem to be orthogonal to how to make the AI corrigible. I think you don’t necessarily disagree but just see these as lower priority problems than corrigibility? Without arguing about that, perhaps we can agree that listing these explicitly at least makes it clearer what problems corrigibility by itself can and can’t solve?
Of course our AI can also follow along with this kind of reasoning and therefore be conservative about making irreversible commitments or risking value drift, just as we would be.
If other areas of intellectual development are progressing at a very fast pace, I’m not sure being conservative about values would work out well.
We might separately be at a disadvantage if the AI is relatively better at some kinds of thinking than others. That also seems like it ought to be addressed by separate work.
This seems fine, as long as people who need to make strategic decisions about AI safety are aware of this, and whatever separate work that needs to be done is compatible with your basic approach.
I don’t think you can avoid the risk of error.
You say this a couple of times, seeming to imply that I’m asking for something unrealistic. I just want an AI that’s as competent in value learning/morality/philosophy as in science/technology/persuasion/etc. (or ideally more competent in the former to give a bigger safety margin), which unlike “eliminates the possibility of moral error” does not seem like asking for too much.
I guess my point is that there are open questions about how to protect against value drift caused by AI, what the AI should do when the user doesn’t have much idea of how they want their values to be pushed around, and how to get the AI to competently help the user with moral questions, which seem to be orthogonal to how to make the AI corrigible. I think you don’t necessarily disagree but just see these as lower priority problems than corrigibility? Without arguing about that, perhaps we can agree that listing these explicitly at least makes it clearer what problems corrigibility by itself can and can’t solve?
I agree with all of this. Yes, I see these other problems as (significantly) lower priority problems than alignment/corrigibility. But I do agree that it’s worth listing those problems explicitly.
My current guess is that the most serious non-alignment AI problems are:
1. AI will enable access to destructive physical technologies (without corresponding improvements in coordination).
2. AI will enable access to more AI, not covered by existing alignment techniques (without corresponding speedups in alignment).
These are both related to the more general problem: “Relative to humans, AI might be even better at tasks with rapid feedback relative to tasks without rapid feedback.” Moral/philosophical competence is also related to that general problem.
I typically list this more general problem prominently (as opposed to all of the other particular problems possibly posed by AI), because I think it’s especially important. (I may also be influenced by the fact that iterated amplification and debate also seem like good approaches to this problem.)
This seems fine, as long as people who need to make strategic decisions about AI safety are aware of this, and whatever separate work that needs to be done is compatible with your basic approach.
I agree with this.
(I expect we disagree about practical recommendations, because we disagree about the magnitude of different problems.)
open questions about how to protect against value drift caused by AI
Do you see this problem as much different / more serious than value drift caused by other technology? (E.g. by changing how we interact with each other?)
I typically list this more general problem prominently (as opposed to all of the other particular problems possibly posed by AI), because I think it’s especially important.
Have you written about this in a post or paper somewhere? (I’m thinking of writing a post about this and related topics and would like to read and build upon existing literature.)
Do you see this problem as much different / more serious than value drift caused by other technology? (E.g. by changing how we interact with each other?)
What other technology are you thinking of, that might have an effect comparable to AI? As far as how we interact with each other, it seems likely that once superintelligent AIs come into existence, all or most interactions between humans will be mediated through AIs, which surely will have a much greater effect than any other change in communications technology?
Have you written about this in a post or paper somewhere? (I’m thinking of writing a post about this and related topics and would like to read and build upon existing literature.)
Not usefully. If I had to link to something on it, I might link to the Ought mission page, but I don’t have any substantive analysis to point to.
As far as how we interact with each other, it seems likely that once superintelligent AIs come into existence, all or most interactions between humans will be mediated through AIs, which surely will have a much greater effect than any other change in communications technology?
I agree with “larger effect than historical changes” but not “larger effect than all changes that we could speculate about” or even “larger effect than all changes between now and when superintelligent AIs come into existence.”
If AI is aligned, then it’s also worth noting that this effect is large but not obviously unusually disruptive, since e.g. the AI is trying to think about how to minimize it (though it may be doing that imperfectly).
As a random example, it seems plausible to me that changes to the way society is organized—what kinds of jobs people do, compulsory schooling, weaker connections to family, lower religiosity—over the last few centuries have had a larger unendorsed impact on values than AI will. I don’t see any principled reason to expect those changes to be positive while the changes from AI are negative; it seems like in expectation both of them would be positive but for the opportunity cost effect (where today we have the option to let our values and views change in whatever way we most endorse, and we foreclose this option when we let our values drift anything less than maximally-reflectively).