What does it mean for human values to be vulnerable to adversarial examples? When we say this about AI systems (e.g. image classifiers), I think it’s either because their judgments on manipulated situations/images are misaligned with ours (i.e. humans’), or perhaps because they get the “ground truth” wrong. But how can a value system be misaligned with itself or different from the ground truth? For alignment purposes, isn’t it itself the ground truth? It could of course fail to match “objective morality” if you believe in that, but in that case we should probably be trying to make our AI align with that and not with someone’s human values.
I could (easily) imagine that my values are inconsistent, conflicting, and ever-changing, but these seem like different issues.
It also seems like you have a value that says something to the effect of “it’s wrong to corrupt people’s values (in certain circumstances)”. Then wouldn’t an AI that’s aligned with your values share this value, and not do this intentionally? And as for unintentionally: it seems that you have thought of this problem, and an ASI would presumably be much smarter than you, so wouldn’t it think of it too, and try hard to avoid it? [My reasoning here sounds a bit naive or “too simple” to me, but I’m not sure it’s wrong.]
I could understand that there might be issues with value learning AIs that imperfectly learn something close to a human’s value function, which may be vulnerable to adversarial examples, but this again seems like a different issue.
> What does it mean for human values to be vulnerable to adversarial examples?
I’m not sure how to think about this formally, but intuitively, our value functions probably only “make sense” in a small region of possibility space, and just start behaving randomly outside of that. It doesn’t seem right to treat that random behavior as someone’s “real values” and try to maximize that.
> It also seems like you have a value that says something to the effect of “it’s wrong to corrupt people’s values (in certain circumstances)”. Then wouldn’t an AI that’s aligned with your values share this value, and not do this intentionally?
I wouldn’t want to corrupt the values of people who share roughly the same moral and philosophical outlook as myself, but if someone already has values that are very likely to be wrong (e.g., they just want to maximize the complexity of the universe, or how technologically advanced we are, or the glory of their god) I might be ok with trying to manipulate their values, especially if they’re trying to do the same thing to me. The problem is that it’s much easier for them to defend their values. Since they don’t think they need further moral development, they can just tell their AI to block any outside messages that might cause any changes to their values, but I can’t do that.
> And as for unintentionally: it seems that you have thought of this problem, and an ASI would presumably be much smarter than you, so wouldn’t it think of it too, and try hard to avoid it?
Other people may not think of the problem, or may not be as concerned about it as I am, and in some alignment schemes their AI would share their level of concern and not try very hard to avoid this problem. I don’t want to see their values corrupted this way. Even for myself, if AIs overall are accelerating technological development faster than moral/philosophical progress, it’s unclear how I can avoid this problem even with the assistance of an aligned AI. The AI may be faced with many choices that it doesn’t know how to answer directly, and it also doesn’t know how to ask me for help without risking corrupting me. If the AI is conservative it might be paralyzed with indecision or be forced to make a lot of suboptimal decisions that seem “safe”, and if it’s not conservative enough it might corrupt me even though it’s trying hard not to.
(I probably should have explained more in the OP, so I’m glad you’re asking these questions.)
> our value functions probably only “make sense” in a small region of possibility space, and just start behaving randomly outside of that.
Okay, that helps me understand what you’re talking about a bit better. It sounds like the concept of a partial function, and in the ML realm like the notorious brittleness that makes systems incapable of generalizing or extrapolating outside of a limited training set. I understand why you’re approaching this from the adversarial angle though, because I suppose you’re concerned about the AI just bringing about some state that’s outside the domain of definition which just happens to yield a high “random” score.
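To make that concern concrete, here is a toy sketch in Python (my own illustration, not something from the OP). Everything in it is an assumption made for the sake of the example: the one-dimensional “possibility space”, the `FAMILIAR` region, the `considered_value` curve, and the sine term standing in for arbitrary out-of-domain behavior. It compares an optimizer that searches the whole space with one that stays inside the region where the value function is meaningful.

```python
import math

# Hypothetical region of possibility space where the values were actually shaped.
FAMILIAR = (0.0, 1.0)


def in_domain(x):
    """True if x lies inside the familiar region."""
    return FAMILIAR[0] <= x <= FAMILIAR[1]


def considered_value(x):
    """Assumed 'sensible' preference; only meaningful inside FAMILIAR (peak at 0.5)."""
    return 1.0 - 4.0 * (x - 0.5) ** 2


def raw_value(x):
    """What the value function outputs everywhere, including on alien inputs."""
    if in_domain(x):
        return considered_value(x)
    # Stand-in for arbitrary extrapolation: deterministic but meaningless.
    return 5.0 * math.sin(1000.0 * x)


def maximize(score, candidates):
    """Pick the candidate with the highest score."""
    return max(candidates, key=score)


# Candidate states on a grid from -10 to 10, most of them outside FAMILIAR.
candidates = [k / 100 for k in range(-1000, 1001)]

naive_choice = maximize(raw_value, candidates)
careful_choice = maximize(raw_value, [x for x in candidates if in_domain(x)])

print("unrestricted search picks:", naive_choice, "in domain:", in_domain(naive_choice))
print("restricted search picks:  ", careful_choice)
```

In this toy setup the unrestricted search lands outside the familiar region purely because the meaningless extrapolation happens to score higher there, while the restricted search picks the considered optimum at 0.5.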
> It doesn’t seem right to treat that random behavior as someone’s “real values” and try to maximize that.
Upon first reading, I kind of agreed, so I definitely understand this intuition. “Random” behavior certainly doesn’t sound great, and “arbitrary” or “undefined” isn’t much better. But upon further reflection I’m not so sure.
First of all, what does it mean for a value system to behave randomly/arbitrarily, and is it ever not arbitrary? Arbitrary to me means that there is no reason for something, which sounds a lot like a terminal value to me. If you morally justify having a terminal value X because of reason Y, then X is instrumental to the real terminal value Y.
Secondly, I question whether my value system really is like some kind of partial function that yields random outcomes outside the domain of definition. If you asked me for a (relative) value judgment about two situations that are completely alien to me, then I would imagine being indifferent about their ordering: not ordering them randomly. It’s possible that I could be persuaded to order one over the other, but then that seems more about changing my beliefs/knowledge and understanding (the “is” domain) than it is about changing my values (the “ought” domain). This may happen in less alien situations too: should we invest in education or healthcare? I don’t know, but that’s primarily because I can’t predict the actual outcomes in terms of things I care about.
Finally, even if a value system was to order two alien situations randomly, how can we say it’s wrong? Clearly it wouldn’t be wrong according to / compared with that value system, right? And how else are you going to judge whether something is right or wrong, better or worse?
I feel like these questions lead deeply into philosophical territory that I’m not particularly familiar with, but I hope it’s useful (rather than nitpicky) to ask these things, because if the intuition that “random is wrong” is itself wrong, then perhaps there’s no actual problem we need to pay extra attention to. I also think that some of my questions here can be answered by pointing out that someone’s values may be inconsistent / conflicting. But then that seems to be the problem that needs to be solved.
---
I would like to acknowledge the rest of your comment without responding to it in depth. I think I have personally spent relatively little time thinking about the complexities of multipolar scenarios (which is likely in part because I haven’t stumbled upon as much reading material about it, which may reflect on the AI safety community), so I don’t have much to add on this. My previous comment was aimed almost exclusively at your first point (in my mind), because the issue of what value systems are like and what an ASI that’s aligned with your values might (unintentionally) do wrong seems somewhat separate from the issue of defending against competing ASIs doing bad things to you or others.
I acknowledge that having simpler and constant values may be a competitive advantage, and that it may be difficult to transfer the nuances of when you think it’s okay to manipulate/corrupt someone’s values into an ASI. I’m less concerned about other people not thinking of the corruption problem (since their ASIs are presumably smarter), and if they simply don’t care (and their aligned ASIs don’t either), then this seems like a classic case of AI that’s misaligned with your values. Unless you want to turn this hypothetical multipolar scenario into a singleton with your ASI at the top, it seems inevitable that some things are going to happen that you don’t like.
I also acknowledge that your ASI may in some sense behave suboptimally if it’s overly conservative or cautious. If a choice must be made between alien situations, then it may certainly seem prudent to defer judgment until more information can be gathered, but this is again a knowledge issue rather than a values issue. The values system should then help determine a trade-off between the present uncertainty about the alternatives and the utility of spending more time to gather information (presumably getting outcompeted while you do nothing ranks as “bad” according to most value systems). This can certainly go wrong, but again that seems like more of a knowledge issue (although I acknowledge some value systems may have a competitive advantage over others;
> First of all, what does it mean for a value system to behave randomly/arbitrarily, and is it ever not arbitrary?
Again, I don’t have a definitive answer, but we do have some intuitions about which values are more and less arbitrary. For example, values about familiar situations that you learned as a child and values that have deep philosophical justifications (for example, valuing positive conscious experiences, if we ever solve the problem of consciousness and start to understand the valence of qualia) seem less arbitrary than values that are caused by cosmic rays that hit your brain in the past. Values that are the result of random extrapolations seem closer to the latter than the former.
> Secondly, I question whether my value system really is like some kind of partial function that yields random outcomes outside the domain of definition. If you asked me for a (relative) value judgment about two situations that are completely alien to me, then I would imagine being indifferent about their ordering: not ordering them randomly.
Thinking this over, I guess what’s happening here is that our values don’t apply directly to physical reality, but instead to high-level mental models. So if a situation is too alien, our model building breaks down completely and we can’t evaluate the situation at all.
(This suggests that adversarial examples are likely also an issue for the modules that make up our model-building machinery. For example, a lot of ineffective charities might essentially be adversarial examples against the part of our brain that evaluates how much our actions are helping others.)
> Finally, even if a value system was to order two alien situations randomly, how can we say it’s wrong? Clearly it wouldn’t be wrong according to / compared with that value system, right? And how else are you going to judge whether something is right or wrong, better or worse?
We can use philosophical reasoning, for example to try to determine if there is a right way to extrapolate from the parts of our values that seem to make more sense or are less arbitrary, or to try to determine if “objective morality” exists and if so what it says about the alien situations.
> and if they simply don’t care (and their aligned ASIs don’t either), then this seems like a classic case of AI that’s misaligned with your values.
Not caring about value corruption is likely an error. If I can help ensure that their aligned AI helps them prevent or correct this error, I don’t see why that’s not a win-win.
Thanks for your reply!