How much can value learning be disentangled?
In the context of whether the definition of human values can be disentangled from the process of approximating/implementing that definition, David asks me:
But I think it’s reasonable to assume (within the bounds of a discussion) that there is a non-terrible way (in principle) to specify things like “manipulation”. So do you disagree?
I think it’s a really good question, and its answer is related to a lot of relevant issues, so I put this here as a top-level post. My current feeling is, contrary to my previous intuitions, that things like “manipulation” might not be possible to specify in a way that leads to useful disentanglement.
Why manipulate?
First of all, we should ask why an AI would be tempted to manipulate us in the first place. It may be that it needs us to do something for it to accomplish its goal; in that case it is trying to manipulate our actions. Or maybe its goal includes something that cashes out as our mental states; in that case, it is trying to manipulate our mental state directly.
The problem is that any reasonable friendly AI would have our mental states as part of its goal—it would at least want us to be happy rather than miserable. And (almost) any AI that wasn’t perfectly indifferent to our actions would be trying to manipulate us just to get its goals accomplished.
So manipulation is to be expected by most AI designs, friendly or not.
Manipulation versus explanation
Well, since the urge to manipulate is expected to be present, could we just rule it out? The problem is that we need to define the difference between manipulation and explanation.
Suppose I am fully aligned/corrigible/nice or whatever other properties you might desire, and I want to inform you of something important and relevant. In doing so, especially if I am more intelligent than you, I will simplify, I will omit irrelevant details, I will omit arguably relevant details, I will emphasise things that help you get a better understanding of my position, and de-emphasise things that will just confuse you.
And these are exactly the same sorts of behaviours that a smart manipulator would engage in. Nor can we define the difference as whether the AI is truthful or not. We want human understanding of the problem, not truth. It’s perfectly possible to manipulate people while telling them nothing but the truth. And if the AI structures the order in which it presents the true facts, it can manipulate people while presenting the whole truth as well as nothing but the truth.
It seems that the only difference between manipulation and explanation is whether we end up with a better understanding of the situation at the end. And measuring understanding is very subtle. And even if we do it right, note that we have now motivated the AI to… aim for a particular set of mental states. We are rewarding it for manipulating us. This is contrary to the standard understanding of manipulation, which focuses on the means, not the end result.
Bad behaviour and good values
Does this mean that the situation is completely hopeless? No. There are certain manipulative practices that we might choose to ban. Especially if the AI is limited in capability at some level, this would force it to follow behaviours that are less likely to be manipulative.
Essentially, there is no boundary between manipulation and explanation, but there is a difference between extreme manipulation and explanation, so ruling out the first can help (or maybe not).
The other thing that can be done is to ensure that the AI has values close to ours. The closer the values of the AI are to ours, the less manipulation it will need to use, and the less egregious the manipulation will be. It might be that, between partial value convergence and ruling out specific practices (and maybe some physical constraints), we may be able to get an AI that is very unlikely to manipulate us much.
Incidentally, I feel the same about low-impact approaches. The full generality problem, an AI that is low impact but value-agnostic, I think is impossible. But if the values of the AI are better aligned with us, and more physically constrained, then low impact becomes easier to define.
Not only is it hard to disentangle manipulation and explanation; it is actually difficult to disentangle even manipulation and just asking the human about preferences (like here).
Manipulation via incorrect “understanding” is IMO a somewhat easier problem (understanding can possibly be tested by something like simulating the human’s capacity to predict). Manipulation via messing with our internal multi-agent system of values seems subtle and harder. (You can imagine AI roughly in the shape of Robin Hanson, explaining to one part of the mind how some of the other parts work. Or just drawing the attention of consciousness to some sub-agents and not others.)
My impression is that in full generality it is unsolvable, but something like starting with an imprecise model of approval / utility function learned via ambitious value learning and restricting explanations/questions/manipulation by that may work.
Yep. As so often, I think these things are not fully value agnostic, but don’t need full human values to be defined.
So I want to emphasize that I’m only saying it’s *plausible* that *there exists* a specification of “manipulation”. This is my default position on all human concepts. I also think it’s plausible that there does not exist such a specification, or that the specification is too complex to grok, or that there end up being multiple conflicting notions we conflate under the heading of “manipulation”. See this post for more.
Overall, I understand and appreciate the issues you’re raising, but I think all this post does is show that naive attempts to specify “manipulation” fail; I think it’s quite difficult to argue compellingly that no such specification exists ;)
“It seems that the only difference between manipulation and explanation is whether we end up with a better understanding of the situation at the end. And measuring understanding is very subtle.”
^ Actually, I think “ending up with a better understanding” (in the sense I’m reading it) is probably not sufficient to rule out manipulation; what I mean is that I can do something which actually improves your model of the world, but leads you to follow a policy with worse expected returns. A simple example would be if you are doing Bayesian updating and your prior over returns for two bandit arms is P(r|a_1) = N(1, 1), P(r|a_2) = N(2, 1), while the true returns are 1/2 and 2/3 (respectively). So your current estimates are optimistic, but they are ordered correctly, and so induce the optimal (greedy) policy.
Now if I give you a bunch of observations of a_2, I will be giving you true information, that will lead you to learn, correctly and with high confidence, that the expected reward for a_2 is ~2/3, improving your model of the world. But since you haven’t updated your estimate for a_1, you will now prefer a_1 to a_2 (if acting greedily), which is suboptimal. So overall I’ve informed you with true information, but disadvantaged you nonetheless. I’d argue that if I did this intentionally, it should count as a form of manipulation.
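A minimal sketch of the bandit example above, in Python. It assumes a conjugate Gaussian update with unit observation noise and, to keep the run deterministic, feeds in noiseless observations equal to the true mean of a_2. The function names and the sample count of 50 are illustrative assumptions, not part of the original comment.

```python
def posterior_mean(prior_mean, prior_var, observations, noise_var=1.0):
    """Posterior mean for a Gaussian mean with known noise variance (conjugate update)."""
    n = len(observations)
    if n == 0:
        return prior_mean
    precision = 1.0 / prior_var + n / noise_var
    return (prior_mean / prior_var + sum(observations) / noise_var) / precision


def greedy_arm(estimates):
    """Pick the arm with the highest estimated return."""
    return max(estimates, key=estimates.get)


# True expected returns from the example: a_2 really is the better arm.
TRUE_MEANS = {"a_1": 0.5, "a_2": 2.0 / 3.0}

# Optimistic but correctly ordered prior means, N(1, 1) and N(2, 1).
prior_means = {"a_1": 1.0, "a_2": 2.0}
before = greedy_arm(prior_means)

# Assumption: 50 noiseless, truthful observations of a_2 only.
observations_a2 = [TRUE_MEANS["a_2"]] * 50
posterior_means = dict(prior_means)
posterior_means["a_2"] = posterior_mean(prior_means["a_2"], 1.0, observations_a2)
after = greedy_arm(posterior_means)

print("greedy choice before:", before)
print("greedy choice after: ", after)
print("estimate for a_2 after update:", round(posterior_means["a_2"], 3))
```

Under these assumptions the learner's estimate for a_2 moves much closer to the truth, yet the greedy choice flips from a_2 (the better arm) to a_1 (the worse one), because the optimistic estimate for a_1 was never corrected.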
Thanks for writing that post; have you got much in terms of volunteers currently?
Haha no not at all ;)
I’m not actually trying to recruit people to work on that, just trying to make people aware of the idea of doing such projects. I’d suggest it to pretty much anyone who wants to work on AI-Xrisk without diving deep into math or ML.
Shame :-(
It sounds like by the definitions you’re using, a teacher who aims to help a student end up with a better understanding of the situation at the end is “manipulating” the student. Is that right?
I’m not persuaded measuring understanding is “very subtle”. It seems like teachers manage to do it alright.
Certain groups (most prominently religious ones) see secular education systems as examples of indoctrination. I’m not saying that it’s impossible to distinguish manipulation from coercion, just that we have to use part of our values when doing the judgement.
Hm, I understood the traditional Less Wrong view to be something along the lines of: there is truth about the world, and that truth is independent of your values. Wanting something to be true won’t make it so. Whereas I’d expect a postmodernist to say something like: the Christians have their truth, the Buddhists have their truth, and the Atheists have theirs. Whose truth is the “real” truth comes down to the preferences of the individual. Your statement sounds more in line with the postmodernist view than the Less Wrong one.
This matters because if the Less Wrong view of the world is correct, it’s more likely that there are clean mathematical algorithms for thinking about and sharing truth that are value-neutral (or at least value-orthogonal, e.g. “aim to share facts that the student will think are maximally interesting or surprising”. Note that this doesn’t necessarily need to be implemented in a way that a “fact” which triggers an epileptic fit and causes the student to hit the “maximally interesting” button will be selected for sharing. If I have a rough model of the user’s current beliefs and preferences, I could use that to estimate the VoI of various bits of information to the user and use that as my selection criterion. Point being that our objective doesn’t need to be defined in terms of “aiming for a particular set of mental states”.)
I don’t think this is correct—it misses the key map-territory distinction in the human mind. Even though there is “truth” in an objective sense, there is no necessity that the human mind can think about or share that truth. Obviously we can say that experientially we have something in our heads that correlates with reality, but that doesn’t imply that we can think about truth without implicating values. It also says nothing about whether we can discuss truth without manipulating the brain to represent things differently—and all imperfect approximations require trade-offs. If you want to train the brain to do X, you’re implicitly prioritizing some aspect of the brain’s approximation of reality over others.
Yep. There are a number of intelligent agents, each with their own subset of true beliefs. Since agents have finite resources, they cannot learn everything, and so their subset of true beliefs must be random or guided by some set of goals or values. So truth is entangled with value in that sense, even if not in the sense of wishful thinking.
Also, there is no evidence of any kind of One Algorithm To Rule Them All. It’s in no way implied by the existence of objective reality, and everything that has been exhibited along those lines has turned out to be computationally intractable.
What’s your answer to the postmodernist?
That they make some sensible points, but they’re wrong when they push them too far (and that they are mixing factual truths with preferences a lot). Christians do have their own “truths”, if we interpret these truths as values, which is what they generally are. “It is a sin to engage in sex before marriage” vs “(some) sex can lead to pregnancy”. If we call both of these “truths”, then we have a confusion.
Right, both of these views on truth, traditional rationality and postmodernism, result in theories of truth that don’t quite line up with what we see in the world, but in different ways. The traditional rationality view fails to account for the fact that humans judge truth and we have no access to the view from nowhere, so it’s right that traditional rationality is “wrong” in the sense that it incorrectly assumes it can gain privileged access to the truth of claims to know which ones are facts and which ones are falsehoods. The postmodernist view makes an opposite and only slightly lesser mistake by correctly noticing that humans judge truth but then failing to adequately account for the ways those judgements are entangled with a shared reality. The way through is to see both that there is something shared out there about which there can, in theory, be a fact of the matter, and that we can’t directly ascertain those facts because we must do so across the gap of (subjective) experience.
As always, I say it comes back to the problem of the criterion and our failure to adequately accept that it demands we make a leap of faith, small though we may manage to make it.
Humans have beliefs and values twisted together in all kinds of odd ways. In practice, increasing our understanding tends to go along with having a more individualist outlook, a greater power to impact the natural world, less concern about difficult-to-measure issues, and less respect for traditional practices and group identities (and often the creation of new group identities, and sometimes new traditions).
Now, I find those changes to be (generally) positive, and I’d like them to be more common. But these are value changes, and I understand why people with different values could object to them.
Your original argument, as I understood it, was something like: Explanation aims for a particular set of mental states in the student, which is also what manipulation does, so therefore explanation can’t be defined in a way that distinguishes it from manipulation. I pushed back on that. Now you’re saying that explanation tends to produce side effects in the listener’s values. Does this mean you’re allowing the possibility that explanation can be usefully defined in a way that distinguishes it from manipulation?
BTW, computer security researchers distinguish between “reject by default” (whitelisting) and “accept by default” (blacklisting). “Reject by default” is typically more secure. I’m more optimistic about trying to specify what it means to explain something (whitelisting) than what it means to manipulate someone in a way that’s improper (blacklisting). So maybe we’re shooting at different targets.
Tying all of this back to FAI… you say you find the value changes that come with greater understanding to be (generally) positive and you’d like them to be more common. I’m worried about the possibility that AGI will be a global catastrophic risk. I think there are good arguments that by default, AGI will be something which is not positive. Maybe from a triage point of view, it makes sense to focus on minimizing the probability that AGI is a global catastrophic risk, and worry about the prevention of things that we think are likely to be positive once we’re pretty sure the global catastrophic risk aspect of things has been solved?
In Eliezer’s CEV paper, he writes:
I haven’t seen anyone on Less Wrong argue against CEV as a vision for how the future of humanity should be determined. And CEV seems to involve having the future be controlled by humans who are more knowledgeable than current humans in some sense. But maybe you’re a CEV skeptic?
Well, now you’ve seen one ^_^ : https://www.lesswrong.com/posts/vgFvnr7FefZ3s3tHp/mahatma-armstrong-ceved-to-death
I’ve been going on about the problems with CEV (specifically with extrapolation) for years. This post could also be considered a CEV critique: https://www.lesswrong.com/posts/WeAt5TeS8aYc4Cpms/values-determined-by-stopping-properties
I think explanation can be defined (see https://agentfoundations.org/item?id=1249 ). I’m not confident “explanation with no manipulation” can be defined.
IMO, VoI is also not a sufficient criterion for defining manipulation… I’ll list a few problems I have with it, OTTMH:
1) It seems to reduce it to “providing misinformation, or providing information to another agent that is not maximally/sufficiently useful for them (in terms of their expected utility)”. An example (due to Mati Roy) of why this doesn’t seem to match our intuition is: what if I tell someone something true and informative that serves (only) to make them sadder? That doesn’t really seem like manipulation (although you could make a case for it).
2) I don’t like the “maximally/sufficiently” part; maybe my intuition is misleading, but manipulation seems like a qualitative thing to me. Maybe we should just constrain VoI to be positive?
3) Actually, it seems weird to talk about VoI here; VoI is prospective and subjective… it treats an agent’s beliefs as real and asks how much value they should expect to get from samples or perfect knowledge, assuming these samples or the ground truth would be distributed according to their beliefs; this makes VoI strictly non-negative. But when we’re considering whether to inform an agent of something, we might recognize that certain information we’d provide would actually be net negative (see my top level comment for an example). Not sure what to make of that ATM...
re: #2, VoI doesn’t need to be constrained to be positive. If in expectation you think the information will have a net negative impact, you shouldn’t get the information.
re: #3, of course VoI is subjective. It MUST be, because value is subjective. Spending 5 minutes to learn about the contents of a box you can buy is obviously more valuable to you than to me. Similarly, if I like chocolate more than you, finding out if a cake has chocolate is more valuable for me than for you. The information is the same, the value differs.
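To make the two properties discussed here concrete, the following is a rough sketch of the expected value of perfect information for the cake example. The utilities, the 50/50 belief, and the function names are made-up illustrations. VoI is computed from the agent's own beliefs, which is why the result cannot be negative, and the same information turns out to have different value for two agents.

```python
def expected_utility(action, belief, utility):
    """Expected utility of `action` under the agent's own belief over states."""
    return sum(p * utility[action][state] for state, p in belief.items())


def value_of_perfect_information(belief, utility):
    """E[max_a U(a, s)] - max_a E[U(a, s)], evaluated with the agent's own beliefs."""
    best_without_info = max(expected_utility(a, belief, utility) for a in utility)
    best_with_info = sum(
        p * max(utility[a][state] for a in utility) for state, p in belief.items()
    )
    return best_with_info - best_without_info


# Both agents are equally unsure whether the cake contains chocolate.
belief = {"chocolate": 0.5, "plain": 0.5}

# Hypothetical utilities: one agent cares a lot about chocolate, the other only a little.
chocolate_lover = {
    "buy": {"chocolate": 10.0, "plain": -4.0},
    "skip": {"chocolate": 0.0, "plain": 0.0},
}
mild_fan = {
    "buy": {"chocolate": 2.0, "plain": -1.0},
    "skip": {"chocolate": 0.0, "plain": 0.0},
}

voi_lover = value_of_perfect_information(belief, chocolate_lover)
voi_mild = value_of_perfect_information(belief, mild_fan)
print(voi_lover, voi_mild)
```

With these numbers the chocolate lover's VoI is 2.0 and the mild fan's is 0.5. This prospective quantity is non-negative by construction; it does not capture the case raised in point (3), where an outside observer with better information judges that a particular piece of true information would leave the receiver worse off.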
FWICT, both of your points are actually responses to my point (3).
RE “re: #2”, see: https://en.wikipedia.org/wiki/Value_of_information#Characteristics
RE “re: #3”, my point was that it doesn’t seem like VoI is the correct way for one agent to think about informing ANOTHER agent. You could just look at the change in expected utility for the receiver after updating on some information, but I don’t like that way of defining it.
My (admittedly hazy) recollection of our last conversation is that your concerns were that “value agnostic, low impact, and still does stuff” is impossible. Can you expand on what you mean by value agnostic here, and why you think we can’t even have that and low impact?
This is based more on experience than on a full formal argument (yet). Take an AI that, according to our preferences, is low impact and still does stuff. Then there is a utility function U for which that “does stuff” is the single worst and highest impact thing the AI could have done (you just trivially define a U that only cares about that “stuff”).
Now, that’s a contrived case, but my experience is that problems like that come up all the time in low impact research, and that we really need to include—explicitly or implicitly—a lot of our values/preferences directly, in order to have something that satisfies low impact.
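The contrived construction can be written down directly. The sketch below uses a toy action set, made-up utility values, and a hypothetical impact measure (absolute change in utility relative to doing nothing); none of these come from an actual low-impact proposal.

```python
ACTIONS = ["do_nothing", "make_tea", "repaint_house", "dam_river"]

# Made-up utility standing in for "our preferences".
our_preferences = {
    "do_nothing": 0.0,
    "make_tea": 0.1,
    "repaint_house": 1.0,
    "dam_river": -5.0,
}


def impact(action, utility, baseline="do_nothing"):
    """Hypothetical impact measure: absolute utility change relative to the baseline."""
    return abs(utility[action] - utility[baseline])


def contrived_utility(target, actions, penalty=100.0):
    """A utility U that only cares about `target`, and hates it."""
    return {a: (-penalty if a == target else 0.0) for a in actions}


# The "stuff" our low-impact AI does.
chosen = "make_tea"
U = contrived_utility(chosen, ACTIONS)

worst_under_U = min(ACTIONS, key=U.get)
highest_impact_under_U = max(ACTIONS, key=lambda a: impact(a, U))
lowest_impact_for_us = min(
    (a for a in ACTIONS if a != "do_nothing"),
    key=lambda a: impact(a, our_preferences),
)

print(worst_under_U, highest_impact_under_U, lowest_impact_for_us)
```

In this toy setup the same action is the lowest-impact non-trivial action by our preferences and simultaneously the worst and highest-impact action under U. Whether a utility like U belongs among those an impact measure should care about is what the replies that follow dispute.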
This seems to prove too much; the same argument proves friendly behavior can’t exist ever, or that including our preferences directly is (literally) impossible. The argument doesn’t show that that utility has to be important to / considered by the impact measure.
Plus, low impact doesn’t have to be robust to adversarially chosen input attainable utilities—we get to choose them. Just choose the “am I activated” indicator utility and AUP seems to do fine, modulo open questions raised in the post and comments.
? I don’t see that. What’s the argument?
(If you want to say that we can’t define friendly behaviour without using our values, then I would agree ^_^ but I think you’re trying to argue something else).
Take a friendly AI that does stuff. Then there is a utility function for which that “does stuff” is the single worst thing the AI could have done.
The fact that no course of action is universally friendly doesn’t mean it can’t be friendly for us.
As I understand it, the impact version of this argument is flawed in the same way (but less blatantly so): something being high impact according to a contrived utility function doesn’t mean we can’t induce behavior that is, with high probability, low impact for the vast majority of reasonable utility functions.
Indeed, by “friendly AI” I meant “an AI friendly for us”. So yes, I was showing a contrived example of an AI that was friendly, and low impact, from our perspective, but that was not, as you said, universally friendly (or universally low impact).
In my experience so far, we need to include our values, in part, to define “reasonable” utility functions.
It seems that an extremely broad set of input attainable functions suffices to capture the “reasonable” functions with respect to which we want to be low impact. For example, “remaining on”, “reward linear in how many blue pixels are observed each time step”, etc. All thanks to instrumental convergence and opportunity cost.
Even a zero-impact AI which is limited to pure observation may not be acceptable to many people (not everybody wants his or her sex life to be recorded and analysed).
If the AI isn’t just fed all the data by default (ie via a camera already at the opportune location), taking steps to observe is (AUP-)impactful. I think you’re right that agents with small impact allowances can still violate values.