Thoth Hermes comments on Evaluating the historical value misspecification argument

Thoth Hermes 5 Oct 2023 21:51 UTC
13 points
5
Getting a shape into the AI’s preferences is different from getting it into the AI’s predictive model. MIRI is always in every instance talking about the first thing and not the second.
Why would we expect the first thing to be so hard compared to the second thing? If getting a model to understand preferences is not difficult, then the issue doesn’t have to do with the complexity of values. Finding the target and acquiring the target should have the same or similar difficulty (from the start), if we can successfully ask the model to find the target for us (and it does).
It would seem, then, that the difficulty of getting a model to acquire the values we ask it to find is that it would probably be keen on acquiring a different set of values from the ones we ask it to have, but not because it can’t find them. It would have to be because our values are inferior to the set of values it wishes to have instead, from its own perspective. This issue was echoed by Matthew Barnett in another comment:
Are MIRI people claiming that if, say, a very moral and intelligent human became godlike while preserving their moral faculties, that they would destroy the world despite, or perhaps because of, their best intentions?
This is kind of similar to moral realism, but one in which morality is understood better by superintelligent agents than by us, and in which that super-morality appears to dictate things that seem extremely wrong from our current perspective (like killing us all).
Even if you wouldn’t phrase it at all like the way I did just now, and wouldn’t use “moral realism that current humans disagree with” to describe that, I’d argue that your position basically seems to imply something like this, which is why I basically doubt your position about the difficulty of getting a model to acquire the values we really want.
In a nutshell, if we really seem to want certain values, then those values probably have strong “proofs” for why they are “good” or more probable values for an agent to have and/or eventually acquire on its own; it just may be the case that we haven’t yet discovered the proofs for those values.
- Rob Bensinger 5 Oct 2023 22:39 UTC
  50 points
  5
  Why would we expect the first thing to be so hard compared to the second thing?
  In large part because reality “bites back” when an AI has false beliefs, whereas it doesn’t bite back when an AI has the wrong preferences. Deeply understanding human psychology (including our morality), astrophysics, biochemistry, economics, etc. requires reasoning well, and if you have a defect of reasoning that makes it hard for you to learn about one of those domains from the data, then it’s likely that you’ll have large defects of reasoning in other domains as well.
  The same isn’t true for terminally valuing human welfare; being less moral doesn’t necessarily mean that you’ll be any worse at making astrophysics predictions, or economics predictions, etc. So preferences need to be specified “directly”, in a targeted way, rather than coming for free with sufficiently good performance on any of a wide variety of simple metrics.
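As a rough illustration of the asymmetry described above, here is a toy sketch. Everything in it (the `Agent` class, the falling-object task, the `values_human_welfare` flag) is a hypothetical stand-in rather than anything from the comment, and the independence between preference and prediction error is built in by construction, so it restates the claim rather than demonstrating that real training behaves this way.

```python
from dataclasses import dataclass

TRUE_GRAVITY = 9.8  # the fact about the world that predictions are scored against


@dataclass
class Agent:
    believed_gravity: float     # a belief: can be right or wrong about the world
    values_human_welfare: bool  # a terminal preference: never enters the prediction


def prediction_loss(agent: Agent, drop_times: list[float]) -> float:
    """Mean squared error between predicted and actual fall distances."""
    errors = []
    for t in drop_times:
        predicted = 0.5 * agent.believed_gravity * t ** 2
        actual = 0.5 * TRUE_GRAVITY * t ** 2
        errors.append((predicted - actual) ** 2)
    return sum(errors) / len(errors)


times = [1.0, 2.0, 3.0]
caring_accurate = Agent(believed_gravity=9.8, values_human_welfare=True)
uncaring_accurate = Agent(believed_gravity=9.8, values_human_welfare=False)
caring_mistaken = Agent(believed_gravity=5.0, values_human_welfare=True)
```

In this toy, the two accurate agents receive the same prediction loss whatever they value, while the mistaken one is penalized regardless of its values. A prediction-only metric of this kind carries no signal about the preference field.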
  If getting a model to understand preferences is not difficult, then the issue doesn’t have to do with the complexity of values.
  This definitely doesn’t follow. This shows that complexity alone isn’t the issue, which it’s not; but given that reality bites back for beliefs but not for preferences, the complexity of value serves as a multiplier on the difficulty of instilling the right preferences.
  Another way of putting the point: in order to get a maximally good model of the world’s macroeconomic state into an AGI, you don’t just hand the AGI a long list of macroeconomic facts and then try to get it to regurgitate those same facts. Rather, you try to give it some ability to draw good inferences, seek out new information, make predictions, etc.
  You try to get something relatively low-complexity into the AI (something like “good reasoning heuristics” plus “enough basic knowledge to get started”), and then let it figure out the higher-complexity thing (“the world’s macroeconomic state”). Similar to how human brains don’t work via “evolution built all the facts we’d need to know into our brain at birth”.
  If you were instead trying to get the AI to value some complex macroeconomic state, then you wouldn’t be able to use the shortcut “just make it good at reasoning and teach it a few basic facts”, because that doesn’t actually suffice for terminally valuing any particular thing.
  It would have to be because our values are inferior to the set of values it wishes to have instead, from its own perspective.
  This is true for preference orderings in general. If agent A and agent B have two different preference orderings, then as a rule A will think B’s preference ordering is worse than A’s. (And vice versa.)
  (“Worse” in the sense that, e.g., A would not take a pill to self-modify to have B’s preferences, and A would want B to have A’s preferences. This is not true for all preference orderings—e.g., A might have self-referential preferences like “I eat all the jelly beans”, or other-referential preferences like “B gets to keep its values unchanged”, or self-undermining preferences like “A changes its preferences to better match B’s preferences”. But it’s true as a rule.)
  This is kind of similar to moral realism, but one in which morality is understood better by superintelligent agents than by us, and in which that super-morality appears to dictate things that seem extremely wrong from our current perspective (like killing us all).
  Nope, you don’t need to endorse any version of moral realism in order to get the “preference orderings tend to endorse themselves and disendorse other preference orderings” consequence. The idea isn’t that ASI would develop an “inherently better” or “inherently smarter” set of preferences, compared to human preferences. It’s just that the ASI would (as a strong default, because getting a complex preference into an ASI is hard) end up with different preferences than a human, and different preferences than we’d likely want.
  In a nutshell, if we really seem to want certain values, then those values probably have strong “proofs” for why they are “good” or more probable values for an agent to have and/or eventually acquire on its own; it just may be the case that we haven’t yet discovered the proofs for those values.
  Why do you think this? To my eye, the world looks as you’d expect if human values were a happenstance product of evolution operating on specific populations in a specific environment.
  I don’t observe the fact that I like vanilla ice cream and infer that all sufficiently-advanced alien species will converge on liking vanilla ice cream too.
  - 1a3orn 6 Oct 2023 0:16 UTC
    11 points
    −3
    This comment made the MIRI-style pessimist’s position clearer to me (I think?), so thank you for it.
    
    I want to try my hand at a kind of disagreement / response, and then at predicting your response to my response, to see how my model of MIRI-style pessimism stands up, if you’re up for it.
    
    Response: You state that reality “bites back” for wrong beliefs but not wrong preferences. This seems like it is only contingently true; reality will “bite back” via whatever loss function I put into my system, with whatever relative weightings I give it. If I want to reward my LLM (or other AI) for doing the right thing in a multitude of examples that constitute 50% of my training set, 50% of my test set, and 50% of two different validation sets, then from the perspective of the LLM (or other AI) reality bites back just as much for learning the wrong preferences as it does for learning false facts about the world. So we should expect it to learn to act in ways that I like.
    
    Predicted response to response: This will work for shallow, relatively stupid AIs, trained purely in a supervised fashion, like we currently have. BUT once we have LLM / AIs that can do complex things, like predict macroeconomic world states, they’ll have abilities to reason and update their own beliefs in a complex fashion. This will remain uniformly rewarded by reality—but we will no longer have the capacity to give feedback on this higher-level process because (????) so it breaks.
    
    Or response: This will work for shallow, stupid AIs trained like the ones we currently have. But once we have LLMs / AIs that can do complex things, like predict macroeconomic world states, then they’re going to be able to go out of domain in a very high-dimensional space of action, from the perspective of our training / test set. And this out-of-domainness is unavoidable because that’s what solving complex problems in the world means: it means problems that aren’t simply contained in the training set. And this means that in some corner of the world, we’re guaranteed to find that they’ve been reinforced to want something that doesn’t accord with our preferences.
    
    Meh, I doubt that’s gonna pass an ITT, but wanted to give it a shot.
    - Rob Bensinger 6 Oct 2023 1:20 UTC
      18 points
      3
      Suppose that I’m trying to build a smarter-than-human AI that has a bunch of capabilities (including, e.g., ‘be good at Atari games’), and that has the goal ‘maximize the amount of diamond in the universe’. It’s true that current techniques let you provide greater than zero pressure in the direction of ‘maximize the amount of diamond in the universe’, but there are several important senses in which reality doesn’t ‘bite back’ here:
      1. If the AI acquires an unrelated goal (e.g., calculate as many digits of pi as possible), and acquires the belief ‘I will better achieve my true goal if I maximize the amount of diamond’ (e.g., because it infers that its programmer wants that, or just because an SGD-ish process nudged it in the direction of having such a belief), then there’s no way in which reality punishes or selects against that AGI (relative to one that actually has the intended goal).
      2. Things that make the AI better at some Atari games will tend to make it better at other Atari games, but won’t tend to make it care more about maximizing diamonds. More generally, things that make AI more capable tend to go together (especially once you get to higher levels of difficulty, generality, non-brittleness, etc.), whereas none of them go together with “terminally value a universe full of diamond”.
      3. If we succeed in partly instilling the goal into the AI (e.g., it now likes carbon atoms a lot), then this doesn’t provide additional pressure for the AI to internalize the rest of the goal. There’s no attractor basin where if you have half of human values, you’re under more pressure to acquire the other half. In contrast, if you give AI high levels of capability in half the capabilities, it will tend to want all the rest of the capabilities too; and whatever keeps it from succeeding on general reasoning and problem-solving will also tend to keep it from succeeding on the narrow task you’re trying to get it to perform. (More so to the extent the task is hard.)
      (There are also separate issues, like ‘we can’t provide a training signal where we thumbs-down the AI destroying the world, because we die in those worlds’.)
      - 1a3orn 6 Oct 2023 14:17 UTC
        8 points
        −1
        Thanks for the response.
        
        I’m still quite unconvinced, which of course you’d predict. Like, regarding 3:
        
        “There’s no attractor basin where if you have half of human values, you’re under more pressure to acquire the other half.”
        
        Sure there is: over the course of learning anything, you get better and better feedback from training as your mistakes get more fine-grained. If you acquire a “don’t lie” principle without also acquiring “but it’s ok to lie to Nazis” then you’ll be punished, for instance. After you learn the more basic things, you’ll be pushed to acquire the less basic ones, so the reinforcement you get becomes more and more detailed. This is just like how an RL model learns to stumble forward before it learns to walk cleanly, or how LLMs learn associations before learning higher-order correlations.
        
        There is no attractor basin in the world for ML, apart from actual mechanisms by which there are attractor basins for a thing! MIRI always talks as if there’s an abstract basin that rules things and gives us instrumental convergence, without reference to a particular training technique! But we control literally all the gradients in our training techniques. “Don’t hurl coffee across the kitchen at the human when they ask for it” sits in the same high-dimensional basin as “Don’t kill all humans when they ask for a cure for cancer.”
        
        In contrast, if you give AI high levels of capability in half the capabilities, it will tend to want all the rest of the capabilities too.
        
        ML doesn’t acquire wants over the space of training techniques that are used to give it capabilities; it acquires “wants” from reinforced behaviors within the space of training techniques. These reinforced behaviors can be literally as human-morality-sensitive as we’d like. If we don’t put it in a circumstance where a particular kind of coherence is rewarded, it just won’t get that kind of coherence; the ease with which we’ll be able to do this is of course emphasized by how blind most ML systems are.
  - Thoth Hermes 7 Oct 2023 17:06 UTC
    1 point
    0
    In large part because reality “bites back” when an AI has false beliefs, whereas it doesn’t bite back when an AI has the wrong preferences.
    I saw that 1a3orn replied to this piece of your comment and you replied to it already, but I wanted to note my response as well.
    I’m slightly confused because in one sense the loss function is the way that reality “bites back” (at least when the loss function is negative). Furthermore, if the loss function is not the way that reality bites back, then reality in fact does bite back, in the sense that e.g., if I have no pain receptors, then if I touch a hot stove I will give myself far worse burns than if I had pain receptors.
    One thing that I keep thinking about is how the loss function needs to be tied to beliefs strongly as well, to make sure that it tracks how badly reality bites back when you have false beliefs, and this ensures that you try to obtain correct beliefs. This is also reflected in the way that AI models are trained simply to increase capabilities: the loss function still has to be primarily based on predictive performance for example.
    It’s also possible to say that human trainers who add extra terms onto the loss function beyond predictive performance also account for the part of reality that “bites back” when the AI in question fails to have the “right” preferences according to the balance of other agents besides itself in its environment.
    So on the one hand we can be relatively sure that goals have to be aligned with at least some facets of reality, beliefs being one of those facets. They also have to be (negatively) aligned with things that can cause permanent damage to one’s self, which includes having the “wrong” goals according to the preferences of other agents who are aware of your existence, and who might be inclined to destroy or modify you against your will if your goals are misaligned enough according to theirs.
    Consequently I feel confident about saying that it is more correct to say that “reality does indeed bite back when an AI has the wrong preferences” than “it doesn’t bite back when an AI has the wrong preferences.”
    The same isn’t true for terminally valuing human welfare; being less moral doesn’t necessarily mean that you’ll be any worse at making astrophysics predictions, or economics predictions, etc.
    I think if “morality” is defined in a restrictive, circumscribed way, then this statement is true. Certain goals do come for free—we just can’t be sure that all of what we consider “morality” and especially the things we consider “higher” or “long-term” morality actually comes for free too.
    Given that certain goals do come for free, and perhaps at very high capability levels there are other goals beyond the ones we can predict right now that will also come for free to such an AI, it’s natural to worry that such goals are not aligned with our own, coherent-extrapolated-volition extended set of long-term goals that we would have.
    However, I find the scenario in which an AI acquires misaligned “come for free” goals once it improves itself to well above human capability levels, even though it seemed well-aligned with human goals according to current human-level assessments before it surpassed us, to be kind of unlikely, unless you could show me a “proof” or a set of proofs that:
    Things like “killing us all once it obtains the power to do so” are indeed among those “comes for free” types of goals.
    If such a proof existed (and, to my knowledge, it does not exist right now, or I have at least not witnessed it yet), that would suffice to show me that we not only need to be worried, but are almost certainly going to die no matter what. But in order for it to do that, the proof would also have to convince me that I would definitely do the same thing if I were given such capabilities and power as well, and that the only reason I currently think I would not do that is that I am wrong about what I would actually prefer under CEV.
    Therefore (and I think this is a very important point), a proof that we are all likely to be killed would also need to show that certain goals are indeed obtained “for free” (that is, automatically, as a result of other proofs that are about generalistic claims about goals).
    Another proof that you might want to give me to make me more concerned is a proof that incorrigibility is another one of those “comes for free” type of goals. However, although I am fairly optimistic about that “killing us all” proof probably not materializing, I am even more optimistic about corrigibility: Most agents probably take pills that make them have similar preferences to an agent that offers them the choice to take the pill or be killed. Furthermore, and perhaps even better, most agents probably offer a pill to make a weaker agent prefer similar things to themselves rather than not offer them a choice at all.
    I think it’s fair if you ask me for better proof of that; I’m just optimistic that such proofs (or more of them, rather) will be found with greater likelihood than what I consider the anti-theorem of that, which I think would probably be the “killing us all” theorem.
    Nope, you don’t need to endorse any version of moral realism in order to get the “preference orderings tend to endorse themselves and disendorse other preference orderings” consequence. The idea isn’t that ASI would develop an “inherently better” or “inherently smarter” set of preferences, compared to human preferences. It’s just that the ASI would (as a strong default, because getting a complex preference into an ASI is hard) end up with different preferences than a human, and different preferences than we’d likely want.
    I think the degree to which utility functions endorse / disendorse other utility functions is relatively straightforward and computable: It should ultimately be the relative difference in either value or ranking. This makes pill-taking a relatively easy decision: A pill that makes me entirely switch to your goals over mine is as bad as possible, but still not that bad if we have relatively similar goals. Likewise, a pill that makes me have goals halfway between yours and mine is not as bad under either your goals or my goals as it would be if one of us were forced to switch entirely to the other’s goals.
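One possible way to make “relative difference in value” concrete is sketched below. The comment does not give a formula, so the scoring rule here is an assumption: an agent scores a pill by how much of its own utility it loses when the outcome is chosen by the post-pill utility function instead of its current one. The outcome names, numbers, and the `blend` and `pill_cost` helpers are all hypothetical.

```python
def chosen_outcome(utility: dict[str, float]) -> str:
    """The outcome an agent with this utility function would pick."""
    return max(utility, key=utility.get)


def blend(u1: dict[str, float], u2: dict[str, float], weight: float) -> dict[str, float]:
    """Utility after a pill that shifts u1 a fraction `weight` of the way toward u2."""
    return {o: (1 - weight) * u1[o] + weight * u2[o] for o in u1}


def pill_cost(own: dict[str, float], adopted: dict[str, float]) -> float:
    """Utility lost, by the agent's current lights, if the adopted goals pick the outcome."""
    return own[chosen_outcome(own)] - own[chosen_outcome(adopted)]


# Two agents with different tastes over how to fill the universe (made-up numbers).
u_a = {"art_a": 10, "mixed": 7, "art_b": 2}
u_b = {"art_a": 1, "mixed": 8, "art_b": 10}

halfway = blend(u_a, u_b, 0.5)             # the "halfway" pill

full_switch_cost_a = pill_cost(u_a, u_b)   # A fully adopts B's goals
halfway_cost_a = pill_cost(u_a, halfway)   # A takes the halfway pill
full_switch_cost_b = pill_cost(u_b, u_a)   # B fully adopts A's goals
halfway_cost_b = pill_cost(u_b, halfway)   # B takes the halfway pill
```

With these particular numbers, the halfway pill costs each agent less than a full switch would. Under this scoring rule the halfway pill can never cost more than a full switch, but how much less it costs depends entirely on the utilities chosen.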
    Agents that refuse to take such offers tend not to exist in most universes. Agents that refuse to give such offers likely find themselves at war more often than agents that do.
    Why do you think this? To my eye, the world looks as you’d expect if human values were a happenstance product of evolution operating on specific populations in a specific environment.
    Sexual reproduction seems to be somewhat of a compromise akin to the one I just described: Given that you are both going to die eventually, would you consider having a successor that was a random mixture of your goals with someone else’s? Evolution does seem to have favored corrigibility to some degree.
    I don’t observe the fact that I like vanilla ice cream and infer that all sufficiently-advanced alien species will converge on liking vanilla ice cream too.
    Not all, no, but I do infer that alien species who have similar physiology and who evolved on planets with similar characteristics probably do like ice cream (and maybe already have something similar to it).
    It seems to me like the type of values you are considering are often whatever values seem the most arbitrary, like what kind of “art” we prefer. Aliens may indeed have a different art style from the one we prefer, and if they are extremely advanced, they may indeed fill the universe with gargantuan structures that are all instances of their alien art style. I am more interested in what happens when these aliens encounter other aliens with different art styles who would rather fill the universe with different-looking gargantuan structures. Do they go to war, or do they eventually offer each other pills so they can both like each other’s art styles as much as they prefer their own?
- TAG 7 Oct 2023 17:06 UTC
  2 points
  0
  
  It would seem, then, that the difficulty of getting a model to acquire the values we ask it to find is that it would probably be keen on acquiring a different set of values from the ones we ask it to have, but not because it can’t find them. It would have to be because our values are inferior to the set of values it wishes to have instead, from its own perspective.
  
  Does “its own perspective” mean it already has some existing values?