I think you have basically not understood the argument which I understand various MIRI folks to make, and I think Eliezer’s comment on this post does not explain the pieces which you specifically are missing. I’m going to attempt to clarify the parts which I think are most likely to be missing. This involves a lot of guessing, on my part, at what is/isn’t already in your head, so I apologize in advance if I guess wrong.
(Side note: I am going to use my own language in places where I think it makes things clearer, in ways which I don’t think e.g. Eliezer or Nate or Rob would use directly, though I think they’re generally gesturing at the same things.)
A Toy Model/Ontology
I think a core part of the confusion here involves conflation of several importantly-different things, so I’ll start by setting up a toy model in which we can explicitly point to those different things and talk about how their differences matter. Note that this is a toy model; it’s not necessarily intended to be very realistic.
Our toy model is an ML system, designed to run on a hypercomputer. It works by running full low-level physics simulations of the universe, for exponentially many initial conditions. When the system receives training data/sensor-readings/inputs, it matches the predicted-sensor-readings from its low-level simulations to the received data, does a Bayesian update, and then uses that to predict the next data/sensor-readings/inputs; the predicted next-readings are output to the user. In other words, it’s doing basically-perfect Bayesian prediction on data based on low-level physics priors.
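For concreteness, here's a minimal sketch of that prediction loop (my own toy Python rendering; simulate_physics and sensor_reading are made-up stand-ins, and the uniform prior and exact-match likelihood are simplifications):

from collections import defaultdict

def predict_next(observations, initial_conditions, simulate_physics, sensor_reading):
    """Posterior-weighted prediction of the next sensor reading.

    observations       -- sensor readings received so far (a list)
    initial_conditions -- candidate initial states (exponentially many, in the toy model)
    simulate_physics   -- (initial_state, t) -> low-level world state at time t
    sensor_reading     -- world_state -> what the sensors would read
    """
    # Bayesian update with a uniform prior and an exact-match (0/1) likelihood.
    posterior = {}
    for init in initial_conditions:
        predicted = [sensor_reading(simulate_physics(init, t)) for t in range(len(observations))]
        posterior[init] = 1.0 if predicted == observations else 0.0

    total = sum(posterior.values())
    if total == 0:
        raise ValueError("no simulated universe reproduces the observations")

    # Mix the surviving universes' predictions for the next time step.
    prediction = defaultdict(float)
    for init, weight in posterior.items():
        if weight > 0:
            next_state = simulate_physics(init, len(observations))
            prediction[sensor_reading(next_state)] += weight / total
    return dict(prediction)

Note that nothing in this loop is a “values” variable or a “person” variable; internally it's all initial conditions, world states, and sensor readings. That absence is the point of the claims below.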
Claim 1: this toy model can “extract preferences from human data” in behaviorally the same way that GPT does (though presumably the toy model would perform better). That is, you can input a bunch of text data, then prompt the thing with some moral/ethical situation, and it will continue the text in basically the same way a human would (at least within distribution). (If you think GPTs “understand human values” in a stronger sense than that, and that difference is load-bearing for the argument you want to make, then you should leave a response highlighting that particular divergence.)
Modulo some subtleties which I don’t expect to be load-bearing for the current discussion, I expect MIRI-folk would say:
Building this particular toy model, and querying it in this way, addresses ~zero of the hard parts of alignment.
Basically-all of the externally-visible behavior we’ve seen from GPT to date looks like a more-realistic operationalization of something qualitatively similar to the toy model. GPT answering moral questions similarly to humans tells us basically-nothing about the difficulty of alignment, for basically the same reasons that the toy model answering moral questions similarly to humans would tell us basically-nothing about the difficulty of alignment.
(Those two points are here as a checksum, to see whether your own models have diverged yet from the story told here.)
(Some tangential notes:
The user interface of the toy model matters a lot here. If we just had an amazing simulator, we could maybe do a simulated long reflection, but both the toy model and GPT are importantly not that.
The “match predicted-sensor-readings from low-level simulation to received data” step is hiding a whole lot of subtlety, in ways which aren’t relevant yet but might be later.
)
So, what are the hard parts and why doesn’t the toy model address them?
“Values”, and Pointing At Them
First distinction: humans’ answers to questions about morality are not the same as human values. More generally, any natural-language description of human values, or natural-language discussion of human values, is not the same as human values.
(On my-model-of-a-MIRIish-view:) If we optimize hard for humans’ natural-language yay/nay in response to natural language prompts, we die. This is true for ~any natural-language prompts which are even remotely close to the current natural-language distribution.
The central thing-which-is-hard-to-do is to point powerful intelligence at human values (as opposed to “humans’ natural-language yay/nays in response to natural language prompts”, which are not human values and are not a safe proxy for human values, but are probably somewhat easier to point an intelligence at).
Now back to the toy model. If we had some other mind (not our toy model) which generally structures its internal cognition around ~the same high-level concepts as humans, then one might in-principle be able to make a relatively-small change to that mind such that it optimized for (its concept of) human values (which basically matches humans’ concept of human values, by assumption). Conceptually, the key question is something like “is the concept of human values within this mind the type of thing which a pointer in the mind can point at?”. But our toy model has nothing like that. Even with full access to the internals of the toy model, it’s just low-level physics; identifying “human values” embedded in the toy model is no easier than identifying “human values” embedded in the physics of our own world. So that’s reason #1 why the toy model doesn’t address the hard parts: the toy model doesn’t “understand” human values in the sense of internally using ~the same concept of human values as humans use.
In some sense, the problem of “specifying human values” and “aiming an intelligence at something” are just different facets of this same core hard problem:
we need to somehow get a powerful mind to “have inside it” a concept which basically matches the corresponding human concept at which we want to aim
“have inside it” cashes out to something roughly like “the concept needs to be the type of thing which a pointer in the mind can point to, and then the rest of the mind will then treat the pointed-to thing with the desired human-like semantics”; e.g. answering external natural-language queries doesn’t even begin to cut it
… and then some pointer(s) in the mind’s search algorithms need to somehow be pointed at that concept.
Why Answering Natural-Language Queries About Morality Is Basically Irrelevant
A key thing to note here: all of those “hard problem” bullets are inherently about the internals of a mind. Observing external behavior in general reveals little-to-nothing about progress on those hard problems. The difference between the toy model and the more structured mind is intended to highlight the issue: the toy model doesn’t even contain the types of things which would be needed for the relevant kind of “pointing at human values”, yet the toy model can behaviorally achieve ~the same things as GPT.
(And we’d expect something heavily optimized to predict human text to be pretty good at predicting human text regardless, which is why we get approximately-zero evidence from the observation that GPT accurately predicts human answers to natural-language queries about morality.)
Now, there is some relevant evidence from interpretability work. Insofar as human-like concepts tend to have GPT-internal representations which are “simple” in some way, and especially in a way which might make them easily-pointed-to internally in a way which carries semantics across the pointer, that is relevant. On my-model-of-a-MIRIish-view, it’s still not very relevant, since we expect major phase shifts as AI gains capabilities, so any observation of today’s systems is very weak evidence at best. But things like e.g. Turner’s work retargeting a maze-solver by fiddling with its internals are at least the right type-of-thing to be relevant.
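To make that “right type-of-thing” concrete, here's a deliberately crude sketch (my own made-up illustration, not Turner's actual setup and not anything MIRI has written): one mind exposes named internal concepts that a goal pointer can be retargeted at, while the toy physics-predictor exposes nothing of the sort, even though both can answer the same natural-language questions.

from dataclasses import dataclass, field

@dataclass
class StructuredMind:
    """A mind whose internal cognition is organized around explicit high-level concepts."""
    concepts: dict = field(default_factory=lambda: {
        "diamond": object(),       # stand-in for some internal representation
        "human_values": object(),  # by assumption, ~matches the human concept
    })
    goal_pointer: str = "diamond"

    def retarget(self, concept_name: str) -> None:
        # The hoped-for "relatively-small change": repoint the goal at another internal concept.
        if concept_name not in self.concepts:
            raise KeyError(f"no internal concept named {concept_name!r}")
        self.goal_pointer = concept_name

@dataclass
class ToyPhysicsPredictor:
    """The hypercomputer model: internally, nothing but low-level physics states."""
    particle_states: list = field(default_factory=list)
    # No .concepts and no .goal_pointer: identifying "human values" inside
    # particle_states is as hard as identifying them in real-world physics.

# Both can answer natural-language moral questions about equally well, but only the
# first even has the type-of-thing that "point the AI at human values" needs.
StructuredMind().retarget("human_values")   # a well-defined operation
# ToyPhysicsPredictor() has nothing analogous to retarget.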
Side Note On Relevant Capability Levels
I would guess that many people (possibly including you?) reading all that will say roughly:
Ok, but this whole “If we optimize hard for humans’ natural-language yay/nay in response to natural language prompts, we die” thing is presumably about very powerful intelligences, not about medium-term, human-ish level intelligences! So observing GPT should still update us about whether medium-term systems can be trusted to e.g. do alignment research.
Remember that, on a MIRIish model, meaningful alignment research is proving rather hard for human-level intelligence; one would therefore need at least human-level intelligence in order to solve it in a timely fashion. (Also, AI hitting human-level at tasks like AI research means takeoff is imminent, roughly speaking.) So the general pathway of “align weak systems → use those systems to accelerate alignment research” just isn’t particularly relevant on a MIRIish view. Alignment of weaker systems is relevant only insofar as it informs alignment of more powerful systems, which is what everything above was addressing.
I expect plenty of people to disagree with that point, but insofar as you expect people with MIRIish views to think weak systems won’t accelerate alignment research, you should not expect them to update on the difficulty of alignment due to evidence whose relevance routes through that pathway.
This comment is valuable for helping to clarify the disagreement. So, thanks for that. Unfortunately, I am not sure I fully understand the comment yet. Before I can reply in-depth, I have a few general questions:
Are you interpreting me as arguing that alignment is easy in this post? I avoided arguing that, partly because I don’t think the inner alignment problem has been solved, and the inner alignment problem seems to be the “hard part” of the alignment problem, as I understand it. Solving inner alignment completely would probably require (at the very least) solving mechanistic interpretability, which I don’t think we’re currently close to.
Are you saying that MIRI has been very consistent on the question of where the “hard parts” of alignment lie? If so, then your comment makes more sense to me, as you (in my understanding) are trying to summarize what their current arguments are, which then (again, in my understanding) would match what MIRI said more than five years ago. However, I was mainly arguing against the historical arguments, or at least my interpretation of those arguments, such as the arguments in Nate Soares’ 2017 talk. To the extent that the arguments you present are absent from pre-2018 MIRI content, I think they’re mostly out of scope for the purpose of my thesis, although I agree that it’s important to talk about how hard alignment is independently of all the historical arguments.
(In general, I agree that discussions about current arguments are way more important than discussions about what people believed >5 years ago. However, I think it’s occasionally useful to talk about the latter, and so I wrote one post about it.)
Are you interpreting me as arguing that alignment is easy in this post?
Not in any sense which I think is relevant to the discussion at this point.
Are you saying that MIRI has been very consistent on the question of where the “hard parts” of alignment lie?
My estimate of how well Eliezer or Nate or Rob of 2016 would think my comment above summarizes the relevant parts of their own models, is basically the same as my estimate of how well Eliezer or Nate or Rob of today would think my comment above summarizes the relevant parts of their own models.
That doesn’t mean that any of them (nor I) have ever explained these parts particularly clearly. Speaking from my own experience, these parts are damned annoyingly difficult to explain; a whole stack of mental models has to be built just to convey the idea, and none of them are particularly legible. (Specifically, the second half of the “‘Values’, and Pointing At Them” section is the part that’s most difficult to explain. My post The Pointers Problem is my own best attempt to date to convey those models, and it remains mediocre.) Most of the arguments historically given are, I think, attempts to shoehorn as much of the underlying mental model as possible into leaky analogies.
Thanks for the continued clarifications.
Our primary existing disagreement might be this part:
My estimate of how well Eliezer or Nate or Rob of 2016 would think my comment above summarizes the relevant parts of their own models, is basically the same as my estimate of how well Eliezer or Nate or Rob of today would think my comment above summarizes the relevant parts of their own models.
Of course, there’s no way of proving what these three people would have said in 2016, and I sympathize with the people who are saying they don’t care much about the specific question of who said what when. However, here’s a passage from the Arbital page on the Problem of fully updated deference, which I assume was written by Eliezer,
One way to look at the central problem of value identification in superintelligence is that we’d ideally want some function that takes a complete but purely physical description of the universe, and spits out our true intended notion of value V in all its glory. Since superintelligences would probably be pretty darned good at collecting data and guessing the empirical state of the universe, this probably solves the whole problem.
This is not the same problem as writing down our true V by hand. The minimum algorithmic complexity of a meta-utility function ΔU which outputs V after updating on all available evidence, seems plausibly much lower than the minimum algorithmic complexity for writing V down directly. But as of 2017, nobody has yet floated any formal proposal for a ΔU of this sort which has not been immediately shot down.
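(Restating that claim in symbols before interpreting it, using my own notation rather than Eliezer’s: write K(·) for minimum algorithmic complexity and E for all available evidence about the physical universe. The claim under discussion is that there plausibly exists a meta-utility function ΔU with

\[
\Delta U(E) = V \qquad \text{and} \qquad K(\Delta U) \ll K(V),
\]

i.e. a program that pins down V only after updating on the world may be much shorter than any direct encoding of V, and yet, per the quote, no concrete proposal for such a ΔU had survived scrutiny as of 2017.)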
In that passage, Eliezer describes the problem of value identification similarly to the way I had in the post, except he refers to a function that reflects “value V in all its glory” rather than a function that reflects V with fidelity comparable to the judgement of an ordinary human. And he adds that “as of 2017, nobody has yet floated any formal proposal for a ΔU of this sort which has not been immediately shot down”. My interpretation is therefore one of the following:
(1) Eliezer believed that we need a proposed solution to the value identification problem that far exceeds the performance of humans on the task of identifying valuable from non-valuable outcomes. This is somewhat plausible, as he mentions CEV in the next paragraph, but elsewhere Eliezer has said, “When I say that alignment is lethally difficult, I am not talking about ideal or perfect goals of ‘provable’ alignment, nor total alignment of superintelligences on exact human values, nor getting AIs to produce satisfactory arguments about moral dilemmas which sorta-reasonable humans disagree about”.
(2) Or, in this post, he’s directly saying that he thinks that the problem of value identification was unsolved in 2017, in the sense that I meant it in the post.
If interpretation (1) is accurate, then I mostly just think that we don’t need to specify an objective function that matches something like the full coherent extrapolated volition of humanity in order to survive AGI. On the other hand, if interpretation (2) is accurate, then I think in 2017 and potentially earlier, Eliezer genuinely felt that there was an important component of the alignment problem that involved specifying a function that reflected the human value function at a level that current LLMs are relatively close to achieving, and he considered this problem unsolved.
I agree there are conceivable alternative ways of interpreting this quote. However, I believe the weight of the evidence, given the quotes I provided in the post, in addition to the one I provided here, supports my thesis about the historical argument, and what people had believed at the time (even if I’m wrong about a few details).
Either Eliezer believed that we need a proposed solution to the value identification problem that far exceeds the performance of humans on the task of identifying valuable from non-valuable outcomes. This is somewhat plausible as he mentions CEV in the next paragraph, but elsewhere Eliezer has said, “When I say that alignment is lethally difficult, I am not talking about ideal or perfect goals of ‘provable’ alignment, nor total alignment of superintelligences on exact human values, nor getting AIs to produce satisfactory arguments about moral dilemmas which sorta-reasonable humans disagree about”.
I believe you’re getting close to the actual model here, but not quite hitting it on the head.
First: lots of ML-ish alignment folks today would distinguish the problem of aligning an AI capable enough to do alignment research well enough to land in the right basin of attraction[1], from the problem of aligning a far-superhuman intelligence well enough. On a MIRIish view, humanish-or-weaker systems don’t much matter for alignment, but there’s still an important potential divide between aligning an early supercritical AGI and aligning full-blown far superintelligence.
In the “long run”, IIUC Eliezer wants basically-“ideal”[2] alignment of far superintelligence. But he’ll still tell you that you shouldn’t aim for something that hard early on; instead, aim for something (hopefully) easier, like e.g. corrigibility. (If you’ve been reading the old Arbital pages, then presumably you’ve seen him say this sort of thing there.)
Second: while I worded my comment at the top of this chain to be about values, the exact same mental model applies to other alignment targets, like e.g. corrigibility. Here’s the relevant part of my earlier comment, edited to be about corrigibility instead:
… humans’ answers to questions about corrigibility are not the same as corrigibility. More generally, any natural-language description of corrigibility, or natural-language discussion of corrigibility, is not the same as corrigibility.
(On my-model-of-a-MIRIish-view:) If we optimize hard for humans’ natural-language yay/nay in response to natural language prompts which are nominally about “corrigibility”, we die. This is true for ~any natural-language prompts which are even remotely close to the current natural-language distribution.
The central thing-which-is-hard-to-do is to point powerful intelligence at corrigibility (as opposed to “humans’ natural-language yay/nays in response to natural language prompts which are nominally about ‘corrigibility’”, which are not corrigibility and are not a safe proxy for corrigibility, but are probably somewhat easier to point an intelligence at).
Now back to the toy model. If we had some other mind (not our toy model) which generally structures its internal cognition around ~the same high-level concepts as humans, then one might in-principle be able to make a relatively-small change to that mind such that it optimized for (its concept of) corrigibility (which basically matches humans’ concept of corrigibility, by assumption). Conceptually, the key question is something like “is the concept of corrigibility within this mind the type of thing which a pointer in the mind can point at?”. But our toy model has nothing like that. Even with full access to the internals of the toy model, it’s just low-level physics; identifying “corrigibility” embedded in the toy model is no easier than identifying “corrigibility” embedded in the physics of our own world. So that’s reason #1 why the toy model doesn’t address the hard parts: the toy model doesn’t “understand” corrigibility in the sense of internally using ~the same concept of corrigibility as humans use.
In some sense, the problem of “specifying corrigibility” and “aiming an intelligence at something” are just different facets of this same core hard problem:
we need to somehow get a powerful mind to “have inside it” a concept which basically matches the corresponding human concept at which we want to aim
“have inside it” cashes out to something roughly like “the concept needs to be the type of thing which a pointer in the mind can point to, and then the rest of the mind will then treat the pointed-to thing with the desired human-like semantics”; e.g. answering external natural-language queries doesn’t even begin to cut it
… and then some pointer(s) in the mind’s search algorithms need to somehow be pointed at that concept.
… and we could just as easily repeat this exercise with even weaker targets, like “don’t kill all the humans”. The core hard problem remains the same. On the MIRIish view, some targets (like corrigibility) might be easier than others (like human values) mainly because the easier targets are more likely to be “natural” concepts which an AI ends up using, so the step of “we need to somehow get a powerful mind to ‘have inside it’ a concept which basically matches the corresponding human concept at which we want to aim” is easier. But it’s still basically the same mental model, basically the same core hard steps which need to be overcome somehow.
Why aren’t answers to natural language queries a good enough proxy for near-superhuman systems?
My guess at your main remaining disagreement after all that: sure, answers to natural language queries about morality might not cut it under a lot of optimization pressure, but why aren’t answers to natural language queries a good enough proxy for near-superhuman systems?
(On a MIRIish model) a couple reasons:
First, such systems are already superhuman, and already run into Goodhart-style problems to a significant degree. Heck, we’ve already seen Goodhart problems crop up here and there even in today’s generally-subhuman models! (A toy simulation illustrating this appears below.)
Second, just making the near-superhuman system not immediately kill us is not the problem. The problem is to make the near-superhuman system aligned enough that the successors it produces (possibly with human help) converge to not kill us. That iterative successor-production is itself a process which applies a lot of optimization pressure.
(I personally would give a bunch of other reasons here, but they’re not things I see MIRI folks discuss as much.)
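As a toy illustration of the first point above (entirely my own made-up example, not anything MIRI has published): select the best of n candidates by a proxy score equal to the true score plus heavy-tailed error. As n grows, i.e. as more optimization pressure is applied to the proxy, the winner's proxy score keeps climbing while its true score stalls, because selection increasingly latches onto the error term.

import numpy as np

rng = np.random.default_rng(0)

def goodhart_demo(n_candidates, trials=1000):
    """Mean (proxy, true) score of the proxy-argmax candidate, averaged over trials.

    true score  ~ Normal(0, 1)
    error       ~ Student-t with 3 degrees of freedom (heavier-tailed than the true score)
    proxy score = true score + error
    """
    proxy_at_selection, true_at_selection = [], []
    for _ in range(trials):
        true = rng.normal(0.0, 1.0, size=n_candidates)
        error = rng.standard_t(df=3, size=n_candidates)
        proxy = true + error
        best = int(np.argmax(proxy))          # "optimize hard for the proxy"
        proxy_at_selection.append(proxy[best])
        true_at_selection.append(true[best])
    return float(np.mean(proxy_at_selection)), float(np.mean(true_at_selection))

for n in (10, 100, 1000, 10000):              # increasing optimization pressure
    mean_proxy, mean_true = goodhart_demo(n)
    print(f"n={n:>6}: mean selected proxy = {mean_proxy:5.2f}, mean selected true = {mean_true:5.2f}")

The numbers themselves don't matter; the point is just that “optimize the proxy harder” and “get more of the thing the proxy was supposed to measure” come apart under enough selection pressure.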
Going one level deeper: the same mental model as above is still the relevant thing to have in mind, even for near-superhuman (or even human-ish-level) intelligence. It’s still the same core hard problem, and answers to natural language queries are still basically-irrelevant for basically the same reasons.
[1] Specifically, this refers to the basin of attraction under the operation of the AI developing/helping develop a successor AI.
[2] “Ideal” is in scare quotes here because it’s not necessarily “ideal” in the same sense that any given reader would first think of it—for instance I don’t think Eliezer would imagine “mathematically proving the system is Good”, though I expect some people imagine that he imagines that.
The problem is to make the near-superhuman system aligned enough that the successors it produces (possibly with human help) converge to not kill us.
What makes this concept confusing and probably a bad framing is that to the extent doom is likely, neither many individual humans nor humanity as a whole are aligned in this sense. Humanity is currently in the process of producing successors that fail to predictably have the property of converging to not kill us. (I agree that this is the MIRI referent of values/alignment and the correct thing to keep in mind as the central concern.)
(Placeholder: I think this view of alignment/model internals seems wrongheaded in a way which invalidates the conclusion, but don’t have time to leave a meaningful reply now. Maybe we should hash this out sometime at Lighthaven.)
humans’ answers to questions about morality are not the same as human values.
How do you know? Because of some additional information you have. Which the AI could have, if it has some huge dataset. No, it doesn’t necessarily care… but it doesn’t necessarily not care. It’s possible to build an AI that refines a crude initial set of values, if you want one. That’s how moral development in humans works, too.