Thanks for the continued clarifications.

Our primary existing disagreement might be this part,
My estimate of how well Eliezer or Nate or Rob of 2016 would think my comment above summarizes the relevant parts of their own models, is basically the same as my estimate of how well Eliezer or Nate or Rob of today would think my comment above summarizes the relevant parts of their own models.
Of course, there’s no way of proving what these three people would have said in 2016, and I sympathize with the people who are saying they don’t care much about the specific question of who said what when. However, here’s a passage from the Arbital page on the Problem of fully updated deference, which I assume was written by Eliezer,
One way to look at the central problem of value identification in superintelligence is that we’d ideally want some function that takes a complete but purely physical description of the universe, and spits out our true intended notion of value V in all its glory. Since superintelligences would probably be pretty darned good at collecting data and guessing the empirical state of the universe, this probably solves the whole problem.
This is not the same problem as writing down our true V by hand. The minimum algorithmic complexity of a meta-utility function ΔU which outputs V after updating on all available evidence, seems plausibly much lower than the minimum algorithmic complexity for writing V down directly. But as of 2017, nobody has yet floated any formal proposal for a ΔU of this sort which has not been immediately shot down.
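(A minimal symbolic restatement of the quoted claim, for readers who want it in notation; this gloss is mine, not the Arbital page’s, and it uses Kolmogorov complexity K as one way to cash out “algorithmic complexity”.)

```latex
% My gloss of the quoted claim, not the Arbital page's own notation.
% \Delta U is the meta-utility function, E is all available evidence,
% V is our true intended notion of value, and K(\cdot) is Kolmogorov complexity.
% The page's claim is that the left-hand complexity is plausibly much lower.
\Delta U(E) = V, \qquad K(\Delta U) \ll K(V)
```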
Here, Eliezer describes the problem of value identification similarly to the way I did in the post, except he refers to a function that reflects “value V in all its glory” rather than a function that reflects V with fidelity comparable to the judgement of an ordinary human. And he adds that “as of 2017, nobody has yet floated any formal proposal for a ΔU of this sort which has not been immediately shot down”. My interpretation here is therefore as follows,
1. Either Eliezer believed that we need a proposed solution to the value identification problem that far exceeds the performance of humans on the task of identifying valuable from non-valuable outcomes. This is somewhat plausible as he mentions CEV in the next paragraph, but elsewhere Eliezer has said, “When I say that alignment is lethally difficult, I am not talking about ideal or perfect goals of ‘provable’ alignment, nor total alignment of superintelligences on exact human values, nor getting AIs to produce satisfactory arguments about moral dilemmas which sorta-reasonable humans disagree about”.
2. Or, in this post, he’s directly saying that he thinks that the problem of value identification was unsolved in 2017, in the sense that I meant it in the post.
If interpretation (1) is accurate, then I mostly just think that we don’t need to specify an objective function that matches something like the full coherent extrapolated volition of humanity in order to survive AGI. On the other hand, if interpretation (2) is accurate, then I think in 2017 and potentially earlier, Eliezer genuinely felt that there was an important component of the alignment problem that involved specifying a function that reflected the human value function at a level that current LLMs are relatively close to achieving, and he considered this problem unsolved.
I agree there are conceivable alternative ways of interpreting this quote. However, I believe the weight of the evidence, given the quotes I provided in the post in addition to the one I provided here, supports my thesis about the historical argument and about what people believed at the time (even if I’m wrong about a few details).
Either Eliezer believed that we need a proposed solution to the value identification problem that far exceeds the performance of humans on the task of identifying valuable from non-valuable outcomes. This is somewhat plausible as he mentions CEV in the next paragraph, but elsewhere Eliezer has said, “When I say that alignment is lethally difficult, I am not talking about ideal or perfect goals of ‘provable’ alignment, nor total alignment of superintelligences on exact human values, nor getting AIs to produce satisfactory arguments about moral dilemmas which sorta-reasonable humans disagree about”.
I believe you’re getting close to the actual model here, but not quite hitting it on the head.
First: lots of ML-ish alignment folks today would distinguish the problem of aligning an AI capable enough to do alignment research well enough that it is in the right basin of attraction[1], from the problem of aligning a far-superhuman intelligence well enough. On a MIRIish view, humanish-or-weaker systems don’t much matter for alignment, but there’s still an important potential divide between aligning an early supercritical AGI and aligning full-blown far superintelligence.
In the “long run”, IIUC Eliezer wants basically-“ideal”[2] alignment of far superintelligence. But he’ll still tell you that you shouldn’t aim for something that hard early on; instead, aim for something (hopefully) easier, like e.g. corrigibility. (If you’ve been reading the old Arbital pages, then presumably you’ve seen him say this sort of thing there.)
Second: while I worded my comment at the top of this chain to be about values, the exact same mental model applies to other alignment targets, like e.g. corrigibility. Here’s the relevant part of my earlier comment, edited to be about corrigibility instead:
… humans’ answers to questions about ~~morality~~ corrigibility are not the same as ~~human values~~ corrigibility. More generally, any natural-language description of ~~human values~~ corrigibility, or natural-language discussion of ~~human values~~ corrigibility, is not the same as ~~human values~~ corrigibility.
(On my-model-of-a-MIRIish-view:) If we optimize hard for humans’ natural-language yay/nay in response to natural language prompts which are nominally about “corrigibility”, we die. This is true for ~any natural-language prompts which are even remotely close to the current natural-language distribution.
The central thing-which-is-hard-to-do is to point powerful intelligence at ~~human values~~ corrigibility (as opposed to “humans’ natural-language yay/nays in response to natural language prompts which are nominally about ‘corrigibility’”, which are not ~~human values~~ corrigibility and are not a safe proxy for ~~human values~~ corrigibility, but are probably somewhat easier to point an intelligence at).
Now back to the toy model. If we had some other mind (not our toy model) which generally structures its internal cognition around ~the same high-level concepts as humans, then one might in-principle be able to make a relatively-small change to that mind such that it optimized for (its concept of) ~~human values~~ corrigibility (which basically matches humans’ concept of ~~human values~~ corrigibility, by assumption). Conceptually, the key question is something like “is the concept of ~~human values~~ corrigibility within this mind the type of thing which a pointer in the mind can point at?”. But our toy model has nothing like that. Even with full access to the internals of the toy model, it’s just low-level physics; identifying ~~“human values”~~ “corrigibility” embedded in the toy model is no easier than identifying ~~“human values”~~ “corrigibility” embedded in the physics of our own world. So that’s reason #1 why the toy model doesn’t address the hard parts: the toy model doesn’t “understand” ~~human values~~ corrigibility in the sense of internally using ~the same concept of ~~human values~~ corrigibility as humans use.
In some sense, the problem of “specifying ~~human values~~ corrigibility” and “aiming an intelligence at something” are just different facets of this same core hard problem:
we need to somehow get a powerful mind to “have inside it” a concept which basically matches the corresponding human concept at which we want to aim
“have inside it” cashes out to something roughly like “the concept needs to be the type of thing which a pointer in the mind can point to, and then the rest of the mind will then treat the pointed-to thing with the desired human-like semantics”; e.g. answering external natural-language queries doesn’t even begin to cut it
… and then some pointer(s) in the mind’s search algorithms need to somehow be pointed at that concept.
… and we could just as easily repeat this exercise with even weaker targets, like “don’t kill all the humans”. The core hard problem remains the same. On the MIRIish view, some targets (like corrigibility) might be easier than others (like human values) mainly because the easier targets are more likely to be “natural” concepts which an AI ends up using, so the step of “we need to somehow get a powerful mind to ‘have inside it’ a concept which basically matches the corresponding human concept at which we want to aim” is easier. But it’s still basically the same mental model, basically the same core hard steps which need to be overcome somehow.
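To make the “pointable concept” framing concrete, here is a minimal toy sketch in Python. It is my illustration, not anything from MIRI’s writing, and every name in it is made up; it just contrasts a mind whose internals contain a concept object that an objective pointer can be aimed at with a mind that only exposes natural-language answers, and so contains nothing to point at.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

@dataclass
class Concept:
    """Toy stand-in for a concept inside a mind: something the rest of the
    mind's machinery can evaluate against world-states."""
    name: str
    evaluate: Callable[[dict], float]

@dataclass
class StructuredMind:
    """A mind whose cognition is organized around explicit concepts."""
    concepts: Dict[str, Concept] = field(default_factory=dict)
    objective: Optional[Concept] = None  # the pointer the mind's search uses

    def point_objective_at(self, concept_name: str) -> None:
        # "Aiming" this mind is a relatively small change: redirect one internal
        # pointer, *if* a matching concept already exists inside the mind.
        self.objective = self.concepts[concept_name]

@dataclass
class OracleMind:
    """A mind that only answers natural-language queries. Good answers to
    questions about corrigibility give us nothing internal to point at:
    there is no `concepts` dict and no `objective` pointer here."""
    answer_query: Callable[[str], str]

# Usage: alignment-as-pointing only even parses for the first kind of mind.
mind = StructuredMind(
    concepts={"corrigibility": Concept("corrigibility", lambda state: 1.0)}
)
mind.point_objective_at("corrigibility")
```

The contrast is purely in the type signatures: point_objective_at needs a Concept to already exist inside the mind, whereas an OracleMind can answer corrigibility questions arbitrarily well without containing any such pointable object.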
Why aren’t answers to natural language queries a good enough proxy for near-superhuman systems?
My guess at your main remaining disagreement after all that: sure, answers to natural language queries about morality might not cut it under a lot of optimization pressure, but why aren’t answers to natural language queries a good enough proxy for near-superhuman systems?
(On a MIRIish model) a couple reasons:
First, such systems are already superhuman, and already run into Goodhart-style problems to a significant degree. Heck, we’ve already seen Goodhart problems crop up here and there even in today’s generally-subhuman models! (A toy sketch of this dynamic follows below.)
Second, just making the near-superhuman system not immediately kill us is not the problem. The problem is to make the near-superhuman system aligned enough that the successors it produces (possibly with human help) converge to not kill us. That iterative successor-production is itself a process which applies a lot of optimization pressure.
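Here is a toy numerical sketch of the Goodhart point from the first reason (mine, not a MIRI model; the functions and ranges are arbitrary). A proxy that tracks the true objective closely on a familiar range comes apart from it once a stronger optimizer searches far outside that range:

```python
import random

# Toy Goodhart sketch (illustrative only; the functions and ranges are arbitrary).
# The true objective peaks at x = 1. The proxy agrees with it closely on the
# familiar range (|x| <= 2), but its error term dominates far outside that range.

def true_value(x: float) -> float:
    return -abs(x - 1.0)

def proxy_value(x: float) -> float:
    return true_value(x) + 0.01 * x * x  # tiny error on-distribution, huge off it

random.seed(0)

# Weak optimization: best of a handful of on-distribution candidates.
weak_pick = max((random.uniform(-2, 2) for _ in range(10)), key=proxy_value)

# Strong optimization: search a much wider space, much harder.
strong_pick = max((random.uniform(-100, 100) for _ in range(100_000)), key=proxy_value)

print(f"weak optimizer:   proxy={proxy_value(weak_pick):+.2f}  true={true_value(weak_pick):+.2f}")
print(f"strong optimizer: proxy={proxy_value(strong_pick):+.2f}  true={true_value(strong_pick):+.2f}")
# The strong optimizer achieves a higher proxy score by exploiting the error
# term, and ends up far worse on the true objective.
```

Nothing hinges on the particular functions; the point is just that “agrees on-distribution” plus “much more optimization pressure” is exactly the combination under which a proxy and its target come apart.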
(I personally would give a bunch of other reasons here, but they’re not things I see MIRI folks discuss as much.)
Going one level deeper: the same mental model as above is still the relevant thing to have in mind, even for near-superhuman (or even human-ish-level) intelligence. It’s still the same core hard problem, and answers to natural language queries are still basically-irrelevant for basically the same reasons.
[1] Specifically, this refers to the basin of attraction under the operation of the AI developing/helping develop a successor AI.

[2] “Ideal” is in scare quotes here because it’s not necessarily “ideal” in the same sense that any given reader would first think of it—for instance I don’t think Eliezer would imagine “mathematically proving the system is Good”, though I expect some people imagine that he imagines that.
The problem is to make the near-superhuman system aligned enough that the successors it produces (possibly with human help) converge to not kill us.
What makes this concept confusing, and probably a bad framing, is that to the extent doom is likely, many individual humans, and humanity as a whole, are not aligned in this sense. Humanity is currently in the process of producing successors that fail to predictably have the property of converging to not kill us. (I agree that this is the MIRI referent of values/alignment and the correct thing to keep in mind as the central concern.)