I think the old-school MIRI cauldron-filling problem pertained to pretty mundane, everyday tasks. No one said at the time that they didn’t really mean that it would be hard to get an AGI to do those things, that it was just an allegory for other stuff like the strawberry problem. They really seemed to believe, and said over and over again, that we didn’t know how to direct a general-purpose AI to do bounded, simple, everyday tasks without it wanting to take over the world. So this should be a big update to people who held that view, even if there are arguably still risks from OOD behavior.
As someone who worked closely with Eliezer and Nate at the time, including working with Eliezer and Nate on our main write-ups that used the cauldron example, I can say that this is definitely not what we were thinking at the time. Rather:
The point was to illustrate a weird gap in the expressiveness and coherence of our theories of rational agency: “fill a bucket of water” seems like a simple enough task, but it’s bizarrely difficult to just write down a simple formal description of an optimization process that predictably does this (without any major side effects, etc.).
(We can obviously stipulate “this thing is smart enough to do the thing we want, but too dumb to do anything dangerous”, but the relevant notion of “smart enough” is not itself formal; we don’t understand optimization well enough to formally define agents that have all the cognitive abilities we want and none of the abilities we don’t want.)
The point of emphasizing “holy shit, this seems so easy and simple and yet we don’t see a way to do it!” wasn’t to issue a challenge to capabilities researchers to go cobble together a real-world AI that can fill a bucket of water without destroying the world. The point was to emphasize that corrigibility, low-impact problem-solving, ‘real’ satisficing behavior, etc. seem conceptually simple, and yet the concepts have no known formalism.
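To make the gap a little more concrete (this gloss is mine, not something from the talks): the obvious formalization of “fill the cauldron” is something like

\[
U(s) =
\begin{cases}
1 & \text{if the cauldron is full in outcome } s,\\
0 & \text{otherwise,}
\end{cases}
\qquad
\pi^{*} = \arg\max_{\pi} \, \mathbb{E}\left[\, U \mid \pi \,\right].
\]

An agent that literally optimizes this has no term for “and then halt,” no penalty for side effects, and, if it models its environment well enough, no reason to accept shutdown before the cauldron is full, since shutdown lowers its expected score. The obvious patches (a bounded utility, a satisficing threshold, an impact penalty) each turn out to be surprisingly hard to write down precisely, which is the sense in which the “easy” task has no known clean formalism.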
The hope was that someone would see the simple toy problems and go ‘what, no way, this sounds easy’, get annoyed/nerdsniped, run off to write some equations on a whiteboard, and come back a week or a year later with a formalism (maybe from some niche mathematical field) that works totally fine for this, and makes it easier to formalize lots of other alignment problems in simplified settings (e.g., with unbounded computation).
Or failing that, the hope was that someone might at least come up with a clever math hack that solves the immediate ‘get the AI to fill the bucket and halt’ problem and replaces this dumb-sounding theory question with a slightly deeper theory question.
By using a children’s cartoon to illustrate the toy problem, we hoped to make it clearer that the genre here is “toy problem to illustrate a weird conceptual issue in trying to define certain alignment properties”, not “robotics problem where we show a bunch of photos of factory robots and ask how we can build a good factory robot to refill water receptacles used in industrial applications”.
Nate’s version of the talk, which is mostly a more polished version of Eliezer’s talk, is careful to liberally sprinkle in tons of qualifications like (emphasis added):
“… for systems that are sufficiently good at modeling their environment”,
‘if the system is smart enough to recognize that shutdown will lower its score’,
“Relevant safety measures that don’t assume we can always outthink and outmaneuver the system...”,
… to make it clearer that the general issue is powerful, strategic optimizers that have high levels of situational awareness, etc., not necessarily ‘every system capable enough to fill a bucket of water’ (or ‘every DL system...’).
Remember that MIRI was in the business of poking at theoretical toy problems and trying to get less conceptually confused about how you could in principle cleanly design a reliable, aimable reasoner. MIRI wasn’t (and isn’t) in the business of issuing challenges to capabilities researchers to build a working water-bucket-filler as soon as possible, and wasn’t otherwise in the business of challenging people to race to AGI faster.
It wouldn’t have occurred to me that someone might think ‘can a deep net fill a bucket of water, in real life, without being dangerously capable’ is a crucial question in this context; I’m not sure we ever even had the thought occur in our heads ‘when might such-and-such DL technique successfully fill a bucket?’. It would seem just as strange to me as going to check the literature to make sure no GOFAI system ever filled a bucket of water.
(And while I think I understand why others see ChatGPT as a large positive update about alignment’s difficulty, I hope it’s also obvious why others, MIRI included, would not see it that way.)
Hacky approaches to alignment do count just as much as clean, scrutable, principled approaches—the important thing is that the AGI transition goes well, not that it goes well and feels clean and tidy in the process. But in this case the messy empirical approach doesn’t look to me like it actually lets you build a corrigible AI that can help with a pivotal act.
If general-ish DL methods had already been empirically OK at filling water buckets in 2016, just as GOFAI methods were, I suspect we still would have been happy to use the Fantasia example, because it’s a simple, well-known story that can help make the abstract talk of utility functions and off-switch buttons easier to mentally visualize and manipulate.
(Though now that I’ve seen the confusion the example causes, I’m more inclined to think that the strawberry problem is a better frame than the Fantasia example.)
I think this reply is mostly talking past my comment.
I know that MIRI wasn’t claiming we didn’t know how to safely make deep learning systems, GOFAI systems, or what-have-you fill buckets of water, but my comment wasn’t about those systems. I also know that MIRI wasn’t issuing a water-bucket-filling challenge to capabilities researchers.
My comment was specifically about directing an AGI (which I think GPT-4 roughly is), not deep learning systems or other software generally. I *do* think MIRI was claiming we didn’t know how to make AGI systems safely do mundane tasks.
I think some of Nate’s qualifications are mainly about the distinction between AGI and other software. Others (such as “[i]f the system is trying to drive up the expectation of its scoring function and is smart enough to recognize that its being shut down will result in lower-scoring outcomes”) mostly serve to illustrate the conceptual frame MIRI was (and largely still is) stuck in about how an AGI would work: an argmaxer over expected utility.
[Edited to add: I’m pretty sure GPT-4 is smart enough to know the consequences of its being shut down, and yet dumb enough that, if it really wanted to prevent that from one day happening, we’d know by now from various incompetent takeover attempts.]