Rob Bensinger comments on Evaluating the historical value misspecification argument

Rob Bensinger 5 Oct 2023 23:04 UTC
14 points
5
“Fill the cauldron” examples are examples where the cauldron-filler has the wrong utility function, not examples where it has the wrong beliefs. E.g., this is explicit in https://intelligence.org/2016/12/28/ai-alignment-why-its-hard-and-where-to-start/
The idea of the “fill the cauldron” examples isn’t “the AI is bad at NLP and therefore doesn’t understand what we mean when we say ‘fill’, ‘cauldron’, etc.” It’s “even simple small-scale tasks are unnatural, in the sense that it’s hard to define a coherent preference ordering over world-states such that maximizing it completes the task and has no serious negative impact; and there isn’t an obvious patch that overcomes the unnaturalness or otherwise makes it predictably easier to aim AI systems at a bounded low-impact task like this”. (Including easier to aim via training.)
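A toy sketch of the point above, in Python. The action names, probabilities, and impact numbers are invented for illustration and are not taken from the linked talk. It shows that a utility function rewarding only "cauldron full" ranks actions purely by success probability. Subtracting an impact penalty changes the answer only if someone has already supplied a trustworthy impact measure and a suitable weight, which is where the hard part of the problem sits.

```python
# Hypothetical actions: (name, P(cauldron ends up full), side-effect magnitude).
# All numbers are invented for illustration.
ACTIONS = [
    ("pour one bucket", 0.95, 0.0),
    ("pour buckets nonstop", 0.999, 5.0),
    ("seize every water source", 0.99999, 1000.0),
]


def naive_utility(p_full, impact):
    """Expected utility when utility is 1 for a full cauldron, else 0."""
    return p_full  # impact is ignored entirely


def penalized_utility(p_full, impact, weight):
    """Same as above minus a hand-picked impact penalty (the 'patch')."""
    return p_full - weight * impact


def best_action(utility):
    """Return the name of the action with the highest expected utility."""
    return max(ACTIONS, key=lambda a: utility(a[1], a[2]))[0]


naive_choice = best_action(naive_utility)
patched_choice = best_action(lambda p, i: penalized_utility(p, i, weight=0.1))
weak_patch_choice = best_action(lambda p, i: penalized_utility(p, i, weight=1e-9))
```

In this toy, `naive_choice` and `weak_patch_choice` both pick the most extreme action, and `patched_choice` picks the modest one only because the impact column was labeled by hand.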
- Zack_M_Davis 5 Oct 2023 23:33 UTC
  48 points
  33
  Parent
  To this, the deep-learning-has-alignment-implications proponent replies: “But simple small-scale tasks don’t require maximizing a coherent preference ordering over world-states. We can already hook up an LLM to a robot and have it obey natural-language commands in a reasonable way.”
  
  To which you might reply, “Fine, cute trick, but that doesn’t help with the real alignment problem, which is that eventually someone will invent a powerful optimizer with a coherent preference ordering over world-states, which will kill us.”
  
  To which the other might reply, “Okay, I agree that we don’t know how to align an arbitrarily powerful optimizer with a coherent preference ordering over world-states, but if your theory predicts that we can’t aim AI systems at low-impact tasks via training, you have to be getting something wrong, because people are absolutely doing that right now, by treating it as a mundane engineering problem in the current paradigm.”
  
  To which you might reply, “We predict that the mundane engineering approach will break down once the systems are powerful enough to come up with plans that humans can’t supervise”?
  - TAG 8 Oct 2023 0:00 UTC
    2 points
    −5
    Parent
    
    eventually someone will invent a powerful optimizer with a coherent preference ordering over world-states, which will kill us.
    
    It’s unlikely that any realistic AI will be perfectly coherent, or have exact preferences over world-states. The first is roughly equivalent to the Frame Problem; the second is defeated by embeddedness.
  - gallabytes 6 Oct 2023 16:37 UTC
    1 point
    0
    Parent
    The obvious question here is to what degree you need new techniques, versus merely training new models with the same techniques as you scale current approaches.
    One of the virtues of the deep learning paradigm is that you can usually test things at small scale (where the models are not and will never be especially smart) and there’s a smooth range of scaling regimes in between where things tend to generalize.
    If you need fundamentally different techniques at different scales, and the large scale techniques do not work at intermediate and small scales, then you might have a problem. If you need the same techniques as at medium or small scales for large scales, then engineering continues to be tractable even as algorithmic advances obsolete old approaches.
- Tor Økland Barstad 6 Oct 2023 0:50 UTC
  5 points
  0
  Parent
  Thanks for the reply :) Feel free to reply further if you want, but I hope you don’t feel obliged to do so^[1].
  “Fill the cauldron” examples are (...) not examples where it has the wrong beliefs.
  I have never ever been confused about that!
  It’s “even simple small-scale tasks are unnatural, in the sense that it’s hard to define a coherent preference ordering over world-states such that maximizing it completes the task and has no serious negative impact; and there isn’t an obvious patch that overcomes the unnaturalness or otherwise makes it predictably easier to aim AI systems at a bounded low-impact task like this”. (Including easier to aim via training.)
  That is well phrased. And what you write here doesn’t seem in contradiction with my previous impression of things.
  I think the feeling I had when first hearing “fill the bucket”-like examples was “interesting—you made a legit point/observation here”^[2].
  I’m having a hard time giving a crystalized/precise summary of why I nonetheless feel (and have felt^[3]) confused. I think some of it has to do with:
  - More “outer alignment”-like issues being given what seems/seemed to me like outsized focus compared to more “inner alignment”-like issues (although there has been a focus on both for as long as I can remember).
  - The attempts to think of “tricks” seeming to be focused on real-world optimization-targets to point at, rather than ways of extracting help with alignment somehow / trying to find techniques/paths/tricks for obtaining reliable oracles.
  - Having utility functions so prominently/commonly be the layer of abstraction that is used^[4].
    
    I remember Nate Soares once using the analogy of a very powerful function-optimizer (“I could put in some description of a mathematical function, and it would give me an input that made that function’s output really large”). Thinking of the problem at that layer of abstraction makes much more sense to me.
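    A minimal sketch of that interface in Python: a function goes in, and an input that scores well comes out. The optimizer here is plain hill climbing with made-up parameters, and `example_function` is a hypothetical objective; it only illustrates the shape of the abstraction, not the "very powerful" optimizer of the analogy.

```python
import random


def maximize(f, start, steps=2000, step_size=0.1, seed=0):
    """Simple hill climbing: keep a random nearby point if it scores higher."""
    rng = random.Random(seed)
    best_x, best_y = start, f(start)
    for _ in range(steps):
        candidate = best_x + rng.uniform(-step_size, step_size)
        y = f(candidate)
        if y > best_y:
            best_x, best_y = candidate, y
    return best_x


def example_function(x):
    """A made-up objective with its maximum at x = 3."""
    return -(x - 3.0) ** 2


found = maximize(example_function, start=0.0)
```

    Note that the optimizer is indifferent to what the function "means"; it only ever sees inputs and scores.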
  It’s purposeful that I say “I’m confused”, and not “I understand all details of what you were thinking, and can clearly see that you were misguided”.
  When seeing e.g. Eliezer’s talk AI Alignment: Why It’s Hard, and Where to Start, I understand that I’m seeing a fairly small window into his thinking. So when it gives a sense of him not thinking about the problem quite like I would think about it, that is more of a suspicion that I get/got from it—not something I can conclude from it in a firm way.
  1. ^
    If I could steal a given amount of your time, I would not prioritize you replying to this.
  2. ^
    I can’t remember this point/observation being particularly salient to me (in the context of AI) before I first was exposed to Bostrom’s/Eliezer’s writings (in 2014).
    
    As a sidenote: I wasn’t that worried about technical alignment prior to reading Bostrom’s/Eliezer’s stuff, and became worried upon reading it.
  3. ^
    What has confused me has varied throughout time. If I tried to be very precise about what I think I thought when, this comment would become more convoluted. (Also, it’s sometimes hard for me to separate false memories from real ones.)
  4. ^
    I have read this tweet, which seemed in line with my interpretation of things.
  - Rob Bensinger 6 Oct 2023 1:41 UTC
    2 points
    0
    Parent
    More “outer alignment”-like issues being given what seems/seemed to me like outsized focus compared to more “inner alignment”-like issues (although there has been a focus on both for as long as I can remember).
    In retrospect I think we should have been more explicit about the importance of inner alignment; I think that we didn’t do that in our introduction to corrigibility because it wasn’t necessary for illustrating the problem and where we’d run into roadblocks.
    Maybe a missing piece here is some explanation of why having a formal understanding of corrigibility might be helpful for actually training corrigibility into a system? (Helpful at all, even if it’s not sufficient on its own.)
    The attempts to think of “tricks” seeming to be focused on real-world optimization-targets to point at, rather than ways of extracting help with alignment somehow / trying to find techniques/paths/tricks for obtaining reliable oracles.
    Aside from “concreteness can help make the example easier to think about when you’re new to the topic”, part of the explanation here might be “if the world is solved by AI, we do actually think it will probably be via doing some concrete action in the world (e.g., build nanotech), not via helping with alignment or building a system that only outputs English-language sentences”.
    Having utility functions so prominently/commonly be the layer of abstraction that is used^[4].
    I mean, I think utility functions are an extremely useful and basic abstraction. I think it’s a lot harder to think about a lot of AI topics without invoking ideas like ‘this AI thinks outcome X is better than outcome Y’, or ‘this AI’s preferences come with different weights, which can’t purely be reduced to what the AI believes’.
    - Tor Økland Barstad 6 Oct 2023 14:38 UTC
      1 point
      0
      Parent
      Thanks for the reply :) I’ll try to convey some of my thinking, but I don’t expect great success. I’m working on more digestible explainers, but this is a work in progress, and I have nothing good that I can point people to as of now.
      (...) part of the explanation here might be “if the world is solved by AI, we do actually think it will probably be via doing some concrete action in the world (e.g., build nanotech), not via helping with alignment (...)
      Yeah, I guess this is where a lot of the differences in our perspective are located.
      if the world is solved by AI, we do actually think it will probably be via doing some concrete action in the world (e.g., build nanotech)
      Things have to cash out in terms of concrete actions in the world. Maybe a contention is the level of indirection we imagine in our heads (by which we try to obtain systems that can help us do concrete actions).
      Prominent in my mind are scenarios that involve a lot of iterative steps (but over a short amount of time) before we start evaluating systems by doing AGI-generated experiments. In the earlier steps, we avoid doing any actions in the “real world” that are influenced in a detailed way by AGI output, and we avoid having real humans be exposed to AGI-generated argumentation.
      Examples of stuff we might try to obtain:
      - AGI “lie detector techniques” (maybe something that is in line with the ideas of Collin Burns)
      - Argument/proof evaluators (this is an interest of mine, but making better explainers is still a work in progress, and I have some way to go)
      - If we are good at program-search, this can itself be used to obtain programs that help us be better at program-search (finding functions that score well according to well-defined criteria).
      Some tasks can be considered to be inside of “test-range”^[1]:
      - Predicting human answers to questions posed by other humans^[2].
      - Outputting prime numbers^[3]
      - Predicting experimental results from past experimental data^[4]
      - Whether a chess-move is good^[5]
      - Etc, etc
      Other tasks are outside of “test-range”:
      - Predicting human answers to any question (including questions that involve being exposed to AGI-generated content)^[6]
      - Whether a given instruction-plan actually results in machines that copy strawberries at the molecular level (and does so in accordance with “the spirit” of the request that was given)
      - Etc, etc
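      The prime-number case is the easiest of these to make concrete. A minimal sketch in Python, assuming small numbers where trial division is fast enough; `untrusted_output` is a hypothetical list standing in for whatever an untrusted system returned.

```python
def is_prime(n):
    """Deterministic trial division; adequate for small integers."""
    if n < 2:
        return False
    if n < 4:
        return True
    if n % 2 == 0:
        return False
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def check_claims(alleged_primes):
    """Return the alleged primes that are not actually prime."""
    return [n for n in alleged_primes if not is_prime(n)]


# Hypothetical output from an untrusted system: 91 = 7 * 13 is not prime.
untrusted_output = [2, 3, 5, 89, 91, 97]
rejected = check_claims(untrusted_output)
```

      The checker never has to trust the system that produced the list, which is what makes this kind of task "inside of test-range" in the sense used here.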
      Most requests that actually are helpful to us are outside of test-range. And when the requirements that matter to us are outside of test-range, it is of course harder to test in a safe/reliable way if systems are giving us what we want.
      But we can have AGIs output programs that help us with tasks, and we can define requirements^[7] for these programs. And for these program-requirements, AGIs can help us explore stuff such as:
      - Are there programs that satisfy the requirements but disagree about certain outputs? (be that outputs that are inside of test-range or outside of test-range)
      - Are there programs that satisfy the requirements, but perform poorly for certain inputs that are inside of test-range?
      - If we only require correct outputs within the entire “test-range”, is it nonetheless infeasible to make programs that satisfy the requirements while giving bad output for ≥1 inputs within the test-range?
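      A toy version of the first question, sketched in Python. The candidate programs, the requirement (matching a few input/output pairs), and the split into checkable and uncheckable inputs are all hypothetical stand-ins; a real program search would be far larger, but the check has the same shape.

```python
from itertools import combinations

# Hypothetical candidate programs returned by a program search.
CANDIDATES = {
    "double": lambda x: 2 * x,
    "square": lambda x: x * x,
    "add_two": lambda x: x + 2,
}

# Requirement: match these input/output pairs (inputs we can check safely).
TEST_RANGE = {0: 0, 2: 4}

# Inputs whose correct outputs we cannot check directly.
OUTSIDE_TEST_RANGE = [3, 10]


def satisfies(program):
    """True if the program reproduces every pair in TEST_RANGE."""
    return all(program(x) == y for x, y in TEST_RANGE.items())


def disagreements(programs, inputs):
    """List (name_a, name_b, input) where two accepted programs differ."""
    found = []
    for (a, fa), (b, fb) in combinations(sorted(programs.items()), 2):
        for x in inputs:
            if fa(x) != fb(x):
                found.append((a, b, x))
    return found


accepted = {name: f for name, f in CANDIDATES.items() if satisfies(f)}
conflicts = disagreements(accepted, OUTSIDE_TEST_RANGE)
```

      Here `double` and `square` both pass the requirement yet disagree on every uncheckable input, so the requirement demonstrably fails to pin down behavior outside the test-range.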
      Phrased somewhat differently: We may come up with techniques for making it hard to not give helpful/honest outputs^[8], and test how well these seem to generalize.
      And similar techniques can also be applied to program-requirement-requirements, program-requirement-requirement-requirements, etc^[9].
      Anyway, it’s hard to write about this kind of stuff in a concrete way without going off on large tangents, and it’s hard to write about it in an abstract way without being vague and convoluted. For whatever it’s worth, these are some principles I think may be helpful (very much not a complete list):
      - Generality as it correlates with measurables (such as source code length or computational resource use): When writing a program to do well for many outputs (while fulfilling certain requirements), intentionally having it fail for particular outputs will sometimes require it to be less general (e.g. adding extra if-statements, resulting in a longer minimum source code length for the program to exhibit this behavior).
      - Optimization-targets that incentivize modularity make it harder to make programs that intentionally fail for particular outputs: In a more modular system, it will less often be clear from sub-routine inputs what the macro-level task is^[10].
      - When a requirement or optimization-target fails at “forcing” good outputs, AGIs that act so as to be selected for are “incentivized” to reveal this as being the case: Suppose some set of requirements is compatible with bad outputs inside “test-range”. Or suppose that some set of requirements allow for programs that give contradictory outputs. If we do “gradient descent”, looking for AGIs that help us explore spaces of programs defined by requirements that we specify, we will select for AGIs that point out this kind of stuff.
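      The first principle can be illustrated with a deliberately tiny Python example. Both source strings are hypothetical, and character count is only a crude proxy for minimum source code length; the point is just that the version which fails on one chosen input needs an extra branch.

```python
# Two hypothetical programs for the same task (squaring a number).
HONEST_SOURCE = "def f(x):\n    return x * x\n"
SABOTAGED_SOURCE = (
    "def f(x):\n"
    "    if x == 7:\n"          # special case added to fail on one input
    "        return 0\n"
    "    return x * x\n"
)


def load(source):
    """Execute a source string and return the function named f it defines."""
    namespace = {}
    exec(source, namespace)
    return namespace["f"]


honest = load(HONEST_SOURCE)
sabotaged = load(SABOTAGED_SOURCE)

# The two agree on a small set of tested inputs...
agree_on_tests = all(honest(x) == sabotaged(x) for x in range(5))
# ...but the sabotaged version pays for its special case in source length.
length_gap = len(SABOTAGED_SOURCE) - len(HONEST_SOURCE)
```

      A length-based selection pressure would favor `honest` here. As the text says, this holds only sometimes; a sabotaged program that happens to be shorter would not be caught this way.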
      Anyway, I’m not sure how much or little this attempt at conveying my thoughts achieves. It’s a bit of a dilemma for me whether to try to convey anything at all, or to wait until I have completed better explainers that I can point people to (as the explainers I have now fall short).
      I guess an incomplete summary of my perspective might be:
      - There are lots of techniques we may apply for getting help with alignment while greatly limiting the extent to which we interact with AGI output in ways that are dangerous.
      - Not being able to directly evaluate outputs we are interested in does not necessarily preclude us from finding indirect methods of evaluation that are powerful.
      - We may do program-search for AGIs that help us explore possible program-requirements, and the spaces of programs (and program outputs) that are compatible with those program-requirements (and program-requirement-requirements, etc). We may learn a lot from patterns we observe relating to these “spaces”.
      - Sometimes there are possible patterns such that if we know ourselves to be sufficiently capable at program-search, P(requirement R1 “forces” outputs to be “good” | we observe pattern P1) reasonably could be deemed to be high, even if we’re not opinionated in regards to P(requirement R1 “forces” outputs to be “good”).
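      One way to read the last point is as a Bayesian update. Writing G for the event “requirement R1 forces outputs to be good” (a label introduced here for brevity, and an interpretation rather than something stated above), Bayes' rule gives:

```latex
P(G \mid P_1) = \frac{P(P_1 \mid G)\,P(G)}{P(P_1 \mid G)\,P(G) + P(P_1 \mid \neg G)\,\bigl(1 - P(G)\bigr)}
```

      With invented numbers for illustration: if P(P1 | G) = 0.5, P(P1 | ¬G) = 0.0005, and the prior is P(G) = 0.01, then P(G | P1) = 0.005 / (0.005 + 0.000495) ≈ 0.91. When the pattern is much more likely under G than under ¬G, the posterior ends up high across a wide range of priors, which is one sense in which the prior need not be strongly held.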
      1. ^
        By which I mean something akin to: “We have a safe and reliable way of testing if the output fulfills the requirements in question”.

        Admittedly, if we haven’t solved some of the technical challenges related to containment, even playing tic tac toe can constitute some risk. But here, I mean to focus on risk from interacting with AGI output, not risk from just having AGI-systems run (and them maybe finding a security vulnerability in the computer systems they are running on).
      2. ^
        (we can pick random predictions to test, and we can have AIs competitively point out predictions made by other AIs that they think are miscalibrated)
      3. ^
        (we can write programs that take alleged prime numbers as input, and test if they really are prime numbers)
      4. ^
        (or from experiments that haven’t been run, but that humans have come up with, such that there is little room for AGIs to somehow influence the experiment to itself be dangerous somehow)
      5. ^
        (we may not be capable of evaluating chess-moves ourselves, but we may safely have AGI-systems play chess against each other, indicating something akin to: “insofar as we feel confident that we are good at program-search, we can feel confident that we can obtain programs that output good chess-moves”)
      6. ^
        There are more and less dangerous ways to do this. With less dangerous ways of doing it, experiments (where actual humans answer questions) are done with humans that are temporarily isolated from the rest of the world (and who certainly aren’t AGI-system operators).
      7. ^
        Such requirements may relate to:
        - How the program is constructed
        - Observables relating to the source code (source code length, etc)
        - Restrictions the source code must adhere to
        - Whether the program is accepted by a given verifier (or any verifier that itself fulfills certain requirements)
        - “Proofs” of various kinds relating to the program
        - Tests of program outputs that the program must be robust in regards to
        - Etc
      8. ^
        By “making it hard” I mean something like “hard to do while being the sort of program we select for when doing program-search”. Kind of like how it’s not “hard” for a chess program to output bad chess-moves, but it’s hard for it to do that while also being the kind of program we continue to select for while doing “gradient descent”.
      9. ^
        In my view of things, this is a very central technique (it may appear circular somehow, but when applied correctly, I don’t think it is). But it’s hard for me to talk about it in a concrete way without going off on tangents, and it’s hard for me to talk about it in an abstract way without being vague. Also, my texts become more convoluted when I try to write about this, and I think people often just glaze over it.
      10. ^
        One example of this: If we are trying to obtain argument evaluators, the argumentation/demonstrations/proofs these evaluators evaluate should be organized into small and modular pieces, such that it’s not clear from any given piece what the macro-level conclusion is.