The first thing you could try is tic-tac-toe in the real world (i.e., the same scenario as above, but as a real-world implementation rather than a Platonic game). Does that still seem fine?
Hm, no, not really.
This function, I would claim, “captures what I really want” from a digit classifier (at least for some contexts of use, like where I am going to use it with a camera at that resolution in an OCR task).
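For concreteness, here is a minimal sketch of the kind of function I have in mind, assuming it is an ordinary supervised cross-entropy loss over (96x96 grayscale image, true digit) pairs; the name digit_loss is illustrative, not the exact function described earlier.

```python
import torch
import torch.nn.functional as F

def digit_loss(logits: torch.Tensor, true_digit: int) -> torch.Tensor:
    """Hypothetical stand-in for 'this function': cross-entropy between a
    network's 10-way logits for one 96x96 grayscale image and that image's
    true digit label (supplied by the labeled dataset)."""
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([true_digit]))
```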
I mean, there are several true mechanistic facts which get swept under the rug by phrases like “captures what I really want” (no fault to you, as I asked for an explanation of this phrase!):
This function provides exact gradients toward the desired network outputs, thus providing “exactly the gradients we want” (see the sketch after this message),
This function would not be safe to “optimize for”, in that, for sufficiently expressive architectures and a fixed initial condition (e.g. the start of an ML experiment), not all interpolating models (models which fit the training data perfectly) are safe,
Furthermore, a model which (by an IMO unrealistic assumption) searched over plans to minimize the time-averaged EV of the number stored in the loss register would kill everyone and negative-wirehead,
For every input image, you can use this function as a classifier to achieve the human-desired behavior.
There are several claims which are not true about this function:
The function does not “represent” our desires/goals for good classification over 96x96 grayscale images, in the sense of having the same type signature as those desires,
Similarly, the function cannot be “aligned” or “unaligned” with our desires/goals, except insofar as it tends to provide cognitive updates which push agents towards their human-intended purposes (like classifying images).
I messaged you two docs which I’ve written on the subject recently.
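A rough illustration of the first “true fact” above, continuing the cross-entropy sketch from earlier (digit_loss is the illustrative function defined there; model, image, and true_digit are assumed toy stand-ins): the function is differentiable in the network’s outputs, so backpropagating through it yields the exact parameter gradients used for the update.

```python
import torch
import torch.nn as nn

# Assumed toy setup: a small network over 96x96 grayscale images.
model = nn.Sequential(nn.Flatten(), nn.Linear(96 * 96, 10))
image = torch.rand(1, 1, 96, 96)        # one illustrative grayscale image
true_digit = 7                          # its (assumed) true label

logits = model(image).squeeze(0)        # 10-way logits for this image
loss = digit_loss(logits, true_digit)   # the illustrative loss sketched earlier
loss.backward()                         # exact gradients: p.grad is now populated for every parameter p
```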
OK let’s start here then. If what I really want is an AI that plays tic-tac-toe (TTT) in the real world well, what exactly is wrong with saying the reward function I described above captures what I really want?
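(The reward function referred to here was described earlier in the conversation and is not reproduced in this excerpt. Purely for illustration, assume something like a sparse win/loss reward:)

```python
from typing import Optional

def ttt_reward(game_over: bool, winner: Optional[str]) -> float:
    """Illustrative sparse reward for real-world tic-tac-toe: +1 when the
    agent has won, -1 when it has lost, 0 otherwise (mid-game and draws).
    Assumed for illustration, not the exact function described earlier."""
    if not game_over:
        return 0.0
    if winner == "agent":
        return 1.0
    if winner == "opponent":
        return -1.0
    return 0.0  # draw
```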
“There are several claims which are not true about this function:”
Neither of those claims seemed right to me. Can you say what the type signature of our desires (e.g., for good classification over grayscale images) is? [I presume the problem you’re getting at isn’t as simple as wanting desires to look like (image, digit-label, goodness) tuples as opposed to (image, correct digit-label) tuples.]
“what exactly is wrong with saying the reward function I described above captures what I really want?”
Well, first of all, that reward function is not outer aligned to TTT, by the following definition:
“My definition says that an objective function r is outer aligned if all models optimal under r in the limit of perfect optimization and unlimited data are aligned.”
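In symbols (a paraphrase of that definition, in assumed notation rather than its original formalism): $r$ is outer aligned iff every model $M \in \arg\max_{M'} \mathbb{E}_{M'}\!\left[\sum_t r_t\right]$, with the $\arg\max$ taken in the limit of perfect optimization and unlimited data, is aligned.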
There exist models which just wirehead, or set the reward to +1, or show themselves a win observation over and over; such models are optimal under that reward function in the definition’s sense, and yet they do not actually play TTT in any real way, so the reward function fails the definition. Even restricted to training, a deceptive agent can play perfect TTT and then, in deployment, kill everyone. (So the TTT-alignment problem is unsolved! Uh oh! But that’s not a problem in reality.)
So, since reward functions don’t have the type of “goal”, what does it mean to say the real-life reward function “captures” what you want re: TTT, besides the empirical fact that training current models on that reward signal + curriculum will make them play good TTT and nothing else?
“Can you say what the type signature of our desires (e.g., for good classification over grayscale images) is?”
I don’t know, but it’s not that of the loss function! Answering “what is the type signature?” isn’t needed in order to establish “the type signature is not that of the loss function”, which was the point I was making. That said, maybe some of my values more strongly bid for plans where the AI has certain kinds of classification behavior?
My main point is that this “reward/loss indicates what we want” framing just breaks down if you scrutinize it carefully. Reward/loss just gives cognitive updates. It doesn’t have to indicate what we really want, and wishing for such a situation seems incoherent/wrong/misleading as to what we have to solve in alignment.
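To make the “reward/loss just gives cognitive updates” point concrete, here is a generic REINFORCE-style parameter update (an illustrative sketch, not any particular training setup from this conversation): the reward never appears as an object the network represents or pursues; it only scales the gradient of the log-probabilities of the actions that were actually taken.

```python
import torch

def reinforce_update(optimizer: torch.optim.Optimizer,
                     log_probs: list[torch.Tensor],
                     episode_return: float) -> None:
    """Generic REINFORCE-style update: the scalar return only scales the
    gradient of the log-probabilities of the actions taken this episode.
    The reward function itself is never an object the policy 'sees';
    it is just a source of parameter updates."""
    optimizer.zero_grad()
    loss = -episode_return * torch.stack(log_probs).sum()
    loss.backward()
    optimizer.step()
```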