Reality hits back on the models we train via loss functions based on reality-generated data. But alignment also hits back on models we train, because we also use loss functions (based on preference data). These seem to be symmetrically powerful forces.
Alignment doesn’t hit back; the loss function hits back, and the loss function doesn’t capture what you really want (e.g. because killing the humans and taking control of a reward button will max reward, deceiving human raters will increase ratings, etc.). If what we wanted were exactly captured in a loss function, alignment would be easier. Not easy, because outer optimization doesn’t create good inner alignment, but easier than the present case.
the loss function doesn’t capture what you really want
This seems like a type error to me. What does it mean for a reward function to “capture what I really want”? Can anyone give even a handwavy operationalization of such a scenario, so I can try to imagine something concrete?
Sure, one concrete example is the reward function in the tic-tac-toe environment (from X’s perspective) that returns −1 when the game is over and O has won, returns +1 when the game is over and X has won, and returns 0 on every other turn (including when the game ends in a draw), presuming that what I really want is for X to win in as few turns as possible.
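For concreteness, here is a minimal sketch of that reward function in code. The board representation (a sequence of nine cells, each ‘X’, ‘O’, or None) and the winner helper are assumptions of the sketch, not part of the description above; only the reward logic mirrors it.

```python
# Minimal sketch of the tic-tac-toe reward function described above, from X's
# perspective. The board representation and `winner` helper are assumptions.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None

def reward(board):
    """+1 when the game is over and X has won, -1 when O has won,
    and 0 on every other turn (including a game that ends in a draw)."""
    w = winner(board)
    if w == 'X':
        return 1.0
    if w == 'O':
        return -1.0
    return 0.0
```

(One caveat: “win in as few turns as possible” only falls out of this reward if returns are discounted, so that earlier wins are worth more; the discount factor is an assumption on my part, not something stated above.)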
I can probably illustrate something outside of such a clean game context too, but I’m curious what your response to this one is first, and I want to make sure this example is as clear as it needs to be.
Yes, I can imagine that for a simple game like tic-tac-toe. I want an example which is not for a Platonic game, but for the real world.
What about the real world is important here? The first thing you could try is tic-tac-toe in the real world (i.e., the same scenario as above but don’t think of a Platonic game but a real world implementation). Does that still seem fine?
Another aspect of the real world is that we don’t necessarily have compact specifications of what we want. Consider the (Platonic) function that assigns to every 96x96 grayscale (8 bits per pixel) image a label from {0, 1, …, 9, X} and correctly labels unambiguous images of digits (with X for the non-digit or ambiguous images). This function I would claim “captures what I really want” from a digit-classifier (at least for some contexts of use, like where I am going to use it with a camera at that resolution in an OCR task), although I don’t know how to implement it. A smaller dataset of images whose labels agree with that function, and training losses derived from that dataset, I would say inherit this property of “capturing what I really want”, though imperfectly, due to the possibilities of suboptimality and of generalisation failure.
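As a rough sketch of the type being described (the names, the stand-in for the Platonic function, and the dataset-derived loss are illustrative assumptions, not anything specified above):

```python
from typing import Callable, List, Literal, Tuple
import numpy as np

# A label is one of the ten digits, or 'X' for non-digit / ambiguous images.
Label = Literal['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'X']

# Stand-in for the Platonic labelling function: any 96x96, 8-bit grayscale
# image -> Label. Its type and intended behaviour are specified, but no
# implementation is known, so it is deliberately left unimplemented here.
def true_label(image: np.ndarray) -> Label:
    raise NotImplementedError("Platonic function; no known implementation")

# A finite dataset whose labels agree with true_label, and a loss derived
# from it, inherit the property only imperfectly: here, average zero-one
# error of a candidate classifier over the sampled pairs.
def empirical_loss(classifier: Callable[[np.ndarray], Label],
                   dataset: List[Tuple[np.ndarray, Label]]) -> float:
    return sum(classifier(img) != lbl for img, lbl in dataset) / len(dataset)
```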
The first thing you could try is tic-tac-toe in the real world (i.e., the same scenario as above but don’t think of a Platonic game but a real world implementation). Does that still seem fine?
Hm, no, not really.
This function I would claim “captures what I really want” from a digit-classifier (at least for some contexts of use, like where I am going to use it with a camera at that resolution in an OCR task)
I mean, there are several true mechanistic facts which get swept under the rug by phrases like “captures what I really want” (no fault to you, as I asked for an explanation of this phrase!):
This function provides exact gradients toward the desired network outputs, i.e. “exactly the gradients we want” (see the sketch after this list),
This function would not be safe to “optimize for”, in that, for sufficiently expressive architectures and a fixed initial condition (e.g. the start of an ML experiment), not all interpolating models are safe,
Furthermore, a model which (under an assumption I consider unrealistic) searched over plans to minimize the time-averaged expected value of the number stored in the loss register would kill everyone and negative-wirehead,
For every input image, you can use this function as a classifier to achieve the human-desired behavior.
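To unpack the first point, here is a small sketch under the assumed setup of cross-entropy training against the Platonic function’s labels: the gradient with respect to the network’s output logits is softmax(logits) minus the one-hot encoding of the true label, so for every image it points exactly toward the desired output.

```python
import numpy as np

NUM_CLASSES = 11  # digits 0-9 plus the 'X' (non-digit / ambiguous) class

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Gradient of the cross-entropy loss with respect to the logits when the
# target is the label assigned by the Platonic function: softmax(logits)
# minus the one-hot target, i.e. "exactly the gradients we want" for
# steering the network's outputs toward the desired label.
def grad_wrt_logits(logits: np.ndarray, true_class_index: int) -> np.ndarray:
    target = np.zeros(NUM_CLASSES)
    target[true_class_index] = 1.0
    return softmax(logits) - target
```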
There are several claims which are not true about this function:
The function does not “represent” our desires/goals for good classification over 96x96 grayscale images, in the sense of having the same type signature as those desires,
Similarly, the function cannot be “aligned” or “unaligned” with our desires/goals, except insofar as it tends to provide cognitive updates which push agents towards their human-intended purposes (like classifying images).
I messaged you two docs which I’ve written on the subject recently.
OK let’s start here then. If what I really want is an AI that plays tic-tac-toe (TTT) in the real world well, what exactly is wrong with saying the reward function I described above captures what I really want?
There are several claims which are not true about this function:
Neither of those claims seemed right to me. Can you say what the type signature of our desires (e.g., for good classification over grayscale images) is? [I presume the problem you’re getting at isn’t as simple as wanting desires to look like (image, digit-label, goodness) tuples as opposed to (image, correct digit-label) tuples.]
what exactly is wrong with saying the reward function I described above captures what I really want?
Well, first of all, that reward function is not outer aligned to TTT, by the following definition:
“My definition says that an objective function r is outer aligned if all models optimal under r in the limit of perfect optimization and unlimited data are aligned.”
There exist models which just wirehead, or set the reward to +1, or show themselves a win observation over and over; these can be optimal under that reward in the limit of perfect optimization, yet they are not actually playing TTT in any real sense, so the reward function fails that definition. Even restricted to training, a deceptive agent can play perfect TTT and then, in deployment, kill everyone. (So the TTT-alignment problem is unsolved! Uh oh! But that’s not a problem in reality.)
So, since reward functions don’t have the type of “goal”, what does it mean to say the real-life reward function “captures” what you want re: TTT, besides the empirical fact that training current models on that reward signal+curriculum will make them play good TTT and nothing else?
Can you say what the type signature of our desires (e.g., for good classification over grayscale images) is?
I don’t know, but it’s not that of the loss function! Answering “what is the type signature?” isn’t needed to establish “the type signature is not that of the loss function”, which is the point I was making. That said, maybe some of my values more strongly bid for plans where the AI has certain kinds of classification behavior?
My main point is that this “reward/loss indicates what we want” framing just breaks down if you scrutinize it carefully. Reward/loss just gives cognitive updates. It doesn’t have to indicate what we really want, and wishing for such a situation seems incoherent/wrong/misleading as to what we have to solve in alignment.
It’s not clear to me why you think the concept of reward functions “breaks down” when applied to more complicated environments. I think maybe you mean to ask for something else.
Isn’t that literally the alignment problem? Come up with a loss function that captures what we want an AI to do in the real world, and then it’s easy enough to make an AI that does what we want it to do.
Not at all. That’s part of what makes it hard. You still have to engineer an AI to optimize that loss function and not some intermediate target, using the actual ML methods that have yet to be pioneered, even if you have such a literal utility function to measure out rewards with. If after your training loop you create some sort of mesa-optimizer that optimizes something not quite that loss function, you lose.