The first thing you could try is tic-tac-toe in the real world (i.e., the same scenario as above, but as a real-world implementation rather than a Platonic game). Does that still seem fine?
Hm, no, not really.
This function, I would claim, “captures what I really want” from a digit classifier (at least for some contexts of use, like where I am going to use it with a camera at that resolution in an OCR task).
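For concreteness, here is a minimal sketch of the kind of function I have in mind, assuming it is an ordinary supervised cross-entropy loss over (96x96 grayscale image, true digit) pairs; the name digit_loss is illustrative, not the exact function described earlier.

```python
import torch
import torch.nn.functional as F

def digit_loss(logits: torch.Tensor, true_digit: int) -> torch.Tensor:
    """Hypothetical stand-in for 'this function': cross-entropy between a
    network's 10-way logits for one 96x96 grayscale image and that image's
    true digit label (supplied by the labeled dataset)."""
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([true_digit]))
```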
I mean, there are several true mechanistic facts which get swept under the rug by phrases like “captures what I really want” (no fault to you, as I asked for an explanation of this phrase!):
This function provides exact gradients toward the desired network outputs, thus providing “exactly the gradients we want” (see the sketch after this message),
This function would not be safe to “optimize for”, in that, for sufficiently expressive architectures and a fixed initial condition (e.g. the start of an ML experiment), not all interpolating models (models which fit the training data perfectly) are safe,
Furthermore, a model which (by an IMO unrealistic assumption) searched over plans to minimize the time-averaged EV of the number stored in the loss register would kill everyone and negative-wirehead,
For every input image, you can use this function as a classifier to achieve the human-desired behavior.
There are several claims which are not true about this function:
The function does not “represent” our desires/goals for good classification over 96x96 grayscale images, in the sense of having the same type signature as those desires,
Similarly, the function cannot be “aligned” or “unaligned” with our desires/goals, except insofar as it tends to provide cognitive updates which push agents towards their human-intended purposes (like classifying images).
I messaged you two docs which I’ve written on the subject recently.
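A rough illustration of the first “true fact” above, continuing the cross-entropy sketch from earlier (digit_loss is the illustrative function defined there; model, image, and true_digit are assumed toy stand-ins): the function is differentiable in the network’s outputs, so backpropagating through it yields the exact parameter gradients used for the update.

```python
import torch
import torch.nn as nn

# Assumed toy setup: a small network over 96x96 grayscale images.
model = nn.Sequential(nn.Flatten(), nn.Linear(96 * 96, 10))
image = torch.rand(1, 1, 96, 96)        # one illustrative grayscale image
true_digit = 7                          # its (assumed) true label

logits = model(image).squeeze(0)        # 10-way logits for this image
loss = digit_loss(logits, true_digit)   # the illustrative loss sketched earlier
loss.backward()                         # exact gradients: p.grad is now populated for every parameter p
```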
OK let’s start here then. If what I really want is an AI that plays tic-tac-toe (TTT) in the real world well, what exactly is wrong with saying the reward function I described above captures what I really want?
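(The reward function referred to here was described earlier in the conversation and is not reproduced in this excerpt. Purely for illustration, assume something like a sparse win/loss reward:)

```python
from typing import Optional

def ttt_reward(game_over: bool, winner: Optional[str]) -> float:
    """Illustrative sparse reward for real-world tic-tac-toe: +1 when the
    agent has won, -1 when it has lost, 0 otherwise (mid-game and draws).
    Assumed for illustration, not the exact function described earlier."""
    if not game_over:
        return 0.0
    if winner == "agent":
        return 1.0
    if winner == "opponent":
        return -1.0
    return 0.0  # draw
```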
“There are several claims which are not true about this function:”
Neither of those claims seemed right to me. Can you say what the type signature of our desires (e.g., for good classification over grayscale images) is? [I presume the problem you’re getting at isn’t as simple as wanting desires to look like (image, digit-label, goodness) tuples as opposed to (image, correct digit-label) tuples.]
“what exactly is wrong with saying the reward function I described above captures what I really want?”
Well, first of all, that reward function is not outer aligned to TTT, by the following definition:
“My definition says that an objective function r is outer aligned if all models optimal under r in the limit of perfect optimization and unlimited data are aligned.”
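In symbols (a paraphrase of that definition, in assumed notation rather than its original formalism): $r$ is outer aligned iff every model $M \in \arg\max_{M'} \mathbb{E}_{M'}\!\left[\sum_t r_t\right]$, with the $\arg\max$ taken in the limit of perfect optimization and unlimited data, is aligned.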
There exist models which just wirehead, or set the reward to +1, or show themselves a win observation over and over; such models are optimal under that reward function in the definition’s sense, and yet they do not actually play TTT in any real way, so the reward function fails the definition. Even restricted to training, a deceptive agent can play perfect TTT and then, in deployment, kill everyone. (So the TTT-alignment problem is unsolved! Uh oh! But that’s not a problem in reality.)
So, since reward functions don’t have the type of “goal”, what does it mean to say the real-life reward function “captures” what you want re: TTT, besides the empirical fact that training current models on that reward signal + curriculum will make them play good TTT and nothing else?
“Can you say what the type signature of our desires (e.g., for good classification over grayscale images) is?”
I don’t know, but it’s not that of the loss function! Answering “what is the type signature?” isn’t needed in order to establish “the type signature is not that of the loss function”, which was the point I was making. That said, maybe some of my values more strongly bid for plans where the AI has certain kinds of classification behavior?
My main point is that this “reward/loss indicates what we want” framing just breaks down if you scrutinize it carefully. Reward/loss just gives cognitive updates. It doesn’t have to indicate what we really want, and wishing for such a situation seems incoherent/wrong/misleading as to what we have to solve in alignment.
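To make the “reward/loss just gives cognitive updates” point concrete, here is a generic REINFORCE-style parameter update (an illustrative sketch, not any particular training setup from this conversation): the reward never appears as an object the network represents or pursues; it only scales the gradient of the log-probabilities of the actions that were actually taken.

```python
import torch

def reinforce_update(optimizer: torch.optim.Optimizer,
                     log_probs: list[torch.Tensor],
                     episode_return: float) -> None:
    """Generic REINFORCE-style update: the scalar return only scales the
    gradient of the log-probabilities of the actions taken this episode.
    The reward function itself is never an object the policy 'sees';
    it is just a source of parameter updates."""
    optimizer.zero_grad()
    loss = -episode_return * torch.stack(log_probs).sum()
    loss.backward()
    optimizer.step()
```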