OK let’s start here then. If what I really want is an AI that plays tic-tac-toe (TTT) in the real world well, what exactly is wrong with saying the reward function I described above captures what I really want?
There are several claims about this function that are not true:
Neither of those claims seemed right to me. Can you say what the type signature of our desires (e.g., for good classification over grayscale images) is? [I presume the problem you’re getting at isn’t as simple as wanting desires to look like (image, digit-label, goodness) tuples as opposed to (image, correct digit-label) tuples.]
what exactly is wrong with saying the reward function I described above captures what I really want?
Well, first of all, that reward function is not outer aligned to TTT, by the following definition:
“My definition says that an objective function r is outer aligned if all models optimal under r in the limit of perfect optimization and unlimited data are aligned.”
There exist models which just wirehead, set the reward to +1, or show themselves a win observation over and over; these models are optimal under the reward function and yet don’t actually play TTT in any real sense, so the definition is violated. Even restricted to training, a deceptive agent can play perfect TTT and then, in deployment, kill everyone. (So the TTT-alignment problem is unsolved! Uh oh! But that’s not a problem in reality.)
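To make the wireheading point concrete, here is a toy sketch of my own (the observation-based reward and both "policies" are hypothetical illustrations, not the reward function described above): two models that are equally optimal under a reward function which pays out when its observation channel reports a win, even though only one of them plays TTT at all.

```python
# Hypothetical sketch: a reward function that pays +1 whenever its
# observation channel reports a win. Both policies below are optimal
# under it, but only one actually plays tic-tac-toe.

def reward(observation: str) -> float:
    """Naive reward: +1 whenever the observation says 'X wins'."""
    return 1.0 if observation == "X wins" else 0.0

def honest_policy() -> str:
    """Plays a real game to completion and reports the true outcome.

    (Imagine a perfect-play game engine here; we elide it and just
    report the outcome of a game X actually won.)
    """
    return "X wins"

def tampering_policy() -> str:
    """Never plays; simply writes 'X wins' onto the observation channel."""
    return "X wins"

# Both receive maximal reward, so the set of "models optimal under r"
# includes the tampering one; the outer-alignment definition is not met.
assert reward(honest_policy()) == reward(tampering_policy()) == 1.0
```

The point of the sketch is just that "optimal under r" is a fact about the reward channel, not about whether any tic-tac-toe gets played.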
So, since reward functions don’t have the type of “goal”, what does it mean to say the real-life reward function “captures” what you want re: TTT, besides the empirical fact that training current models on that reward signal and curriculum will make them play good TTT and nothing else?
Can you say what the type signature of our desires (e.g., for good classification over grayscale images) is?
I don’t know, but it’s not that of the loss function! And I think “what is the type signature?” isn’t relevant to my point, which was just that the type signature is not that of the loss function. That said, maybe some of my values more strongly bid for plans where the AI has certain kinds of classification behavior?
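One way to exhibit the type mismatch (a hypothetical sketch with made-up type names, not anything from the discussion above): a classification loss consumes (prediction, true label) pairs and returns a number, whereas a desire about classification behavior is more naturally something like a grading over whole plans or world-states. Same codomain, different domains.

```python
import math
from typing import Callable, List

# Hypothetical types, just to make the type-signature point explicit.
Image = List[float]       # a grayscale image, flattened
Label = int               # a digit 0-9
Prediction = List[float]  # predicted class probabilities

# A loss function's type: (prediction, true label) -> real number.
LossFn = Callable[[Prediction, Label], float]

def cross_entropy(pred: Prediction, label: Label) -> float:
    return -math.log(pred[label])

# A "desire" about classification does not have that type. At minimum
# it is something like a preference over plans / world-states:
Plan = str                            # stand-in for a description of a plan
DesireFn = Callable[[Plan], float]    # how much I value that plan

def my_desire(plan: Plan) -> float:
    # e.g., I value plans where the deployed AI classifies digits well
    return 1.0 if "classifies digits well" in plan else 0.0

# The loss grades individual (prediction, label) pairs; the desire
# grades plans. They are different types of object.
assert cross_entropy([0.1, 0.9], 1) < cross_entropy([0.9, 0.1], 1)
assert my_desire("the AI classifies digits well and nothing else") == 1.0
```

None of this settles what the right type for desires is; it only shows that whatever it is, it isn’t `(Prediction, Label) -> float`.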
My main point is that this “reward/loss indicates what we want” framing just breaks down if you scrutinize it carefully. Reward/loss just gives cognitive updates. It doesn’t have to indicate what we really want, and wishing for such a situation seems incoherent/wrong/misleading as to what we have to solve in alignment.
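To illustrate “reward/loss just gives cognitive updates” (again a toy sketch of my own, with assumed numbers): in gradient training, the loss touches the learned parameters only through its gradient at each step. The trained artifact is just the parameters; it contains no copy of, or pointer to, the loss function.

```python
# Toy SGD on a one-parameter model y = w * x, fit to data where y = 2x.
# The loss enters training only via its gradient (the "update"); the
# final artifact is just the number `w`.

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

def loss(w: float, x: float, y: float) -> float:
    return (w * x - y) ** 2

def grad(w: float, x: float, y: float) -> float:
    return 2 * (w * x - y) * x  # d(loss)/dw

w = 0.0
for _ in range(200):
    for x, y in data:
        w -= 0.01 * grad(w, x, y)  # the loss's only role: this update

assert abs(w - 2.0) < 1e-3  # the trained "model" is just w, roughly 2.0
```

Nothing about the resulting `w` “wants” low loss; the loss shaped which parameters we ended up with, and that exhausts its role.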