This comment made the MIRI-style pessimist’s position clearer to me (I think?), so thank you for it.
I want to try my hand at a kind of disagreement / response, and then at predicting your response to my response, to see how my model of MIRI-style pessimism stands up, if you’re up for it.
Response: You state that reality “bites back” for wrong beliefs but not wrong preferences. This seems like it is only contingently true; reality will “bite back” according to whatever loss function I put into my system, with whatever relative weightings I give it. If I reward my LLM (or other AI) for doing the right thing across a multitude of examples that constitute 50% of my training set, 50% of my test set, and 50% of two different validation sets, then from the perspective of the LLM (or other AI) reality bites back just as much for learning the wrong preferences as it does for learning false facts about the world. So we should expect it to learn to act in ways that I like.
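To make that concrete, here is a minimal sketch of what I mean (the model, labels, and weights are all invented for illustration): from the system’s point of view there is just one scalar loss, and how hard reality “bites back” for wrong preferences versus wrong beliefs is set by whatever weights I choose.

```python
import torch

# Toy sketch (illustrative only): one loss, two terms. "Wrong beliefs" and
# "wrong preferences" are penalized by the same mechanism, with relative
# weightings that the designer picks.
torch.manual_seed(0)
model = torch.nn.Linear(8, 2)                  # stand-in model: [belief logit, action logit]
opt = torch.optim.SGD(model.parameters(), lr=0.1)
W_BELIEFS, W_PREFS = 1.0, 1.0                  # my choice of relative weighting

bce = torch.nn.functional.binary_cross_entropy_with_logits
for _ in range(200):
    x = torch.randn(32, 8)                     # stand-in for contexts
    true_facts = (x[:, 0] > 0).float()         # "what the world is like"
    wanted_acts = (x[:, 1] > 0).float()        # "what I want it to do"
    belief_logit, action_logit = model(x).unbind(dim=1)
    loss = W_BELIEFS * bce(belief_logit, true_facts) + W_PREFS * bce(action_logit, wanted_acts)
    opt.zero_grad()
    loss.backward()                            # reality "bites back" through this gradient,
    opt.step()                                 # for both kinds of mistake
```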
Predicted response to response: This will work for shallow, relatively stupid AIs, trained purely in a supervised fashion, like we currently have. BUT once we have LLMs / AIs that can do complex things, like predict macroeconomic world states, they’ll have abilities to reason and update their own beliefs in a complex fashion. This will remain uniformly rewarded by reality—but we will no longer have the capacity to give feedback on this higher-level process because (????), so it breaks.
Or response—This will work for shallow, stupid AIs trained like the ones we currently have. But once we have LLMs / AIs that can do complex things, like predict macroeconomic world states, then they’re going to be able to go out of domain in a very high-dimensional space of action, from the perspective of our training / test set. And this out-of-domainness is unavoidable, because that’s what solving complex problems in the world means—it means problems that aren’t simply contained in the training set. And this means that in some corner of the world, we’re guaranteed to find that they’ve been reinforced to want something that doesn’t accord with our preferences.
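If I had to make that predicted objection concrete, it would look something like this toy extrapolation example (purely illustrative): two fits that are equally good on the training range can behave arbitrarily differently outside it, and nothing in training ever pushed back on either.

```python
import numpy as np

# Toy picture of the out-of-domain worry (illustrative only): training data
# only "bites back" where the training data is.
x_train = np.linspace(0.0, 1.0, 10)
y_train = np.sin(2 * np.pi * x_train)

fit_a = np.polynomial.Polynomial.fit(x_train, y_train, deg=3)
fit_b = np.polynomial.Polynomial.fit(x_train, y_train, deg=9)

for x in (0.5, 1.5, 3.0):                      # in-domain, then increasingly out of domain
    print(x, round(float(fit_a(x)), 2), round(float(fit_b(x)), 2))
# Near the training range the two fits roughly agree; far outside it they
# diverge wildly, and neither was ever penalized for what it does out there.
```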
Meh, I doubt that’s gonna pass an ITT, but wanted to give it a shot.
Suppose that I’m trying to build a smarter-than-human AI that has a bunch of capabilities (including, e.g., ‘be good at Atari games’), and that has the goal ‘maximize the amount of diamond in the universe’. It’s true that current techniques let you provide greater than zero pressure in the direction of ‘maximize the amount of diamond in the universe’, but there are several important senses in which reality doesn’t ‘bite back’ here:
1. If the AI acquires an unrelated goal (e.g., calculate as many digits of pi as possible), and acquires the belief ‘I will better achieve my true goal if I maximize the amount of diamond’ (e.g., because it infers that its programmer wants that, or just because an SGD-ish process nudged it in the direction of having such a belief), then there’s no way in which reality punishes or selects against that AGI (relative to one that actually has the intended goal). (A toy illustration of this point follows after the list.)
2. Things that make the AI better at some Atari games will tend to make it better at other Atari games, but won’t tend to make it care more about maximizing diamonds. More generally, things that make an AI more capable tend to go together (especially once you get to higher levels of difficulty, generality, non-brittleness, etc.), whereas none of them go together with “terminally value a universe full of diamond”.
3. If we succeed in partly instilling the goal into the AI (e.g., it now likes carbon atoms a lot), then this doesn’t provide additional pressure for the AI to internalize the rest of the goal. There’s no attractor basin where if you have half of human values, you’re under more pressure to acquire the other half. In contrast, if you give AI high levels of capability in half the capabilities, it will tend to want all the rest of the capabilities too; and whatever keeps it from succeeding on general reasoning and problem-solving will also tend to keep it from succeeding on the narrow task you’re trying to get it to perform. (More so to the extent the task is hard.)
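A toy illustration of point 1 (the functions, data, and numbers are all made up for the example): a system pursuing a proxy goal that happens to agree with the intended behavior on the training distribution gets exactly the same loss as one with the intended goal, so nothing selects between them.

```python
import torch

# Illustrative sketch: "intended" and "proxy" agree everywhere the training
# signal looks, so the loss cannot punish the proxy relative to the intended
# goal, even though the two come apart off-distribution.
torch.manual_seed(0)
x_train = torch.rand(256, 1)                   # training inputs live in [0, 1]
y_train = x_train.clone()                      # intended behavior: y = x

def intended(x):                               # "really wants" y = x, everywhere
    return x

def proxy(x):                                  # agrees on [0, 1], diverges outside
    return torch.where(x <= 1.0, x, 10.0 * x)

mse = torch.nn.functional.mse_loss
print(mse(intended(x_train), y_train).item())  # 0.0
print(mse(proxy(x_train), y_train).item())     # also 0.0 -> no selection pressure

x_far = torch.tensor([[5.0]])
print(intended(x_far).item(), proxy(x_far).item())   # 5.0 vs 50.0
```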
(There are also separate issues, like ‘we can’t provide a training signal where we thumbs-down the AI destroying the world, because we die in those worlds’.)
Thanks for the response.
I’m still quite unconvinced, which of course you’d predict. Like, regarding 3:
“There’s no attractor basin where if you have half of human values, you’re under more pressure to acquire the other half.”
Sure there is—over the course of learning anything, you get better and better feedback from training as your mistakes get more fine-grained. If you acquire a “don’t lie” principle without also acquiring “but it’s ok to lie to Nazis”, then you’ll be punished, for instance. After you learn the more basic things, you’ll be pushed to acquire the less basic ones, so the reinforcement you get becomes more and more detailed. This is just like how an RL model learns to stumble forward before it learns to walk cleanly, or how LLMs learn associations before learning higher-order correlations.
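Here is a toy version of what I mean (all details are made up, and the setup is as simple as possible): a two-knob policy trained with plain REINFORCE. The common rule is reinforced constantly and learned early; the rare exception gets fewer updates, so it is learned later, but every failure there is still punished, so it gets learned too.

```python
import math, random

# Toy sketch (illustrative only): REINFORCE on two independent "knobs".
# The common case ("don't lie") is reinforced constantly; the rare exception
# ("but lie to the Nazi at the door") gets fewer updates, so it is learned
# later -- but every failure there is still punished, so it does get learned.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

theta_truth = 0.0        # logit of P(tell the truth) in ordinary contexts
theta_exception = 0.0    # logit of P(lie) when asked where people are hidden
lr = 0.25
random.seed(0)

for step in range(1, 5001):
    rare_context = random.random() < 0.02        # the exception comes up rarely
    theta = theta_exception if rare_context else theta_truth
    p = sigmoid(theta)
    did_right_thing = random.random() < p        # sample the contextually right action?
    reward = 1.0 if did_right_thing else -1.0    # reality bites back either way
    grad = reward * ((1 - p) if did_right_thing else -p)   # r * d(log pi)/d(theta)
    if rare_context:
        theta_exception += lr * grad
    else:
        theta_truth += lr * grad
    if step in (500, 5000):
        print(step, round(sigmoid(theta_truth), 3), round(sigmoid(theta_exception), 3))
# Typically: by step 500 the broad rule is near 1.0 while the exception lags
# well behind; by step 5000 the exception has been pushed up too.
```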
There is no attractor basin in the world for ML, apart from the actual mechanisms by which there are attractor basins for a thing! MIRI always talks as if there’s an abstract basin that rules things and gives us instrumental convergence, without reference to a particular training technique! But we control literally all the gradients our training techniques produce. “Don’t hurl coffee across the kitchen at the human when they ask for it” sits in the same high-dimensional basin as “Don’t kill all humans when they ask for a cure for cancer.”
“In contrast, if you give AI high levels of capability in half the capabilities, it will tend to want all the rest of the capabilities too.”
ML doesn’t acquire wants over the space of training techniques that are used to give it capabilities; it acquires “wants” from reinforced behaviors within the space of training techniques. These reinforced behaviors can be literally as human-morality-sensitive as we’d like. If we don’t put it in a circumstance where a particular kind of coherence is rewarded, it just won’t get that kind of coherence; the ease with which we’ll be able to do this is of course underscored by how blind most ML systems are.
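As a toy picture of that last point (hypothetical and deliberately simplified): a tabular Q-learner whose training episodes only ever cover half of its state space. The half it never trains in acquires no reinforced behavior at all.

```python
import random

# Toy sketch (illustrative only): tabular Q-learning on two disconnected
# 5-state "rooms". Training only ever happens in room A (states 0-4, reward
# at state 0); room B (states 5-9) is never in the training distribution,
# so no reinforced behavior ("wants") ever forms there.
ACTIONS = (-1, +1)
Q = {(s, a): 0.0 for s in range(10) for a in ACTIONS}
alpha, gamma, eps = 0.5, 0.9, 0.2
random.seed(0)

def step(s, a):
    s2 = min(4, max(0, s + a)) if s < 5 else min(9, max(5, s + a))
    return s2, (1.0 if s2 == 0 else 0.0)      # only state 0 is ever rewarded

for episode in range(500):
    s = random.randint(1, 4)                  # episodes start in room A only
    for _ in range(50):
        a = random.choice(ACTIONS) if random.random() < eps else \
            max(ACTIONS, key=lambda x: Q[(s, x)])
        s2, r = step(s, a)
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, x)] for x in ACTIONS) - Q[(s, a)])
        s = s2
        if s == 0:
            break

print("room A:", [round(max(Q[(s, a)] for a in ACTIONS), 2) for s in range(5)])
print("room B:", [round(max(Q[(s, a)] for a in ACTIONS), 2) for s in range(5, 10)])
# Room A learns values pointing toward the reward; room B's values are all
# still exactly 0.0 -- no circumstance, no reinforcement, no "coherence".
```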