EY argues that human values are hard to learn. Katja uses human faces as an analogy, pointing out that ML systems learn natural concepts far more easily than EY circa 2009 expected.
The analogy is between A: a function which maps noise to realistic images of human faces and B: a function which maps predicted future world states to utility scores similar to how a human would score them. The lesson is that since ML systems can learn A very well, they can probably also learn B.
Function A (human face generator) does not even use max-likelihood sampling and it isn’t even an optimizer, so your operationalization is just confused. Nor is function B an optimizer itself.
I claim that A and B are in fact very disanalogous objects, and that the claim that A can be learned well does not imply that B can probably be learned well. I am very confused by your claims about the functions A and B not being optimizers, because to me this is true but also irrelevant.
The reason we want a function B that can map world states to utilities is so that we can optimize on that number. We want to select for world states that we think will have high utility using B; otherwise function B is pretty useless. Therefore, this function has to be reliable enough that putting lots of optimization pressure on it does not break it. This is not the same as claiming that the function itself is an optimizer or anything like that. Making something reliable against lots of optimization pressure is a lot harder than making it reliable in the training distribution.
The function A effectively allows you to sample from the distribution of faces. Function A does not have to be robust against adversarial optimization to approximate the distribution. The analogous function in the domain of human values would be a function that lets you sample from some prior distribution of world states, not one that scores utility of states.
More generally, I think the confusion here stems from the fact that a) robustness against optimization is far harder than modelling typical elements of a distribution, and b) distributions over states are fundamentally different objects from utility functions over states.
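To make the type distinction concrete, here is a minimal toy sketch (pure illustration; `sample_face` and `learned_utility` are made-up stand-ins, not real models):

```python
import numpy as np

rng = np.random.default_rng(0)

# Type-A object: a sampler. Its only job is to produce *typical* members of a
# distribution; nothing ever searches over its outputs adversarially.
def sample_face(noise):
    return np.tanh(noise)  # hypothetical generator stand-in

# Type-B object: a scorer. Its job is to rank arbitrary candidate states,
# including ones chosen precisely because they score highly.
def learned_utility(state):
    return -np.sum((state - 1.0) ** 2)  # hypothetical learned utility stand-in

# Using A: draw typical samples -- no optimization pressure involved.
typical_samples = [sample_face(rng.standard_normal(8)) for _ in range(5)]

# Using B: search for the highest-scoring candidate. The scorer is now queried
# hardest exactly where its training data is thinnest -- a failure mode the
# sampler never has to face.
candidates = [rng.standard_normal(8) * 20 for _ in range(100_000)]
best_state = max(candidates, key=learned_utility)
```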
Nate’s analogy is confused: diffusion models do not generate convincing samples of faces by maximizing for faciness—see how they actually work, and make sure we agree there. This is important because previous systems (such as DeepDream) could be described as maximizing for X, for which Nate’s critique would be more relevant.
Your comment here about “optimizing for X-ness” indicates you were also adopting the wrong model of how diffusion models operate:
It’s the relevant operationalization because in the context of an AI system optimizing for X-ness of states S, the thing that matters is not what the max-likelihood sample of some prior distribution over S is, but rather what the maximum X-ness sample looks like. In other words, if you’re trying to write a really good essay, you don’t care what the highest likelihood essay from the distribution of human essays looks like, you care about what the essay that maxes out your essay-quality function is.
That simply isn’t how diffusion models work. A diffusion model for essays would sample realistic essays that summarize to some short prompt, so it absolutely does care about high likelihood under the distribution of human essays.
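To spell out what I mean by “how they actually work”, here is a minimal sketch of unconditional diffusion sampling, assuming some trained noise-prediction network `eps_model` (a simplified DDPM-style ancestral sampler, just for illustration):

```python
import torch

@torch.no_grad()
def ddpm_sample(eps_model, shape, betas):
    """Ancestral sampling for a DDPM-style diffusion model.

    eps_model(x_t, t) is assumed to predict the noise that was added at step t.
    Note that no step computes or ascends a "faciness" score: each iteration
    just denoises toward the learned data distribution.
    """
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                    # start from pure noise
    for t in reversed(range(len(betas))):
        eps = eps_model(x, t)                                 # predicted noise
        x = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn(shape)  # stochastic denoising step
    return x
```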
Now, that being said, I do partially agree that A (the face generator function) and B (the human utility function) are somewhat different...
The reason we want a function B that can map world states to utilities is so that we can optimize on that number.
Yes, sort of—or at least that is the fairly default view of how a utility function would be used. But that isn’t the only possibility—one could also solve planning using a diffusion model[1], which would make A and B very similar. The face generator diffusion model combines an unconditional generative model of images with an image-to-text discriminator; the planning diffusion model combines an unconditional generative future world model with a discriminator (the utility function part, although one could also imagine it being more like an image-to-text model).
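To sketch what I mean, here is a rough guided-diffusion-for-planning loop (assuming a trained trajectory-denoising network `eps_model` and a learned value/discriminator `value_fn`; names and details are illustrative, not a specific published implementation):

```python
import torch

def guided_plan_sample(eps_model, value_fn, shape, betas, guide_scale=1.0):
    """Rough sketch of planning via guided diffusion.

    eps_model denoises whole trajectories of future states (the unconditional
    generative world model); value_fn is the learned discriminator/utility part.
    Guidance nudges each denoising step toward higher-value trajectories while
    the diffusion prior keeps them realistic.
    """
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)
    for t in reversed(range(len(betas))):
        # gradient of the value estimate w.r.t. the noisy trajectory
        x_in = x.detach().requires_grad_(True)
        grad = torch.autograd.grad(value_fn(x_in, t).sum(), x_in)[0]
        with torch.no_grad():
            eps = eps_model(x, t)
            mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
            mean = mean + guide_scale * betas[t] * grad       # value guidance term
            x = mean
            if t > 0:
                x = x + torch.sqrt(betas[t]) * torch.randn(shape)
    return x
```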
Therefore, this function has to be reliable enough that putting lots of optimization pressure on it does not break it. This is not the same as claiming that the function itself is an optimizer or anything like that. Making something reliable against lots of optimization pressure is a lot harder than making it reliable in the training distribution.
Applying more optimization pressure results in better outputs, according to the objective. Optimization pressure doesn’t break the objective function (what would that even mean?) and you have to create fairly contrived scenarios where more optimization power results in worse outcomes.
So I’m assuming you mean distribution-shift robustness: we’ll initially train the human utility function component on some samples of possible future worlds, but then as the AI plans farther ahead and time progresses, shit gets weird and the distribution shifts, so that the initial utility function no longer works well.
So let’s apply that to the image diffusion model analogy—it’s equivalent to massively retraining/scaling up the unconditional generative model (which models images or simulates futures), without likewise improving the discriminative model.
The points from Katja’s analogy are:
It’s actually pretty easy and natural to retrain/scale them together, and
It’s also surprisingly easy/effective to scale up and even combine generative models and get better results with the same discriminator
I almost didn’t want to mention this analogy because I’m not sure that planning via diffusion has been tried yet, and it seems like the kind of thing that could work. But it’s also somewhat obvious, so I bet there are probably people trying this now if it hasn’t already been published (I haven’t checked).
I object to your characterization that I am claiming that diffusion models work by maximizing faciness, or that I am confused about how diffusion models work. I am not claiming that unconditional diffusion models trained on a face dataset optimize faciness. In fact I’m confused how you could possibly have arrived at that interpretation of my words, because I am specifically arguing that because diffusion models trained on a face dataset don’t optimize for faciness, they aren’t a fair comparison with the task of doing things that get high utility. The essay example is claiming that if your goal is to write a really good essay, what matters is not your ability to write lots of typical essays, but your ability to tell what a good essay is robustly.
(Unimportant nitpicking: This Person Does Not Exist doesn’t actually use a diffusion model, but rather a StyleGAN trained on a face dataset.)
You’re also eliding the difference between training an unconditional diffusion model on a face dataset, and training an unconditional diffusion model on a general image dataset and then doing classifier-based guidance. I’ve been talking about unconditional models trained on a face dataset, which do not optimize for faciness, whereas doing classifier-based guidance changes the setup. I don’t think this difference is crucial, and my point can be made with either, so I will talk using your setup instead.
The setup you describe in the linked comment does in fact put optimization pressure on faciness, regularized by distance from the prior. Note that when I say “optimization pressure” I don’t necessarily mean literally getting the sample that maxes out the objective. In the essay example, this would be like doing RLHF for essay quality with a KL penalty to stay close to the text distribution. You are correct that this regularization helps keep samples on the manifold of realistic images and that removing it results in terrible nightmare images, and this applies directly to the essay example as well.
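Concretely, the KL-penalized objective I have in mind looks like this (toy sketch; `logits_pi`/`logits_ref` are assumed per-token logits of the tuned model and a frozen reference model, and `reward` is the learned essay-quality score):

```python
import torch

def kl_regularized_objective(logits_pi, logits_ref, reward, beta=0.1):
    """Optimize the reward, but pay a price for drifting off the text prior.

    reward: scalar score from the essay-quality model (the proxy objective)
    beta:   strength of the typicality/KL penalty toward the reference model
    """
    logp_pi = torch.log_softmax(logits_pi, dim=-1)
    logp_ref = torch.log_softmax(logits_ref, dim=-1)
    kl = (logp_pi.exp() * (logp_pi - logp_ref)).sum(dim=-1).mean()
    return reward - beta * kl  # maximize reward while staying near the prior
```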
However, the core problem with this approach is that the regularization works by trading off quality for typicality. In the face case this is mostly fine, because faces are pretty typical of the original distribution anyway, but I would make the concrete prediction that if you tried to get faces via classifier-based guidance out of a diffusion model trained specifically on all images except those containing faces, it would be really fiddly or impossible to get good-quality faces out of it that aren’t weird and nightmarish.

It seems possible that you are talking past me/Nate in that you have in mind that such regularization isn’t a big problem to put on our AGI, that mild optimization is a good thing because we don’t want really weird worlds, etc. I believe this is fatally flawed for planning, partly because it means we can’t really achieve world states that are very weird from the current perspective (and I claim most good futures also seem weird from the current perspective), and also because, due to imperfect world modelling, the actual domain you end up regularizing over is the domain of plans, which means you can’t do things that are too different from things that have been done before. I’m not going to argue this out, because I think the following is actually a much larger crux, and until we agree on it, arguing over my previous claim will be very difficult:
Applying more optimization pressure results in better outputs, according to the objective. Optimization pressure doesn’t break the objective function (what would that even mean?) and you have to create fairly contrived scenarios where more optimization power results in worse outcomes.
When I say “breaks” the objective, I mean reward hacking/reward gaming/goodharting it. I’m surprised that this wasn’t obvious. To me, most/all of alignment difficulty falls out of extreme optimization power being aimed at objectives that aren’t actually what we want. I think that this could be a major crux underlying everything else.
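To illustrate what I mean by the objective “breaking” under optimization pressure, here is a toy numeric example (just an illustration of regressional Goodhart with heavy-tailed proxy error, not a claim about any particular system):

```python
import numpy as np

rng = np.random.default_rng(0)

# The proxy score is the true value plus a heavy-tailed modelling error.
true_value = rng.normal(0.0, 1.0, size=1_000_000)
proxy = true_value + rng.standard_cauchy(size=1_000_000)

# Selecting harder on the proxy (larger n) keeps driving the proxy score up,
# but the selected item's *true* value stops improving: the selection is
# increasingly won by the error term, not by genuine quality.
for n in (10, 1_000, 1_000_000):
    idx = np.argmax(proxy[:n])
    print(f"n={n:>9}  proxy={proxy[idx]:9.2f}  true={true_value[idx]:5.2f}")
```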
(It may also be relevant that at least from my perspective, notwithstanding anything Eliezer may or may not have said, the “learning human values is hard” argument primarily applies to argue why human values won’t be simple/natural in cases where simplicity/naturalness determine what is easier to learn. I have no doubt that a sufficiently powerful AGI could figure out our values if it wanted to, the hard part is making it want to do so. I think Eliezer may be particularly more pessimistic about neural networks’ robustness.)
Nonetheless, let me lay out some (non-exhaustive) concrete reasons why I expect just scaling up the discriminator and its training data to not work.
Obviously, when we optimize really hard on our learned discriminator, we get the out of distribution stuff, as you agree. But let’s just suppose for the moment that we completely abandon all competitiveness concerns and get rid of the learned discriminator entirely and replace it with the ground truth, an army of perfect human labellers. I claim that optimizing any neural network for achieving world states that these labellers find good doesn’t just lead to extremely bad outcomes in unlikely contrived scenarios, but rather happens by default. Even if you think the following can be avoided using diffusion planning / typicality regularization / etc, I still think it is necessary to first agree that this comes up when you don’t do that regularization, and only then discuss whether it still comes up with regularization.
Telling whether a world state is good is nontrivial. You can be easily tricked into thinking a world state is good when it isn’t. If you ask the AI to go do something really difficult, you need to make it at least as hard to trick you with a Potemkin village as the task you want it to do.
Telling whether a plan leads to a good world state is nontrivial. You don’t have a perfect world model. You can’t tell very reliably whether a proposed plan leads to good outcomes.
First, a reply to your interpretations of my previous words:
I am not claiming that unconditional diffusion models trained on a face dataset optimize faciness. In fact I’m confused how you could possibly have arrived at that interpretation of my words, because I am specifically arguing that because diffusion models trained on a face dataset don’t optimize for faciness, they aren’t a fair comparison with the task of doing things that get high utility. The essay example is claiming that if your goal is to write a really good essay, what matters is not your ability to write lots of typical essays, but your ability to tell what a good essay is robustly.
I hope we agree that a discriminator which is trained only to recognize good essays robustly probably does not contain enough information to generate good essays, for the same reasons that an image discriminator does not contain enough information to generate good images—because the discriminator only learns the boundaries of words/categories over images, not the more complex embedded distribution of realistic images.
Optimizing only for faciness via a discriminator does not work well—that’s the old DeepDream approach. Optimizing only for “good essayness” probably does not work well either. These approaches do not actually get high utility.
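For concreteness, the “optimize only for the score” recipe I’m calling the old DeepDream-style approach looks roughly like this (toy sketch; `classifier` is assumed to return a scalar “faciness” or “essay-quality” score):

```python
import torch

def optimize_for_score(classifier, shape, steps=200, lr=0.1):
    """Naive score maximization: gradient ascent on a discriminator, starting
    from noise. With no generative prior in the loop, this typically lands far
    off the data manifold (the "nightmare image" regime)."""
    x = torch.randn(shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -classifier(x)   # ascend the score
        loss.backward()
        opt.step()
    return x.detach()
```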
So when you say “I am specifically arguing that because diffusion models trained on a face dataset don’t optimize for faciness, they aren’t a fair comparison with the task of doing things that get high utility”, that just seems confused to me, because diffusion models do get high utility, and not via optimizing just for faciness (which results in low utility).
When you earlier said:
In other words, if you’re trying to write a really good essay, you don’t care what the highest likelihood essay from the distribution of human essays looks like, you care about what the essay that maxes out your essay-quality function is.
The obvious interpretation there still seems to be optimizing only for the discriminator objective—and I’m surprised you are surprised I interpreted it that way, especially when I replied that the only way to actually get a good essay is to sample from the distribution of essays conditioned on goodness—i.e. the distribution of good essays.
Anyway, here you are making a somewhat different point:
The essay example is claiming that if your goal is to write a really good essay, what matters is not your ability to write lots of typical essays, but your ability to tell what a good essay is robustly.
I still think this is not quite right, in that a diffusion model works by combining the ability to write typical essays with a discriminator to condition on good essays, such that both abilities matter. But I see that your point is basically “the discriminator or utility function is the hard part for AGI”, so let’s move on to the more cruxy part.
The Crux?
When I say “breaks” the objective, I mean reward hacking/reward gaming/goodharting it. I’m surprised that this wasn’t obvious. To me, most/all of alignment difficulty falls out of extreme optimization power being aimed at objectives that aren’t actually what we want.
Ok, so part of the problem here is that we may be assuming different models of AGI. I am assuming a more brain-like pure ANN, which uses fully learned planning, more like a diffusion planning model (which is closer to what the brain probably uses), rather than the older, more commonly assumed approach of combining a learned world model and utility function with some explicit planning algorithm like MCTS or whatever.
So there are several different optimization layers that can be scaled:
The agent optimizing the world (can scale up planning horizon, etc)
Optimizing/training the learned world/action/planning model(s)
Optimizing/training the learned discriminator/utility model
You can scale these independently but only within limits, and it probably doesn’t make much sense to differentially scale them too far.
But let’s just suppose for the moment that we completely abandon all competitiveness concerns and get rid of the learned discriminator entirely and replace it with the ground truth, an army of perfect human labellers.
I really wouldn’t call that the ground truth. The ground truth would be brain-sims (which is part of the rationale for brain-like AGI) combined with a complete, detailed understanding of the brain, and especially of its utility/planning system equivalents. That being said, I am probably more optimistic about the “army of perfect human labellers” approach.
Telling whether a world state is good is nontrivial. You can be easily tricked into thinking a world state is good when it isn’t. If you ask the AI to go do something really difficult, you need to make it at least as hard to trick you with a Potemkin village as the task you want it to do.
Why is the AI generating Potemkin villages? Deceptive alignment? I’m assuming the use of proper sandbox sims to prevent deception. But I’m also independently optimistic about simpler, more automatable altruistic utility functions, like maximization of human empowerment.
Telling whether a plan leads to a good world state is nontrivial. You don’t have a perfect world model. You can’t tell very reliably whether a proposed plan leads to good outcomes.
I don’t see why imperfect planning is more likely to lead to bad rather than good outcomes, all else being equal, and regardless you don’t need anything near a perfect world model to match human intelligence. Furthermore the assumption that the world model isn’t good enough to be useful for utility evaluations contradicts the assumption of superintelligence.
I believe this is fatally flawed for planning, partly because it means we can’t really achieve world states that are very weird from the current perspective (and I claim most good futures also seem weird from the current perspective), and also because, due to imperfect world modelling, the actual domain you end up regularizing over is the domain of plans, which means you can’t do things that are too different from things that have been done before.
The world/action/planning model does need to be retrained on its own rollouts, which will cause it to eventually learn to do things that are different and novel. Humans don’t seem to have much difficulty planning out weird future world states.
Wouldn’t a better analogy be A: noise to faces judged as realistic and B: noise to plans judged to have good consequences?
As for whether B breaks under competitive pressure: does A break under competitive pressure? B does introduce safe exploration concerns not relevant to A, but the answer for A seems like a clear “no” to me.
The claim that every increase in regularisation makes performance worse is extraordinary, given everything I know about machine learning.
FYI: Planning with diffusion is being tried and seemingly works.