I object to your characterization that I am claiming that diffusion models work by maximizing faciness, or that I am confused about how diffusion models work. I am not claiming that unconditional diffusion models trained on a face dataset optimize faciness. In fact I’m confused how you could possibly have arrived at that interpretation of my words, because I am specifically arguing that because diffusion models trained on a face dataset don’t optimize for faciness, they aren’t a fair comparison with the task of doing things that get high utility. The essay example is claiming that if your goal is to write a really good essay, what matters is not your ability to write lots of typical essays, but your ability to tell what a good essay is robustly.
(Unimportant nitpicking: This Person Does Not Exist doesn’t actually use a diffusion model, but rather a StyleGAN trained on a face dataset.)
You’re also eliding the difference between training an unconditional diffusion model on a face dataset and training an unconditional diffusion model on a general image dataset and then doing classifier-based guidance. I’ve been talking about unconditional models on a face dataset, which do not optimize for faciness; doing classifier-based guidance changes the setup. I don’t think this difference is crucial, and my point can be made with either, so I will use your setup instead.
The setup you describe in the linked comment does in fact put optimization pressure on faciness, regularized by distance from the prior. Note that when I say “optimization pressure” I don’t necessarily mean literally getting the sample that maxes out the objective. In the essay example, this would be like doing RLHF for essay quality with a KL penalty to stay close to the text distribution. You are correct that this regularization helps keep samples on the manifold of realistic images and that removing it results in terrible nightmare images, and this applies directly to the essay example as well.
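For concreteness, here is one standard way to formalize that shared structure (my gloss, not something stated in either comment): both classifier guidance and KL-penalized RLHF target the prior distribution tilted by the objective.

```latex
% KL-penalized objective (the RLHF-with-KL-penalty analogy), with prior p and reward r:
\max_{\pi} \; \mathbb{E}_{x \sim \pi}[r(x)] \;-\; \beta\,\mathrm{KL}\!\left(\pi \,\|\, p\right)
\quad\Longrightarrow\quad
\pi^{*}(x) \;\propto\; p(x)\,\exp\!\left(r(x)/\beta\right)

% Classifier guidance with scale s similarly targets the tilted distribution:
p_{s}(x) \;\propto\; p(x)\,p(\text{face} \mid x)^{s}
```

On this reading, turning up the guidance scale (or turning down the KL coefficient) buys more of the objective at the cost of typicality under the prior, which is exactly the trade-off discussed next; and if the prior itself assigns almost no mass to faces, no moderate guidance scale recovers good ones.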
However, the core problem with this approach is that the regularization works precisely by trading off quality for typicality. In the face case this is mostly fine, because faces are pretty typical of the original distribution anyway; but I would make the concrete prediction that if you tried to get faces via classifier-based guidance out of a diffusion model trained specifically on all images except those containing faces, it would be really fiddly or impossible to get good-quality faces that aren’t weird and nightmarish out of it. It seems possible that you are talking past me/Nate in that you have in mind that such regularization isn’t a big problem to put on our AGI, mild optimization is a good thing because we don’t want really weird worlds, etc. I believe this is fatally flawed for planning, partly because it means we can’t really achieve world states that are very weird from the current perspective (and I claim most good futures also seem weird from the current perspective), and also because, due to imperfect world modelling, the actual domain you end up regularizing over is the domain of plans, which means you can’t do things that are too different from things that have been done before. I’m not going to argue this out, because I think the following is actually a much larger crux, and until we agree on it, arguing over my previous claim will be very difficult:
> Applying more optimization pressure results in better outputs, according to the objective. Optimization pressure doesn’t break the objective function (what would that even mean?) and you have to create fairly contrived scenarios where more optimization power results in worse outcomes.
When I say “breaks” the objective, I mean reward hacking/reward gaming/goodharting it. I’m surprised that this wasn’t obvious. To me, most/all of alignment difficulty falls out of extreme optimization power being aimed at objectives that aren’t actually what we want. I think that this could be a major crux underlying everything else.
(It may also be relevant that, at least from my perspective, and notwithstanding anything Eliezer may or may not have said, the “learning human values is hard” argument is primarily an argument that human values won’t be simple/natural in the cases where simplicity/naturalness determine what is easier to learn. I have no doubt that a sufficiently powerful AGI could figure out our values if it wanted to; the hard part is making it want to do so. I think Eliezer may be more pessimistic about neural networks’ robustness in particular.)
Nonetheless, let me lay out some (non-exhaustive) concrete reasons why I expect just scaling up the discriminator and its training data to not work.
Obviously, when we optimize really hard on our learned discriminator, we get the out-of-distribution stuff, as you agree. But let’s just suppose for the moment that we completely abandon all competitiveness concerns and get rid of the learned discriminator entirely and replace it with the ground truth, an army of perfect human labellers. I claim that optimizing any neural network for achieving world states that these labellers find good leads to extremely bad outcomes not just in unlikely contrived scenarios, but by default. Even if you think the following can be avoided using diffusion planning / typicality regularization / etc., I still think it is necessary to first agree that this comes up when you don’t do that regularization, and only then discuss whether it still comes up with regularization.
Telling whether a world state is good is nontrivial. You can be easily tricked into thinking a world state is good when it isn’t. If you ask the AI to do something really difficult, you need to make it at least as hard to trick you with a Potemkin village as it is to do the task you actually want done.
Telling whether a plan leads to a good world state is nontrivial. You don’t have a perfect world model. You can’t tell very reliably whether a proposed plan leads to good outcomes.
Interpretations

First, a reply to interpretations of previous words:
> I am not claiming that unconditional diffusion models trained on a face dataset optimize faciness. In fact I’m confused how you could possibly have arrived at that interpretation of my words, because I am specifically arguing that because diffusion models trained on a face dataset don’t optimize for faciness, they aren’t a fair comparison with the task of doing things that get high utility. The essay example is claiming that if your goal is to write a really good essay, what matters is not your ability to write lots of typical essays, but your ability to tell what a good essay is robustly.
I hope we agree that a discriminator which is trained only to recognize good essays robustly probably does not contain enough information to generate good essays, for the same reasons that an image discriminator does not contain enough information to generate good images—because the discriminator only learns the boundaries of words/categories over images, not the more complex embedded distribution of realistic images.
Optimizing only for faciness via a discriminator does not work well; that’s the old DeepDream approach. Optimizing only for “good essayness” probably does not work well either. These approaches do not actually get high utility.
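A minimal sketch of what that failure mode looks like in code (purely illustrative; the model and target class below are placeholders, not anyone's actual setup):

```python
# "Optimize the input against a discriminator/classifier only" -- the
# DeepDream-style approach discussed above. Model and class index are
# stand-ins for illustration.
import torch
import torchvision.models as models

model = models.resnet50(weights="IMAGENET1K_V2").eval()
target_class = 0  # hypothetical "face-like" class index

x = torch.rand(1, 3, 224, 224, requires_grad=True)  # start from noise
opt = torch.optim.Adam([x], lr=0.05)

for _ in range(200):
    opt.zero_grad()
    logit = model(x)[0, target_class]
    (-logit).backward()       # ascend the class logit
    opt.step()
    x.data.clamp_(0.0, 1.0)   # keep pixels in a valid range

# Result: the logit climbs, but x drifts off the natural-image manifold into
# high-frequency "nightmare" patterns -- high discriminator score, low actual quality.
```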
So when you say “I am specifically arguing that because diffusion models trained on a face dataset don’t optimize for faciness, they aren’t a fair comparison with the task of doing things that get high utility”, that just seems confused to me, because diffusion models do get high utility, and they do so without optimizing just for faciness (which results in low utility).
When you earlier said:
> In other words, if you’re trying to write a really good essay, you don’t care what the highest likelihood essay from the distribution of human essays looks like, you care about what the essay that maxes out your essay-quality function is.
The obvious interpretation there still seems to be optimizing only for the discriminator objective, and I’m surprised you are surprised that I interpreted it that way, especially when I replied that the only way to actually get a good essay is to sample from the distribution of essays conditioned on goodness, i.e. the distribution of good essays.
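In equation form (my restatement of the same point): sampling “conditioned on goodness” means sampling from the posterior, which keeps the prior over realistic essays in play rather than maximizing the discriminator term alone.

```latex
p(\text{essay} \mid \text{good}) \;\propto\; p(\text{essay}) \; p(\text{good} \mid \text{essay})
```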
Anyway, here you are making a somewhat different point:
> The essay example is claiming that if your goal is to write a really good essay, what matters is not your ability to write lots of typical essays, but your ability to tell what a good essay is robustly.
I still think this is not quite right, in that a diffusion model works by combining the ability to write typical essays with a discriminator to condition on good essays, such that both abilities matter. But I take your point to be basically “the discriminator or utility function is the hard part for AGI”, so let me move on to the more cruxy part.
The Crux?
> When I say “breaks” the objective, I mean reward hacking/reward gaming/goodharting it. I’m surprised that this wasn’t obvious. To me, most/all of alignment difficulty falls out of extreme optimization power being aimed at objectives that aren’t actually what we want.
Ok, so part of the problem here is that we may be assuming different models of AGI. I am assuming a more brain-like pure ANN, which uses fully learned planning, more like a diffusion planning model (which is closer to what the brain probably uses), rather than the older, more commonly assumed approach of combining a learned world model and utility function with some explicit planning algorithm like MCTS or whatever.
So there are several different optimization layers that can be scaled:
- The agent optimizing the world (can scale up planning horizon, etc.)
- Optimizing/training the learned world/action/planning model(s)
- Optimizing/training the learned discriminator/utility model
You can scale these independently but only within limits, and it probably doesn’t make much sense to differentially scale them too far.
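To make those layers concrete, here is a toy sketch of the older “learned world model + learned utility + explicit planner” setup mentioned above (a random-shooting planner rather than MCTS, purely for brevity; all models and names here are illustrative stand-ins, not anyone's actual architecture):

```python
# Toy sketch: learned world model + learned utility model + an explicit planner.
# Each of the three layers listed above corresponds to one knob here.
import numpy as np

rng = np.random.default_rng(0)

def world_model(state, action):
    """Stand-in learned dynamics model: predicts the next state."""
    return state + 0.1 * action + 0.01 * rng.normal(size=state.shape)

def utility_model(state):
    """Stand-in learned discriminator/utility model: scores a state."""
    return -np.sum((state - 1.0) ** 2)  # pretend "good" means near all-ones

def plan(state, horizon=10, n_candidates=256, action_dim=2):
    """Explicit search layer: sample action sequences, roll them out through the
    world model, score the final state with the utility model, return the first
    action of the best sequence. Scaling `horizon` or `n_candidates` scales the
    planner itself, independently of how well either model was trained."""
    best_score, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        actions = rng.normal(size=(horizon, action_dim))
        s = state
        for a in actions:
            s = world_model(s, a)
        score = utility_model(s)
        if score > best_score:
            best_score, best_first_action = score, actions[0]
    return best_first_action

action = plan(np.zeros(2))
```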
> But let’s just suppose for the moment that we completely abandon all competitiveness concerns and get rid of the learned discriminator entirely and replace it with the ground truth, an army of perfect human labellers.
I really wouldn’t call that the ground truth. The ground truth would be brain-sims (which is part of the rationale for brain-like AGI) combined with a complete, detailed understanding of the brain, and especially of its utility/planning-system equivalents. That being said, I am probably more optimistic about the “army of perfect human labellers” approach.
> Telling whether a world state is good is nontrivial. You can be easily tricked into thinking a world state is good when it isn’t. If you ask the AI to do something really difficult, you need to make it at least as hard to trick you with a Potemkin village as it is to do the task you actually want done.
Why is the AI generating Potemkin villages? Deceptive alignment? I’m assuming the use of proper sandbox sims to prevent deception. But I’m also independently optimistic about simpler, more automatable altruistic utility functions, like maximization of human empowerment.
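“Empowerment” here presumably refers to the information-theoretic quantity (my gloss; the comment itself doesn't define the term): roughly, the channel capacity from a human's actions to their future state, i.e. how much control over the future they retain.

```latex
% Empowerment of a state s: maximum mutual information between the next n
% actions and the resulting state, over distributions of action sequences.
\mathcal{E}(s) \;=\; \max_{p(a_{1:n})} \; I\!\left(A_{1:n}\,;\,S_{t+n} \,\middle|\, S_t = s\right)
```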
> Telling whether a plan leads to a good world state is nontrivial. You don’t have a perfect world model. You can’t tell very reliably whether a proposed plan leads to good outcomes.
I don’t see why imperfect planning is more likely to lead to bad rather than good outcomes, all else being equal, and regardless, you don’t need anything near a perfect world model to match human intelligence. Furthermore, the assumption that the world model isn’t good enough to be useful for utility evaluations contradicts the assumption of superintelligence.
> I believe this is fatally flawed for planning, partly because it means we can’t really achieve world states that are very weird from the current perspective (and I claim most good futures also seem weird from the current perspective), and also because, due to imperfect world modelling, the actual domain you end up regularizing over is the domain of plans, which means you can’t do things that are too different from things that have been done before.
The world/action/planning model does need to be retrained on its own rollouts, which will cause it to eventually learn to do things that are different and novel. Humans don’t seem to have much difficulty planning out weird future world states.
The claim that every increase in regularisation makes performance worse is extraordinary, given everything I know about machine learning.