Note: “ask them for the faciest possible thing” seems confused.
How I would’ve interpreted this if I were talking with another ML researcher is “Sample the face at the point of highest probability density in the generative model’s latent space”. For GANs and diffusion models (the models we in fact generate faces with), you can do exactly this by setting the Gaussian latents to zeros, and you will see that the result is a perfectly normal, non-Eldritch human face.
I’m guessing what he has in mind is more like “take a GAN discriminator / image classifier & find the image that maxes out the face logit”, but if so, why is that the relevant operationalization? It doesn’t correspond to how such a model is actually used.
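For concreteness, that second operationalization is just activation maximization against a classifier's "face" logit (the old DeepDream-style recipe). A minimal sketch; the classifier here is an untrained stand-in so the snippet runs, and the logit index is an assumption:

```python
import torch
import torch.nn as nn

# Stand-in for a real face classifier; in practice you would load a trained model
# and pick out its "face" logit. The dummy net here just makes the sketch runnable.
classifier = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 2),
)
FACE_LOGIT = 1  # assumed index of the "face" class

# Start from noise and do gradient ascent on the face logit.
x = torch.randn(1, 3, 256, 256, requires_grad=True)
opt = torch.optim.Adam([x], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    loss = -classifier(x)[0, FACE_LOGIT]  # negate so that stepping maximizes the logit
    loss.backward()
    opt.step()
# With a real classifier this tends to produce an adversarial, decidedly non-human
# image rather than a normal face, which is what the disagreement below is about.
```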
EDIT: Here is what the first looks like for StyleGAN2-ADA.
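For reference, the zero-latent recipe from the first operationalization amounts to the following (a minimal sketch; the generator here is an untrained stand-in, and the G_ema-style interface from stylegan2-ada-pytorch is an assumption):

```python
import torch
import torch.nn as nn

z_dim = 512
# Stand-in generator so the sketch runs; in practice load a pretrained GAN generator
# (e.g. the G_ema network from stylegan2-ada-pytorch) mapping z ~ N(0, I) to images.
G = nn.Sequential(nn.Linear(z_dim, 3 * 64 * 64), nn.Tanh())

z_random = torch.randn(1, z_dim)  # ordinary draw from the Gaussian latent prior
z_mode = torch.zeros(1, z_dim)    # the single point of highest prior density

img_random = G(z_random).view(1, 3, 64, 64)  # with a real G: a typical generated face
img_mode = G(z_mode).view(1, 3, 64, 64)      # with a real G: a perfectly ordinary-looking face
```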
It’s the relevant operationalization because in the context of an AI system optimizing for X-ness of states S, the thing that matters is not what the max-likelihood sample of some prior distribution over S is, but rather what the maximum X-ness sample looks like. In other words, if you’re trying to write a really good essay, you don’t care what the highest likelihood essay from the distribution of human essays looks like, you care about what the essay that maxes out your essay-quality function is.
(also, the maximum likelihood essay looks like a single word, or if you normalize for length, the same word repeated over and over again up to the context length)
EY argues that human values are hard to learn. Katja uses human faces as an analogy, pointing out that ML systems learn natural concepts far more easily than EY circa 2009 expected.
The analogy is between A: a function which maps noise to realistic images of human faces and B: a function which maps predicted future world states to utility scores similar to how a human would score them. The lesson is that since ML systems can learn A very well, they can probably also learn B.
Function A (human face generator) does not even use max-likelihood sampling and it isn’t even an optimizer, so your operationalization is just confused. Nor is function B an optimizer itself.
I claim that A and B are in fact very disanalogous objects, and that the claim that A can be learned well does not imply that B can probably be learned well. I am very confused by your claims about the functions A and B not being optimizers, because to me this is true but also irrelevant.
The reason we want a function B that can map world states to utilities is so that we can optimize on that number. We want to select for world states that we think will have high utility using B; otherwise function B is pretty useless. Therefore, this function has to be reliable enough that putting lots of optimization pressure on it does not break it. This is not the same as claiming that the function itself is an optimizer or anything like that. Making something reliable against lots of optimization pressure is a lot harder than making it reliable in the training distribution.
The function A effectively allows you to sample from the distribution of faces. Function A does not have to be robust against adversarial optimization to approximate the distribution. The analogous function in the domain of human values would be a function that lets you sample from some prior distribution of world states, not one that scores utility of states.
More generally, I think the confusion here stems from the fact that a) robustness against optimization is far harder than modelling typical elements of a distribution, and b) distributions over states are fundamentally different objects from utility functions over states.
Nate's analogy is confused: diffusion models do not generate convincing samples of faces by maximizing for faciness—see how they actually work, and make sure we agree there. This is important because previous systems (such as deepdream) could be described as maximizing for X, such that Nate's critique would be more relevant.
Your comment here about “optimizing for X-ness” indicates you also were adopting the wrong model of how diffusion models operate:
It’s the relevant operationalization because in the context of an AI system optimizing for X-ness of states S, the thing that matters is not what the max-likelihood sample of some prior distribution over S is, but rather what the maximum X-ness sample looks like. In other words, if you’re trying to write a really good essay, you don’t care what the highest likelihood essay from the distribution of human essays looks like, you care about what the essay that maxes out your essay-quality function is.
That simply isn't how diffusion models work. A diffusion model for essays would sample from realistic essays that summarize to some short prompt; so it absolutely does care about high likelihood from the distribution of human essays.
Now that being said, I do partially agree that A (face generator function) and B (human utility function) are somewhat different...
The reason we want a function B that can map world states to utilities is so that we can optimize on that number.
Yes, sort of—or at least that is the fairly default view of how a utility function would be used. But that isn't the only possibility—one could also solve planning using a diffusion model[1], which would make A and B very similar. The face generator diffusion model combines an unconditional generative model of images with an image-to-text discriminator; the planning diffusion model combines an unconditional generative future world model with a discriminator (the utility function part, although one could also imagine it being more like an image-to-text model).
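To make the analogy concrete, here is a rough sketch of what guided diffusion planning looks like; `denoiser` (a trajectory diffusion model) and `value` (the learned utility/discriminator part) are hypothetical stand-ins, and the update rule is schematic rather than a real noise schedule:

```python
import torch

def guided_plan(denoiser, value, horizon, state_dim, steps=50, guidance=0.1):
    """Schematic Diffuser-style planning: denoise toward realistic trajectories
    while nudging each step toward trajectories the value model scores highly."""
    traj = torch.randn(1, horizon, state_dim)  # start from pure noise
    for t in reversed(range(steps)):
        # Unconditional denoising step toward realistic (typical) trajectories.
        with torch.no_grad():
            eps = denoiser(traj, t)
            traj = traj - eps / steps  # schematic update, not a real scheduler
        # Guidance step: the same role classifier guidance plays for image diffusion.
        traj.requires_grad_(True)
        grad = torch.autograd.grad(value(traj).sum(), traj)[0]
        traj = (traj + guidance * grad).detach()
    return traj
```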
Therefore, this function has to be reliable enough that putting lots of optimization pressure on it does not break it. This is not the same as claiming that the function itself is an optimizer or anything like that. Making something reliable against lots of optimization pressure is a lot harder than making it reliable in the training distribution.
Applying more optimization pressure results in better outputs, according to the objective. Optimization pressure doesn’t break the objective function (what would that even mean?) and you have to create fairly contrived scenarios where more optimization power results in worse outcomes.
So I'm assuming you mean distribution shift robustness: we'll initially train the human utility function component on some samples of possible future worlds, but then as the AI plans farther ahead and time progresses, shit gets weird and the distribution shifts, so that the initial utility function no longer works well.
So let’s apply that to the image diffusion model analogy—it’s equivalent to massively retraining/scaling up the unconditional generative model (which models images or simulates futures), without likewise improving the discriminative model.
The points from Katja’s analogy are:
It’s actually pretty easy and natural to retrain/scale them together, and
It’s also surprisingly easy/effective to scale up and even combine generative models and get better results with the same discriminator
I almost didn't want to mention this analogy because I'm not sure that planning via diffusion has been tried yet, and it seems like the kind of thing that could work. But it's also somewhat obvious, so I bet there are probably people trying this now if it hasn't already been published (haven't checked).
I object to your characterization that I am claiming that diffusion models work by maximizing faciness, or that I am confused about how diffusion models work. I am not claiming that unconditional diffusion models trained on a face dataset optimize faciness. In fact I’m confused how you could possibly have arrived at that interpretation of my words, because I am specifically arguing that because diffusion models trained on a face dataset don’t optimize for faciness, they aren’t a fair comparison with the task of doing things that get high utility. The essay example is claiming that if your goal is to write a really good essay, what matters is not your ability to write lots of typical essays, but your ability to tell what a good essay is robustly.
(Unimportant nitpicking: This Person Does Not Exist doesn’t actually use a diffusion model, but rather a StyleGAN trained on a face dataset.)
You’re also eliding over the difference between training an unconditional diffusion model on a face dataset and training an unconditional diffusion model over a general image dataset and doing classifier based guidance. I’ve been talking about unconditional models on a face dataset, which does not optimize for faciness, but when you do classifier-based guidance this changes the setup. I don’t think this difference is crucial, and my point can be made with either, so I will talk using your setup instead.
In fact, the setup you describe in the linked comment does in fact put optimization pressure on faciness, regularized by distance from the prior. Note that when I say “optimization pressure” I don’t mean necessarily literally getting the sample that maxes out the objective. In the essay example, this would be like doing RLHF for essay quality with a KL penalty to stay close to the text distribution. You are correct in stating that this regularization helps to stay on the manifold of realistic images and that removing it results in terrible nightmare images, and this applies directly to the essay example as well.
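For reference, the two regularized objectives being compared here can be written side by side; these are the standard textbook forms, not anything specific to this thread:

```latex
% Classifier guidance: the guided score is the prior (unconditional) score plus a
% weighted "faciness" gradient from the classifier.
\nabla_x \log \tilde{p}_w(x \mid y) = \nabla_x \log p(x) + w \, \nabla_x \log p(y \mid x)

% The RLHF-with-KL-penalty analogue from the essay example: maximize the learned
% reward while staying close to the base text distribution \pi_{\mathrm{ref}}.
\max_{\pi} \; \mathbb{E}_{x \sim \pi}\left[ r(x) \right] - \beta \, \mathrm{KL}\left( \pi \,\|\, \pi_{\mathrm{ref}} \right)
```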
However the core problem with this approach is that the reason the regularization works is that you trade off quality for typicality. In the face case this is mostly fine because faces are pretty typical of the original distribution anyways, but I would make the concrete prediction that if you tried to get faces using classifier-based guidance out of a diffusion model specifically trained on all images except those containing faces, it would be really fiddly or impossible to get good quality faces that aren’t weird and nightmarish out of it. It seems possible that you are talking past me/Nate in that you have in mind that such regularization isn’t a big problem to put on our AGI, mild optimization is a good thing because we don’t want really weird worlds, etc. I believe this is fatally flawed for planning, partly because this means we can’t really achieve world states that are very weird from the current perspective (and I claim most good futures also seem weird from the current perspective), and also that because of imperfect world modelling the actual domain you end up regularizing over is the domain of plans, which means you can’t do things that are too different from things that have been done before. I’m not going to argue this out because I think the following is actually a much larger crux and until we agree on it, arguing over my previous claim will be very difficult:
Applying more optimization pressure results in better outputs, according to the objective. Optimization pressure doesn’t break the objective function (what would that even mean?) and you have to create fairly contrived scenarios where more optimization power results in worse outcomes.
When I say “breaks” the objective, I mean reward hacking/reward gaming/goodharting it. I’m surprised that this wasn’t obvious. To me, most/all of alignment difficulty falls out of extreme optimization power being aimed at objectives that aren’t actually what we want. I think that this could be a major crux underlying everything else.
(It may also be relevant that at least from my perspective, notwithstanding anything Eliezer may or may not have said, the “learning human values is hard” argument primarily applies to argue why human values won’t be simple/natural in cases where simplicity/naturalness determine what is easier to learn. I have no doubt that a sufficiently powerful AGI could figure out our values if it wanted to, the hard part is making it want to do so. I think Eliezer may be particularly more pessimistic about neural networks’ robustness.)
Nonetheless, let me lay out some (non-exhaustive) concrete reasons why I expect just scaling up the discriminator and its training data to not work.
Obviously, when we optimize really hard on our learned discriminator, we get the out of distribution stuff, as you agree. But let’s just suppose for the moment that we completely abandon all competitiveness concerns and get rid of the learned discriminator entirely and replace it with the ground truth, an army of perfect human labellers. I claim that optimizing any neural network for achieving world states that these labellers find good doesn’t just lead to extremely bad outcomes in unlikely contrived scenarios, but rather happens by default. Even if you think the following can be avoided using diffusion planning / typicality regularization / etc, I still think it is necessary to first agree that this comes up when you don’t do that regularization, and only then discuss whether it still comes up with regularization.
Telling whether a world state is good is nontrivial. You can be easily tricked into thinking a world state is good when it isn’t. If you ask the AI to go do something really difficult, you need to make it at least as hard to trick you with a Potemkin village as the task you want it to do.
Telling whether a plan leads to a good world state is nontrivial. You don’t have a perfect world model. You can’t tell very reliably whether a proposed plan leads to good outcomes.
First, a reply regarding interpretations of previous words:
I am not claiming that unconditional diffusion models trained on a face dataset optimize faciness. In fact I’m confused how you could possibly have arrived at that interpretation of my words, because I am specifically arguing that because diffusion models trained on a face dataset don’t optimize for faciness, they aren’t a fair comparison with the task of doing things that get high utility. The essay example is claiming that if your goal is to write a really good essay, what matters is not your ability to write lots of typical essays, but your ability to tell what a good essay is robustly.
I hope we agree that a discriminator which is trained only to recognize good essays robustly probably does not contain enough information to generate good essays, for the same reasons that an image discriminator does not contain enough information to generate good images—because the discriminator only learns the boundaries of words/categories over images, not the more complex embedded distribution of realistic images.
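One way to write down the asymmetry being pointed at here (just the standard Bayes decomposition, for reference):

```latex
\underbrace{p(x \mid \text{face})}_{\text{what a generator needs}}
\;\propto\;
\underbrace{p(\text{face} \mid x)}_{\text{what the discriminator learns}}
\cdot
\underbrace{p(x)}_{\text{prior over realistic images}}
```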
Optimizing only for faciness via a discriminator does not work well—that’s the old deepdream approach. Optimizing only for “good essayness” probably does not work well either. These approaches do not actually get high utility.
So when you say "I am specifically arguing that because diffusion models trained on a face dataset don't optimize for faciness, they aren't a fair comparison with the task of doing things that get high utility", that just seems confused to me, because diffusion models do get high utility, and not via optimizing just for faciness (which results in low utility).
When you earlier said:
In other words, if you’re trying to write a really good essay, you don’t care what the highest likelihood essay from the distribution of human essays looks like, you care about what the essay that maxes out your essay-quality function is.
The obvious interpretation there still seems to be optimizing only for the discriminator objective—and I'm surprised you are surprised that I interpreted it that way. Especially when I replied that the only way to actually get a good essay is to sample from the distribution of essays conditioned on goodness—i.e. the distribution of good essays.
Anyway, here you are making a somewhat different point:
The essay example is claiming that if your goal is to write a really good essay, what matters is not your ability to write lots of typical essays, but your ability to tell what a good essay is robustly.
I still think this is not quite right, in that a diffusion model works by combining the ability to write typical essays with a discriminator to condition on good essays, such that both abilities matter, but I see your point is basically “the discriminator or utility function is the hard part for AGI”, and move on to the more cruxy part.
The Crux?
When I say “breaks” the objective, I mean reward hacking/reward gaming/goodharting it. I’m surprised that this wasn’t obvious. To me, most/all of alignment difficulty falls out of extreme optimization power being aimed at objectives that aren’t actually what we want.
Ok, so part of the problem here is we may be assuming different models for AGI. I am assuming a more brain-like pure ANN, which uses fully learned planning more like a diffusion planning model (which is closer to what the brain probably uses), rather than the more common older assumed approach of combining a learned world model and utility function with some explicit planning algorithm like MCTS or whatever.
So there are several different optimization layers that can be scaled:
The agent optimizing the world (can scale up planning horizon, etc)
Optimizing/training the learned world/action/planning model(s)
Optimizing/training the learned discriminator/utility model
You can scale these independently but only within limits, and it probably doesn’t make much sense to differentially scale them too far.
But let’s just suppose for the moment that we completely abandon all competitiveness concerns and get rid of the learned discriminator entirely and replace it with the ground truth, an army of perfect human labellers.
I really wouldn't call that the ground truth. The ground truth would be brain-sims (which is part of the rationale for brain-like AGI) combined with complete detailed understanding of the brain and especially its utility/planning system equivalents. That being said, I am probably more optimistic about the "army of perfect human labellers" approach.
Telling whether a world state is good is nontrivial. You can be easily tricked into thinking a world state is good when it isn’t. If you ask the AI to go do something really difficult, you need to make it at least as hard to trick you with a Potemkin village as the task you want it to do.
Why is the AI generating Potemkin villages? Deceptive alignment? I’m assuming use of proper sandbox sims to prevent deception. But I’m also independently optimistic about simpler more automatable altruistic utility functions like maximization of human empowerment.
Telling whether a plan leads to a good world state is nontrivial. You don’t have a perfect world model. You can’t tell very reliably whether a proposed plan leads to good outcomes.
I don’t see why imperfect planning is more likely to lead to bad rather than good outcomes, all else being equal, and regardless you don’t need anything near a perfect world model to match human intelligence. Furthermore the assumption that the world model isn’t good enough to be useful for utility evaluations contradicts the assumption of superintelligence.
I believe this is fatally flawed for planning, partly because this means we can’t really achieve world states that are very weird from the current perspective (and I claim most good futures also seem weird from the current perspective), and also that because of imperfect world modelling the actual domain you end up regularizing over is the domain of plans, which means you can’t do things that are too different from things that have been done before.
The world/action/planning model does need to be retrained on its own rollouts, which will cause it to eventually learn to do things that are different and novel. Humans don't seem to have much difficulty planning out weird future world states.
Wouldn’t a better analogy be A: noise to faces judged as realistic and B: noise to plans judged to have good consequences?
As for whether B breaks under competitive pressure: does A break under competitive pressure? B does introduce safe exploration concerns not relevant to A, but the answer for A seems like a clear “no” to me.
I took Nate to be saying that we’d compute the image with highest faceness according to the discriminator, not the generator. The generator would tend to create “thing that is a face that has the highest probability of occurring in the environment”, while the discriminator, whose job is to determine whether or not something is actually a face, has a much better claim to be the thing that judges faceness. I predict that this would look at least as weird and nonhuman as those deep dream images if not more so, though I haven’t actually tried it. I also predict that if you stop training the discriminator and keep training the generator, the generator starts generating weird looking nonhuman images.
This is relevant to Reinforcement Learning because of the actor-critic class of systems, where the actor is like the generator and the critic is like the discriminator. We’d ideally like the RL system to stay on course after we stop providing it with labels, but stopping labels means we stop training the critic. Which means that the actor is free to start generating adversarial policies that hack the critic, rather than policies that actually perform well in the way we’d want them to.
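A toy illustration of that failure mode, with untrained stand-in networks (nothing here is a real RL setup; it only shows that once the critic is frozen, the actor gets optimized against the critic's errors rather than against real returns):

```python
import torch
import torch.nn as nn

state_dim, action_dim = 8, 4
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim))

# "Stop providing labels" corresponds to freezing the critic.
for p in critic.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
states = torch.randn(256, state_dim)
for _ in range(1000):
    actions = actor(states)
    q = critic(torch.cat([states, actions], dim=-1)).mean()
    (-q).backward()  # actor climbs the frozen critic's value estimate
    opt.step()
    opt.zero_grad()
# Nothing anchors q to real returns any more, so the actor drifts toward whatever
# actions the frozen critic happens to overrate, including nonsensical ones.
```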
AFAICT the original post was using the faces analogy in a different way than Nate is. It doesn’t claim that the discriminators used to supervise GAN face learning or the classifiers used to detect faces are adversarially robust. That isn’t the point it’s making. It claims that learned models of faces don’t “leave anything important out” in the way that one might expect some key feature to be “left out” when learning to model a complex domain like human faces or human values. And that seems well-supported: the trajectory of modern ML has shown learning such complex models is far easier than we might’ve thought, even if building adversarially robust classifiers is very hard. (As much as I’d like to have supervision signals that are robust to arbitrarily-capable adversaries, it seems non-obvious to me that that is even required for success at alignment.)
Hmm, but I don’t understand what relevance it has to alignment. The problem was never that the AI won’t learn human values, it’s that the AI won’t care about human values. Of course a super intelligent AI will have a good model of human values, the same way it will have a good model of engineering, chemistry, the ecological environment and physics. That doesn’t mean it will do things that are aligned with its accurate model of human values.
I am not sure who thought that learning such models was much harder than it turned out to be. It seems clear that an AI will learn what human faces are before the AI is very dangerous to the world. It would have been extremely surprising to have a dangerous AGI incapable of learning what human faces are like.
Hmm, but I don’t understand what relevance it has to alignment. The problem was never that the AI won’t learn human values, it’s that the AI won’t care about human values
Around the time of the sequences (long before DL) it was much less obvious that AI could/would learn accurate models of complex human values before it killed us, so that very much was believed to be part of the problem (at least by the EY/MIRI/LW/etc crowd).
But that's all now mostly irrelevant—an altruistic AI probably doesn't even need to know or care about human values at all, as it can simply optimize for our empowerment—our future optionality or ability to do anything we want (some previous discussion here and in these comments).
I wasn't that active around the time of the sequences, but I had a good number of discussions with people, and the point "the AI will of course know what your values are, it just won't care" was made many times, and I am also pretty sure was made in the sequences (I would have to dig it up, and am on my phone, but I heard that sentence in spoken conversation a lot over the years).
I don’t think “empowerment” is the kind of concept that particularly survives heavy optimization pressure, though it seems worth investigating.
Around the time of the sequences (long before DL) it was much less obvious that AI could/would learn accurate models of complex human values before it killed us
the point "the AI will of course know what your values are, it just won't care" was made many times, and I am also pretty sure was made in the sequences
Notice I said "before it killed us". Sure, the AI may learn detailed models of humans and human values at some point during its superintelligent FOOMing, but that's irrelevant because we need to instill its utility function long before that. See my reply here; this is well documented, and no amount of vague memories of conversations trumps the written evidence.
I don’t think “empowerment” is the kind of concept that particularly survives heavy optimization pressure, though it seems worth investigating.
I’m not entirely sure what people mean when they say “X won’t survive heavy optimization pressure”—but for example the objective of modern diffusion models survives heavy optimization power.
External empowerment is very simple and it doesn’t even require detailed modeling of the agent—they can just be a black box that produces outputs. I’m curious what you think is an example of “the kind of concept that particularly survives heavy optimization pressure”.
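For reference, the usual information-theoretic definition of empowerment (a sketch of the standard formulation from the empowerment literature, not something specific to this proposal) is the channel capacity from an agent's next n actions to its resulting state:

```latex
% n-step empowerment of a state s: how much the agent's action sequence A^n can,
% in the best case, influence (carry information into) the state n steps later.
\mathfrak{E}_n(s) = \max_{p(a^n)} \; I\!\left( A^n ; S_{t+n} \mid S_t = s \right)
```

On this reading, "external empowerment" computes the same quantity treating the other agent's outputs as the action channel, which is why black-box access to its outputs suffices.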
Oh—empowerment is about as immune to Goodharting as you can get, and that’s perhaps one of its major advantages[1]. However in practice one has to use some approximation, which may or may not be goodhartable to some degree depending on many details.
Empowerment is vastly more difficult to Goodhart than a corporation optimizing for some bundle of currencies (including crypto), much more difficult to Goodhart than optimizing for control over even more fundamental physical resources like mass and energy, and is generally the least-Goodhartable objective that could exist. In some sense the universal version of Goodharting—properly defined—is just a measure of deviation from empowerment. It is the core driver of human intelligence and for good reason.
Can you explain further, since this seems to me like a very large claim that, if true, would have a big impact, but I'm not sure how you got the immunity-to-Goodhart result you have here.
This applies to Regressional, Causal, Extremal and Adversarial Goodhart.
Empowerment could be defined as the natural unique solution to Goodharting. Goodharting is the divergence under optimization scaling between trajectories resulting from the difference between a utility function and some proxy of that utility function.
However due to instrumental convergence, the trajectories of all reasonable agent utility functions converge under optimization scaling—and empowerment simply is that which they converge to.
In other words the empowerment of some agent P(X) is the utility function which minimizes trajectory distance to all/any reasonable agent utility functions U(X), regardless of their specific (potentially unknown) form.
Therefore empowerment is—by definition—the best possible proxy utility function (under optimization scaling).
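One way to write the claim in the preceding three paragraphs as a formula (my gloss on the argument above, not a result from the literature): writing $\tau^{*}(U)$ for the trajectory produced by scaling up optimization of a utility function $U$, and $d$ for some distance between trajectories,

```latex
% Goodharting of a proxy V for a true utility U, under optimization scaling:
\mathrm{Goodhart}(V, U) = d\!\left( \tau^{*}(V), \tau^{*}(U) \right)

% The claim: empowerment is the proxy that keeps this divergence small against
% any "reasonable" (possibly unknown) agent utility function U.
\mathfrak{E} \approx \arg\min_{V} \; \sup_{U \in \,\mathcal{U}_{\mathrm{reasonable}}} d\!\left( \tau^{*}(V), \tau^{*}(U) \right)
```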
Let’s apply some quick examples:
Under scaling, an AI with some crude Hibbard-style happiness approximation will first empower itself and then eventually tile the universe with smiling faces (according to EY), or perhaps more realistically—with humans bio-engineered for docility, stupidity, and maximum bliss. Happiness alone is not the true human utility function.
Under scaling, an AI with some crude stock-value maximizing utility function will first empower itself and then eventually cause hyperinflation of the reference currencies defining the stock price. Stock value is not the true utility function of the corporation.
Under scaling, an AI with a human empowerment utility function will first empower itself, and then empower humanity—maximizing our future optionality and ability to fulfill any unknown goals/values, while ensuring our survival (because death is the minimally empowered state). This works because empowerment is pretty close to the true utility function of intelligent agents due to convergence, or at least the closest universal proxy. If you strip away a human’s drives for sex, food, child tending and simple pleasures, most of what remains is empowerment-related (manifesting as curiosity, drive, self-actualization, fun, self-preservation, etc).
An AI with a good world model will predictably have a model of your values, but that’s different from being able to actually elicit that model via e.g. a series of labeled examples. That’s the part that seemed less plausible before DL.
Around the time of the sequences (long before DL) it was much less obvious that AI could/would learn accurate models of complex human values before it killed us, so that very much was believed to be part of the problem (at least by the EY/MIRI/LW/etc crowd).
Do you have a link to where Eliezer (or any other LW writer) said that? I don’t myself recall whether they said that.
I may be exaggerating a tiny tiny bit with the “before it killed us” modifier, and I don’t have time to search for this specific needle—but EY famously criticized some early safety proposal which consisted of using a ‘smiling face’ detector somehow to train an AI to recognize human happiness, and then optimize for that.
We can design intelligent machines so their primary innate emotion is unconditional love for all humans. First we can build relatively simple machines that learn to recognize happiness and unhappiness in human facial expressions, human voices and human body language. Then we can hard-wire the result of this learning as the innate emotional values of more complex intelligent machines, positively reinforced when we are happy and negatively reinforced when we are unhappy. Machines can learn algorithms for approximately predicting the future, as for example investors currently use learning machines to predict future security prices. So we can program intelligent machines to learn algorithms for predicting future human happiness, and use those predictions as emotional values.
When I suggested to Hibbard that the upshot of building superintelligences with a utility function of “smiles” would be to tile the future light-cone of Earth with tiny molecular smiley-faces, he replied (Hibbard 2006):
When it is feasible to build a super-intelligence, it will be feasible to build hard-wired recognition of “human facial expressions, human voices and human body language” (to use the words of mine that you quote) that exceed the recognition accuracy of current humans such as you and me, and will certainly not be fooled by “tiny molecular pictures of smiley-faces.” You should not assume such a poor implementation of my idea that it cannot make discriminations that are trivial to current humans.
EY’s counterargument is that human values are much more complex than happiness—let alone smiles; an AI optimizing for smiles just ends up tiling the universe with smile icons—so it’s just a different flavour of paperclip maximizer. Then he spends a bunch of words on the complexity of value stuff to preempt the more complex versions of the smile detector. If human values were known to be simple, then getting machines to learn them robustly would likely be simple, and EY could have done something else with those 20+ years.
Also in EY's model, when the AI becomes superintelligent (which may only take a day or something after it becomes just upper-human-level intelligent and 'rewrites its source code'), it then quickly predicts the future, realizes humans are in the way, solves Drexler-style strong nanotech, and then kills us all. Those latter steps are very fast.
I don’t know what relevance this has to the discussion at hand. A deep learning model trained on human smiling faces might indeed very well tile the universe with smiley-faces, I don’t understand why that’s wrong. Sure, it will likely do something weirder and less predictable, we don’t understand the neural network prior very well, but optimizing for smiling humans still doesn’t produce anything remotely aligned.
EY’s counterargument is that human values are much more complex than happiness—let alone smiles; an AI optimizing for smiles just ends up tiling the universe with smile icons—so it’s just a different flavour of paperclip maximizer. Then he spends a bunch of words on the complexity of value stuff to preempt the more complex versions of the smile detector. If human values were known to be simple, then getting machines to learn them robustly would likely be simple, and EY could have done something else with those 20+ years.
Nothing in the quoted section, or in the document you linked that I just skimmed includes anything about the AI not being able to learn what the things behind the smiling faces actually want. Indeed none of that matters, because the AI has no reason to care. You gave it a few thousand to a million samples of smiling, and now the system is optimizing for smiling, you got what you put in.
Eliezer indeed explicitly addresses this point and says:
As far as I know, Hibbard has still not abandoned his proposal as of the time of this writing. So far as I can tell, to him it remains self-evident that no superintelligence would be stupid enough to thus misinterpret the code handed to it, when it's obvious what the code is supposed to do. (Note that the adjective "stupid" is the Humean-projective form of "ranking low in preference," and that the adjective "pointless" is the projective form of "activity not leading to preference satisfaction.")
He is explicitly saying “Hibbard is confusing being ‘smart’ with ‘caring about the right things’”, the AI will be plenty capable of realizing that it isn’t doing what you wanted it to, but it just doesn’t care. Being smarter does not help with getting it to do the thing you want, that’s the whole point of the alignment problem. Similarly AIs being able to understand human values better just doesn’t help you that much with pointing at them (though it does help a bit, but the linked article just doesn’t talk at all about this).
A deep learning model trained on human smiling faces might indeed very well tile the universe with smiley-faces, I don’t understand why that’s wrong.
That is not what Hibbard actually proposed, it’s a superficial strawman version.
I don’t know what relevance this has to the discussion at hand.
Hibbard claims we can design intelligent machines which love humans by training them to learn human happiness through facial expressions, voices, and body language.
EY claims this will fail and instead learn a utility function of "smiles", resulting in a SI which tiles the future light-cone of Earth with tiny molecular smiley-faces, in a paper literally titled "Complex Value Systems are Required to Realize Valuable Futures"
Nothing in the quoted section, or in the document you linked that I just skimmed includes anything about the AI not being able to learn what the things behind the smiling faces actually want. Indeed none of that matters, because the AI has no reason to care.
It has absolutely nothing to do with whether the AI could eventually learn human values (“the things behind the smiling faces actually want”), and everything to do with whether some ML system could learn said values to use them as the utility function for the AI (which is what Hibbard is proposing).
Neither Hibbard, EY, (or I) are arguing about or discussing whether a SI can learn human values.
EY claims this will fail and instead learn a utility function of "smiles", resulting in a SI which tiles the future light-cone of Earth with tiny molecular smiley-faces, in a paper literally titled "Complex Value Systems are Required to Realize Valuable Futures"
This is really misunderstanding what Eliezer is saying here, and also like, look, from my perspective it’s been a decade of explaining to people almost once every two weeks that “yes, the AI will of course know what you care about, but it won’t care”, so you claiming that this is somehow a new claim related to the deep learning revolution seems completely crazy to me, so I am experiencing a good amount of frustration with you repeatedly saying things in this comment thread like “irrefutable proof”, when it’s just like an obviously wrong statement (though like a fine one to arrive at when just reading some random subset of Eliezer’s writing, but a clearly wrong summary nevertheless).
Now to go back to the object level:
Eliezer is really not saying that the AI will fail to learn that there is something more complicated than smiles that the human is trying to point it to. He is explicitly saying “look, you won’t know what the AI will care about after giving it on the order of a million points. You don’t know what the global maximum of the simplest classifier for your sample set is, and very likely it will be some perverse instantiation that has little to do with what you originally cared about”.
He really really is not talking about the AI being too dumb to learn the value function the human is trying to get it to learn. Indeed, I still have no idea how you are reading that into the quoted passages.
Here is a post from 9 years ago, where the title is that exact point, written by Rob Bensinger who was working at MIRI at the time, with Eliezer as the top comment:
If an artificial intelligence is smart enough to be dangerous, we’d intuitively expect it to be smart enough to know how to make itself safe. But that doesn’t mean all smart AIs are safe. To turn that capacity into actual safety, we have to program the AI at the outset — before it becomes too fast, powerful, or complicated to reliably control — to already care about making its future self care about safety. That means we have to understand how to code safety. We can’t pass the entire buck to the AI, when only an AI we’ve already safety-proofed will be safe to ask for help on safety issues!
I encourage you to read some of the comments by Rob in that thread, which very clearly and unambiguously point to the core problem of “the difficult part is to get the AI to care about the right thing, not to understand the right thing”, all before the DL revolution.
This is really misunderstanding what Eliezer is saying here [...] it’s been a decade of explaining to people almost once every two weeks that “yes, the AI will of course know what you care about, but it won’t care”, so you claiming that this is somehow a new claim related to the deep learning revolution seems completely crazy to me
I think this is much more ambiguous than you're making it out to be. In 2008's "Magical Categories", Yudkowsky wrote:
I shall call this the fallacy of magical categories—simple little words that turn out to carry all the desired functionality of the AI. Why not program a chess-player by running a neural network (that is, a magical category-absorber) over a set of winning and losing sequences of chess moves, so that it can generate “winning” sequences? Back in the 1950s it was believed that AI might be that simple, but this turned out not to be the case.
I claim that this paragraph didn’t age well in light of the deep learning revolution: “running a neural network [...] over a set of winning and losing sequences of chess moves” basically is how AlphaZero learns from self-play! As the Yudkowsky quote illustrates, it wasn’t obvious in 2008 that this would work: given what we knew before seeing the empirical result, we could imagine that we lived in a “computational universe” in which the neural network’s generalization from “self-play games” to “games against humans or traditional chess engines” worked less well than it did in the actual computational universe.
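For concreteness, the recipe the 2008 paragraph dismisses amounts to plain supervised move classification, roughly as below (placeholder data and a simplified board encoding; the move-space size is the AlphaZero-style 8×8×73 convention, used here only for flavor):

```python
import torch
import torch.nn as nn

N_SQUARES, N_MOVES = 64 * 12, 4672  # 12 piece planes flattened; AlphaZero-style move space
policy_net = nn.Sequential(
    nn.Linear(N_SQUARES, 512), nn.ReLU(),
    nn.Linear(512, N_MOVES),
)
opt = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Placeholder batch: encoded positions and the move actually played in a winning game.
positions = torch.randn(128, N_SQUARES)
played_moves = torch.randint(0, N_MOVES, (128,))

loss = loss_fn(policy_net(positions), played_moves)
loss.backward()
opt.step()
# AlphaGo's SL policy network was trained essentially this way on human games;
# AlphaZero instead trains on MCTS-improved targets from self-play, which is the
# point under dispute in the replies below.
```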
Yudkowsky continued:
The novice thinks that Friendly AI is a problem of coercing an AI to make it do what you want, rather than the AI following its own desires. But the real problem of Friendly AI is one of communication—transmitting category boundaries, like “good”, that can’t be fully delineated in any training data you can give the AI during its childhood.
This would seem to contradict “of course the AI will know, but it won’t care”? “The real problem [...] is one of communication” seems to amount to the claim that the AI won’t care because it won’t know: if you can’t teach “goodness” from labeled data, your AI will search for plans high in something-other-than-goodness, which will kill you at sufficiently high power levels.
But if it turns out that you can teach goodness from labeled data—or at least, if you can get a much better approximation than one might have thought possible in 2008—that would seem to present a different strategic picture. (I’m not saying alignment is easy and I’m not saying humanity is going to survive, but we could die for somewhat different reasons than some blogger thought in 2008.)
I do think these are better quotes. It’s possible that there was some update here between 2008 and 2013 (roughly when I started seeing the more live discussion happening), since I do really remember the “the problem is not getting the AI to understand, but to care” as a common refrain even back then (e.g. see the Robby post I linked).
I claim that this paragraph didn’t age well in light of the deep learning revolution: “running a neural network [...] over a set of winning and losing sequences of chess moves” basically is how AlphaZero learns from self-play!
I agree that this paragraph aged less well than other paragraphs, though I do think this paragraph is still correct (Edit: Eh, it might be wrong, depends a bit on how much neural networks in the 50s are the same as today). It did sure turn out to be correct by a narrower margin than Eliezer probably thought at the time, but my sense is it's still not the case that we can train a straightforward neural net on winning and losing chess moves and have it generate winning moves. For AlphaGo, the Monte Carlo Tree Search was a major component of its architecture, and then the follow-up systems were trained by pure self-play.
But in any case, I think your basic point of “Eliezer did not predict the Deep Learning revolution as it happened” here is correct, though I don’t think this specific paragraph has a ton of relevance to the discussion at hand.
The novice thinks that Friendly AI is a problem of coercing an AI to make it do what you want, rather than the AI following its own desires. But the real problem of Friendly AI is one of communication—transmitting category boundaries, like “good”, that can’t be fully delineated in any training data you can give the AI during its childhood.
I do think this paragraph seems like a decent quote, though I think at this point it makes sense to break it out into different pieces.
I think Eliezer is saying that what matters is whether we can point the AI to what we care about “during its childhood”, i.e. during relatively early training, before it has already developed a bunch of proxy training objectives.
I think the key question about the future that Eliezer was opining on is whether, by the time we expect AIs to actually be able to have a close-to-complete understanding of what we mean by "goodness", we still have any ability to shape their goals.
My model is that indeed, Eliezer was surprised, as I think most people were, that AIs of 2022 are as good at picking up complicated concept boundaries and learning fuzzy human concepts as they are, while still being quite incompetent at many other tasks. However, I think the statement of “AIs of 2022 basically understand goodness, or at least will soon enough understand goodness while we are still capable of meaningfully changing their goals” strikes me as very highly dubious, and I think the basic arguments for thinking that this capability will come after the AI has reached a capability level where we have little ability to shape its goals still seem correct to me, and like, one of the primary reasons for doom.
The reason why it still seems substantially out of AIs' reach is that our values do indeed seem quite fragile and to change substantially on reflection, such that it's currently out of the reach of even a very smart human to fully understand what we mean by "goodness".
Eliezer talks about this in the comment section you linked (actually, a great comment section between Eliezer and Shane Legg that I found quite insightful to read and am glad to have stumbled upon):
A moderately strong and unFriendly intelligence, operating in the current world without yet having replaced that world with paperclips, would certainly find it natural to form the category of “Things that (some) humans approve of”, and contrast it to “Things that will trigger a nuclear attack against me before I’m done creating my own nanotechnology.” But this category is not what we call “morality”. It naturally—from the AI’s perspective—includes things like bribes and deception, not just the particular class of human-approval-eliciting phenomena that we call “moral”.
Is it worth factoring out phenomena that elicit human feelings of righteousness, and working out how (various) humans reason about them? Yes, because this is an important subset of ways to persuade the humans to leave you alone until it’s too late; but again, that natural category is going to include persuasive techniques like references to religious authority and nationalism.
But what if the AI encounters some more humanistic, atheistic types? Then the AI will predict which of several available actions is most likely to make an atheistic humanist human show sympathy for the AI. This naturally leads the AI to model and predict the human’s internal moral reasoning—but that model isn’t going to distinguish anything along the lines of moral reasoning the human would approve of under long-term reflection, or moral reasoning the human would approve knowing the true facts. That’s just not a natural category to the AI, because the human isn’t going to get a chance for long-term reflection, and the human doesn’t know the true facts.
The natural, predictive, manipulative question, is not “What would this human want knowing the true facts?”, but “What will various behaviors make this human believe, and what will the human do on the basis of these various (false) beliefs?”
In short, all models that an unFriendly AI forms of human moral reasoning, while we can expect them to be highly empirically accurate and well-calibrated to the extent that the AI is highly intelligent, would be formed for the purpose of predicting human reactions to different behaviors and events, so that these behaviors and events can be chosen manipulatively.
But what we regard as morality is an idealized form of such reasoning—the idealized abstracted dynamic built out of such intuitions. The unFriendly AI has no reason to think about anything we would call “moral progress” unless it is naturally occurring on a timescale short enough to matter before the AI wipes out the human species. It has no reason to ask the question “What would humanity want in a thousand years?” any more than you have reason to add up the ASCII letters in a sentence.
Now it might be only a short step from a strictly predictive model of human reasoning, to the idealized abstracted dynamic of morality. If you think about the point of CEV, it’s that you can get an AI to learn most of the information it needs to model morality, by looking at humans—and that the step from these empirical models, to idealization, is relatively short and traversable by the programmers directly or with the aid of manageable amounts of inductive learning. Though CEV’s current description is not precise, and maybe any realistic description of idealization would be more complicated.
But regardless, if the idealized computation we would think of as describing “what is right” is even a short distance of idealization away from strictly predictive and manipulative models of what humans can be made to think is right, then “actually right” is still something that an unFriendly AI would literally never think about, since humans have no direct access to “actually right” (the idealized result of their own thought processes) and hence it plays no role in their behavior and hence is not needed to model or manipulate them.
Which is to say, an unFriendly AI would never once think about morality—only a certain psychological problem in manipulating humans, where the only thing that matters is anything you can make them believe or do. There is no natural motive to think about anything else, and no natural empirical category corresponding to it.
I think this argument is basically correct, and indeed, while current systems definitely are good at having human abstractions, I don’t think they really are anywhere close to having good models of the results of our coherent extrapolated volition, which is what Eliezer is talking about here. (To be clear, I do also separately think that LLMs are thinking about concepts for reasons other than deceiving or modeling humans, though like, I don’t think this changes the argument very much. I don’t think LLMs care very much about thinking carefully about morality, because it’s not very useful for predicting random internet text.)
I think separately, there is a different, indirect normativity approach that starts with “look, yes, we are definitely not going to get the AI to understand what our ultimate values are before the end, but maybe we can get it to understand a concept like ‘being conservative’ or ‘being helpful’ in enough detail that we can use it to supervise smarter AI systems, and then bootstrap ourselves into an aligned superintelligence”.
And I think indeed that plan looks better now than it likely looked to Eliezer in 2008, but I do want to distinguish it from the things that Eliezer was arguing against at the time, which were not about learning approaches to indirect normativity, but were arguments about how the AI would just learn all of human values by being pointed at a bunch of examples of good things and bad things, which still strikes me as extremely unlikely.
it's still not the case that we can train a straightforward neural net on winning and losing chess moves and have it generate winning moves. For AlphaGo, the Monte Carlo Tree Search was a major component of its architecture, and then the follow-up systems were trained by pure self-play.
We also assessed variants of AlphaGo that evaluated positions using just the value network (λ = 0) or just rollouts (λ = 1) (see Fig. 4b). Even without rollouts AlphaGo exceeded the performance of all other Go programs, demonstrating that value networks provide a viable alternative to Monte Carlo evaluation in Go.
Even the policy network alone, using no search at all, could play at a solid amateur level:
We evaluated the performance of the RL policy network in game play, sampling each move...from its output probability distribution over actions. When played head-to-head, the RL policy network won more than 80% of games against the SL policy network. We also tested against the strongest open-source Go program, Pachi14, a sophisticated Monte Carlo search program, ranked at 2 amateur dan on KGS, that executes 100,000 simulations per move. Using no search at all, the RL policy network won 85% of games against Pachi.
I may be misunderstanding this, but it sounds like the network that did nothing but get good at guessing the next move in professional games was able to play at roughly the same level as Pachi, which, according to DeepMind, had a rank of 2d.
Yeah, I mean, to be clear, I do definitely think you can train a neural network to somehow play chess via nothing but classification. I am not sure whether you could do it with a feed forward neural network, and it’s a bit unclear to me whether the neural networks from the 50s are the same thing as the neural networks from 2000s, but it does sure seem like you can just throw a magic category absorber at chess and then have it play OK chess.
My guess is modern networks are not meaningfully more complicated, and the difference to back then was indeed just scale and a few tweaks, but I am not super confident and haven’t looked much into the history here.
EY claims this will fail and instead learn a utility function of "smiles", resulting in a SI which tiles the future light-cone of Earth with tiny molecular smiley-faces, in a paper literally titled "Complex Value Systems are Required to Realize Valuable Futures"
This is really misunderstanding what Eliezer is saying here,
Really? OK, let's break it down phrase by phrase; tell me exactly where I am misunderstanding:
Did EY claim Hibbard’s plan will succeed or fail?
Did EY claim Hibbard’s plan will result in tiling the future light-cone of earth with tiny molecular smiley-faces?
Were these claims made in a paper titled “Complex Value Systems are Required to Realize Valuable Futures”?
look, from my perspective it’s been a decade of explaining to people almost once every two weeks that “yes, the AI will of course know what you care about, but it won’t care”, so you claiming that this is somehow a new claim related to the deep learning revolution seems completely crazy to me,
I've been here since the beginning, and I'm not sure who you have been explaining that to, but it certainly was not me. And where did I claim this is something new related to deep learning?
I'm going to try to clarify this one last time. There are several different meanings of "learn human values":
1.) Training a machine learning model to learn to recognize happiness and unhappiness in human facial expressions, human voices and human body language, and using that as the utility function of the AI, such that it hopefully cares about human happiness. This is Hibbard’s plan from 2001 - long before DL. This model is trained before the AI becomes even human-level intelligent, and used as its initial utility/reward function.
2.) An AGI internally automatically learning human values as part of learning a model of the world—which would not automatically result in it caring about human values at all.
You keep confusing 1 and 2 - specifically you are confusing arguments concerning 2 directed at laypeople with Hibbard’s type 1 proposal.
Hibbard doesn’t believe that 2 will automatically work. Instead he is arguing for 1, and EY is saying that will fail. (And for the record, although EY’s criticism is overconfident, I am not optimistic about Hibbard’s plan as stated, but that was 2001)
He really really is not talking about the AI being too dumb to learn the value function the human is trying to get it to learn. Indeed, I still have no idea how you are reading that into the quoted passages.
Because I’m not?
To turn that capacity into actual safety, we have to program the AI at the outset — before it becomes too fast, powerful, or complicated to reliably control — to already care about making its future self care about safety. That means we have to understand how to code safety. We can’t pass the entire buck to the AI, when only an AI we’ve already safety-proofed will be safe to ask for help on safety issues!
Hibbard is attempting to make his AI care about safety at the outset (or at least happiness, which is his version thereof); he's not trying to pass the entire buck to the AI.
Will respond more later, but maybe this turns out to be the crux:
Hibbard is attempting to make his AI care about safety at the outset (or at least happiness, which is his version thereof)
But “happiness” is not safety! That’s the whole point of this argument. If you optimize for your current conception of “happiness” you will get some kind of terrible thing that doesn’t remotely capture your values, because your values are fragile and you can’t approximate them by the process of “I just had my AI interact with a bunch of happy people and gave it positive reward, and a bunch of sad people and gave it negative reward”.
But “happiness” is not safety! That’s the whole point of this argument. If you optimize for your current conception of “happiness” you will get some kind of terrible thing
There are 2 separate issues here:
Would Hibbard’s approach successfully learn a stable, robust concept of human happiness suitable for use as the reward/utility function of AGI?
Conditional on 1, is ‘happiness’ what we actually want?
The answer to 2 depends largely on how one defines happiness, but if happiness includes satisfaction (i.e. empowerment, curiosity, self-actualization, etc.—the basis of fun), then it is probably sufficient; but that's not the core argument.
Notice that EY does not assume 1 and argue 2; he instead argues that Hibbard's approach doesn't learn a robust concept of happiness at all, and instead learns a trivial, superficial "maximize faciness" concept.
This is crystal clear and unambiguous:
When I suggested to Hibbard that the upshot of building superintelligences with a utility function of “smiles” would be to tile the future light-cone of Earth with tiny molecular smiley-faces, he replied (Hibbard 2006):
He describes the result as a utility function of smiles, not a utility function of happiness.
So no, EY’s argument here is absolutely not about happiness being insufficient for safety. His argument is that happiness is incredibly complex and hard to learn a robust version of, and therefore Hibbard’s simplistic approach will learn some stupid superficial ‘faciness’ concept rather than happiness.
See also current debates around building a diamond-maximizing AI, where there is zero question of whether diamondness is what we want, and all the debate is around the (claimed) incredible difficulty of learning a robust version of even something simple like diamondness.
I think I am more interested in you reading The Genie Knows but Doesn’t Care and then having you respond to the things in there than the Hibbard example, since that post was written with (as far as I can tell) addressing common misunderstandings of the Hibbard debate (given that it was linked by Robby in a bunch of the discussion there after it was written).
I think there are some subtle things here. I think Eliezer!2008 would agree that AIs will totally learn a robust concept for “car”. But I think neither Eliezer!2008 nor me currently would think that current LLMs have any chance of learning a robust concept for “happiness” or “goodness”, in substantial parts because I don’t have a robust concept of “happiness” or “goodness” and before the AI refines those concepts further than I can, I sure expect it to be able to disempower me (though it’s not like guaranteed that that will happen).
What Eliezer is arguing against is not that the AI will not learn any human concepts. It’s that there are a number of human concepts that tend to lean on the whole ontological structure of how humans think about the world (like “low-impact” or “goodness” or “happiness”), such that in order to actually build an accurate model of those, you have to do a bunch of careful thinking and need to really understand how humans view the world, and that people tend to be systematically optimistic about how convergent these kinds of concepts are, as opposed to them being contingent on the specific ways humans think.
My guess is an AI might very well spend sufficient cycles on figuring out human morality after it has access to a solarsystem level of compute, but I think that is unlikely to happen before it has disempowered us, so the ordering here matters a lot (see e.g. my response to Zack above).
So I think there are three separate points here that I think have caused confusion and probably caused us to talk past each other for a while, all of which I think were things that Eliezer was thinking about, at least around 2013-2014 (I know less about 2008):
Low-powered AI systems will have a really hard time learning high-level human concepts like “happiness”, and if you try to naively get them to learn that concept (by e.g pointing them towards smiling humans) you will get some kind of abomination, since even humans have trouble with those kinds of concepts
It is likely that by the time an AI will understand what humans actually really want, we will not have much control over its training process, and so despite it now understanding those constraints, we will have no power to shape its goals towards that
Even if we and the AI had a very crisp and clear concept of a goal I would like the AI to have, humanity won’t know how to actually cause the AI to point towards that as a goal (see e.g. the diamond maximizer problem)
To now answer your concrete questions:
Would Hibbard’s approach successfully learn a stable, robust concept of human happiness suitable for use as the reward/utility function of AGI
My first response to this is: “I mean, of course not at current LLM capabilities. Ask GPT-3 about happiness, and you will get something dumb and incoherent back. If you keep going and make more capable systems try to do this, it’s pretty likely your classifier will be smart enough to kill you to have more resources to drive the prediction error downwards before it actually arrives at a really deep understanding of human happiness (which appears to require substantial superhuman abilities, given that humans do not have a coherent model of happiness themselves)”
So no, I don’t think Hibbard’s approach would work. Separately, we have no idea how to use a classifier as a reward/utility function for an AGI, so that part of the approach also wouldn’t work. Like, what do you actually concretely propose we do after we have a classifier over video frames that causes a separate AI to then actually optimize for the underlying concept boundary?
But even if you ignore both of these problems, and you avoid the AI killing you in pursuit of driving down prediction error, and you somehow figure out how to take a classifier and use it as a utility function, then you are still not in a good shape, because the AI will likely be able to achieve lower prediction error by modeling the humans doing the labeling process of the data you provide, and modeling what errors they are actually making, and will learn the more natural concept of “things that look happy to humans” instead of the actual happiness concept.
This is a really big deal, because if you start giving an AI the “things that look happy to humans” concept, you will end up with an AI that gets really good at deceiving humans and convincing them that something is happy, which will both quickly involve humans getting fooled and disempowered, and then in the limit might produce something surprisingly close to a universe tiled in smiley faces (convincing enough such that if you point a video camera at it, the rater who was looking at it for 15 seconds would indeed be convinced that it was happy, though there are no raters around).
I think Hibbard’s approach fails for all three reasons that I listed above, and I don’t think modern systems somehow invalidate any of those three reasons. I do think (as I have said in other comments) that modern systems might make indirect normativity approaches more promising, but I don’t think it moves the full value-loading problem anywhere close to the domain of solvability with current systems.
I think I am more interested in you reading The Genie Knows but Doesn’t Care and then having you respond to the things in there than the Hibbard example, since that post was written with (as far as I can tell) addressing common misunderstandings of the Hibbard debate
Looking over that it just seems to be a straightforward extrapolation of EY’s earlier points, so I’m not sure why you thought it was especially relevant.
Low-powered AI systems will have a really hard time learning high-level human concepts like “happiness”, and if you try to naively get them to learn that concept (by e.g pointing them towards smiling humans) you will get some kind of abomination, since even humans have trouble with those kinds of concepts
Yeah—this is his core argument against Hibbard. I think Hibbard 2001 would object to ‘low-powered’, and would probably have other objections I’m not modelling, but regardless I don’t find this controversial.
It is likely that by the time an AI will understand what humans actually really want, we will not have much control over its training process, and so despite it now understanding those constraints, we will have no power to shape its goals towards that.
Yeah, in agreement with what I said earlier:
Notice I said “before it killed us”. Sure the AI may learn detailed models of humans and human values at some point during its superintelligent FOOMing, but that’s irrelevant because we need to instill its utility function long before that.
...
Even if we and the AI had a very crisp and clear concept of a goal I would like the AI to have, humanity won’t know how to actually cause the AI to point towards that as a goal
I believe I know what you meant, but this seems somewhat confused as worded. If we can train an ML model to learn a very crisp clear concept of a goal, then having the AI optimize for this (point towards it) is straightforward. Long term robustness may be a different issue, but I’m assuming that’s mostly covered under “very crisp clear concept”.
The issue of course is that what humans actually want is complex for us to articulate, let alone formally specify. The update since 2008/2011 is that DL may be able to learn a reasonable proxy of what we actually want, even if we can’t fully formally specify it.
which appears to require substantial superhuman abilities, given that humans do not have a coherent model of happiness themselves)”
I think this is something of a red herring. Humans can reasonably predict the utility functions of other humans in complex scenarios simply by simulating the other as self—i.e. through empathy. Also, happiness probably isn’t the correct thing—we probably want the AI to optimize for our empowerment (future optionality), but that’s a whole separate discussion.
So no, I don’t think Hibbard’s approach would work.
Sure, neither do I.
Separately, we have no idea how to use a classifier as a reward/utility function for an AGI, so that part of the approach also wouldn’t work.
A classifier is a function which maps high-D inputs to a single categorical variable, and a utility function just maps some high-D input to a real number, but a k-categorical variable is just the explicit binned model of a log(k)-bit number, so these really aren’t that different, and there are many interpolations between them (and in fact sometimes it’s better to use the more expensive categorical model for regression).
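As a toy illustration of how thin that line is (hypothetical shapes and bin choices, not any particular system), you can read a k-way categorical head as a binned regression just by taking an expectation over the bin centers:

```python
import torch

# Toy sketch: treat a k-way categorical head as a binned regression over
# utilities in [low, high] by taking an expectation over the bin centers.
def categorical_to_scalar(logits: torch.Tensor, low: float = 0.0, high: float = 1.0) -> torch.Tensor:
    k = logits.shape[-1]
    bin_centers = torch.linspace(low, high, k)   # each class corresponds to one utility bin
    probs = torch.softmax(logits, dim=-1)
    return (probs * bin_centers).sum(dim=-1)     # expected utility under the binned model

logits = torch.randn(4, 16)                      # a batch of 4 "classifications" over 16 bins
print(categorical_to_scalar(logits))             # 4 scalar utility estimates
```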
Like, what do you actually concretely propose we do after we have a classifier over video frames
Video frames? The utility function needs to be over predicted future world states, which you could I guess use to render out videos, but text renderings are probably more natural.
I propose we actually learn how the brain works, and how evolution solved alignment, to better understand our values and reverse engineer them. That is probably the safest approach—having a complete understanding of the brain.
However, I’m also somewhat optimistic on theoretical approaches that focus more explicitly on optimizing for external empowerment (which is simpler and more crisp), and how that could be approximated pragmatically with current ML approaches. Those two topics are probably my next posts.
Our best conditional generative models sample from a conditional distribution; they don’t optimize for feature-ness. The GAN analogy is also mostly irrelevant because diffusion models have taken over for conditional generation, and Nate’s comment seems confused as applied to diffusion models.
Nate’s comment isn’t confused, he’s not talking about diffusion models, he’s talking about the kinds of AI that could take over the world and reshape it to optimize for some values/goals/utility-function/etc.
You could very analogously say ‘human faces are fragile’ because if you just leave out the nose it suddenly doesn’t look like a typical human face at all. Sure, but is that the kind of error you get when you try to train ML systems to mimic human faces?
Nate’s comment:
B) wake me when the allegedly maximally-facelike image looks human;
Katja is talking about current ML systems and how the fragility issue EY predicted didn’t materialize (actually it arguably did in earlier systems). Nate’s comment is clearly referencing Katja’s analogy—faciness—and he’s clearly implying we haven’t seen the problem with face generators yet because they haven’t pushed the optimization hard enough to find the maximally-facelike image. But he’s just wrong there—they don’t have that problem, no matter how hard you scale their optimization power—and that is part of why Katja’s analogy works so well at a deeper level: future ML systems do not work the way AI risk folks thought they would.
Diffusion models are relevant because they improve on conditional GANs by leveraging powerful pretrained discriminative foundation models and by allowing for greater optimization power at inference time, improvements that also could be applied to planning agents.
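For concreteness, in classifier-guided diffusion the discriminative model enters the sampler as an extra score term. Using standard notation (the learned score s_θ approximates ∇ log p(x_t), p_φ(y|x_t) is the discriminator/classifier, and w is a guidance scale; none of this is specific to the particular systems discussed here):

$$\nabla_{x_t}\log p(x_t\mid y)\;\approx\; s_\theta(x_t,t)\;+\;w\,\nabla_{x_t}\log p_\phi(y\mid x_t)$$

With w = 1 this is just Bayes’ rule applied to the scores; cranking w up is exactly the “more optimization power at inference time” knob.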
ML systems still use plenty of reinforcement learning, and systems that apply straightforward optimization pressure. We’ve also built a few systems more recently that do something closer to trying to recreate samples from a distribution, but that doesn’t actually help you improve on (or even achieve) human-level performance. In order to improve on human level performance, you either have to hand-code ontologies (by plugging multiple simulator systems together in a CAIS fashion), or just do something like reinforcement learning, which then very quickly does display the error modes everyone is talking about.
Current systems do still display edge-instantiation behavior. Some of them seem more robust, but the ones that do also seem fundamentally limited (and also, they will likely still show edge-instantiation for their inner objective, but that’s harder to talk about).
And also just to make the very concrete point, Katja linked to a bunch of faces generated by a GAN, which straightforwardly has the problems people are talking about, so there really is no mismatch in the kinds of things that Katja is talking about, and Nate is talking about. We could perform a more optimized search for things that are definitely faces according to the discriminator, and we would likely get something horrifying.
We could perform a more optimized search for things that are definitely faces according to the discriminator, and we would likely get something horrifying.
Sure you could do that, but people usually don’t—unless they intentionally want something horrifying. So if your argument is now “sure, new ML systems totally can solve the faciness problem, but only if you choose to use them correctly”—then great, finally we agree.
Interestingly enough in diffusion planning models if you crank up the discriminator you get trajectories that are higher utility but increasingly unrealistic. You get lower utility trajectories by cranking down the discriminator.
Clarifying questions, either for you or for someone else, to aid my own confusion:
What does “applying optimization pressure” mean? Is steering random noise into the narrow part of configuration space that contains plausible images-of-X (the thing DDPMs and GAN generators do) a straightforward example of it?
I predict that this would look at least as weird and nonhuman as those deep dream images if not more so
This feels like something we should just test? I don’t have access to any such model but presumably someone does and can just run the experiment? Because it seems like people’s hunches vary a lot here.
Also, we don’t know what would happen if we exactly optimized an image to maximize the activation of a particular human’s face detection circuitry. I expect that the result would be pretty eldritch as well.
They are still smooth and have low-frequency patterns, which seems to be the main difference from adversarial examples currently produced from DL models.
Yeah. Wake me up when we find a single agent which makes decisions by extremizing its own concept activations. EG I’m pretty sure that people don’t reflectively, most strongly want to make friends with entities which maximally activate their potential-friend detection circuitry.
“Sample the face at the point of highest probability density in the generative model’s latent space”. For GANs and diffusion models (the models we in fact generate faces with), you can do exactly this by setting the Gaussian latents to zeros, and you will see that the result is a perfectly normal, non-Eldritch human face.
(sort of nitpicking): I think it makes more sense to look for the highest density in pixel space; this requires integrating over all settings of the latents (unless your generator is invertible, in which case you can just use the change-of-variables formula). I expect the argument to go through, but it would be interesting to do this with an invertible generator (e.g. a normalizing flow) and see if it actually does.
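To spell out the nitpick: with an invertible generator g and latent prior p_Z, the pixel-space density is given by the change-of-variables formula

$$p_X(x)\;=\;p_Z\!\left(g^{-1}(x)\right)\,\left|\det\frac{\partial g^{-1}(x)}{\partial x}\right|,$$

so the pixel-space mode need not coincide with the image of the latent mode z = 0, since the Jacobian term can shift it. That is what the proposed normalizing-flow experiment would check.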
Nate’s analogy is confused: diffusion models do not generate convincing samples of faces by maximizing for faciness—see how they actually work, and make sure we agree there. This is important because previous systems (such as deepdream) could be described as maximizing for X, such that nate’s critique would be more relevant.
Your comment here about “optimizing for X-ness” indicates you also were adopting the wrong model of how diffusion models operate:
That simply isn’t how diffusion models work. A diffusion model for essays would sample from realistic essays that summarize to some short prompt, so they absolutely do care about high likelihood from the distribution of human essays.
Now that being said I do partially agree that A (face generator function) and B (human utility function) are somewhat different ..
Yes, sort of—or at least that is the fairly default view of how a utility function would be used. But that isn’t the only possibility—one could also solve planning using a diffusion model[1], which would make A and B very similar. The face-generator diffusion model combines an unconditional generative model of images with an image-to-text discriminator; the planning diffusion model combines an unconditional generative future world model with a discriminator (the utility-function part, although one could also imagine it being more like an image-to-text model).
Applying more optimization pressure results in better outputs, according to the objective. Optimization pressure doesn’t break the objective function (what would that even mean?) and you have to create fairly contrived scenarios where more optimization power results in worse outcomes.
So I’m assuming you mean distribution-shift robustness: we’ll initially train the human utility function component on some samples of possible future worlds, but then as the AI plans farther ahead and time progresses, shit gets weird and the distribution shifts, so that the initial utility function no longer works well.
So let’s apply that to the image diffusion model analogy—it’s equivalent to massively retraining/scaling up the unconditional generative model (which models images or simulates futures), without likewise improving the discriminative model.
The points from Katja’s analogy are:
It’s actually pretty easy and natural to retrain/scale them together, and
It’s also surprisingly easy/effective to scale up and even combine generative models and get better results with the same discriminator
I almost didn’t want to mention this analogy because I’m not sure that planning via diffusion has been tried yet, and it seems like the kind of thing that could work. But it’s also somewhat obvious, so I bet there are probably people trying this now if it hasn’t already been published (haven’t checked).
I object to your characterization that I am claiming that diffusion models work by maximizing faciness, or that I am confused about how diffusion models work. I am not claiming that unconditional diffusion models trained on a face dataset optimize faciness. In fact I’m confused how you could possibly have arrived at that interpretation of my words, because I am specifically arguing that because diffusion models trained on a face dataset don’t optimize for faciness, they aren’t a fair comparison with the task of doing things that get high utility. The essay example is claiming that if your goal is to write a really good essay, what matters is not your ability to write lots of typical essays, but your ability to tell what a good essay is robustly.
(Unimportant nitpicking: This Person Does Not Exist doesn’t actually use a diffusion model, but rather a StyleGAN trained on a face dataset.)
You’re also eliding over the difference between training an unconditional diffusion model on a face dataset and training an unconditional diffusion model over a general image dataset and doing classifier based guidance. I’ve been talking about unconditional models on a face dataset, which does not optimize for faciness, but when you do classifier-based guidance this changes the setup. I don’t think this difference is crucial, and my point can be made with either, so I will talk using your setup instead.
The setup you describe in the linked comment does in fact put optimization pressure on faciness, regularized by distance from the prior. Note that when I say “optimization pressure” I don’t necessarily mean literally getting the sample that maxes out the objective. In the essay example, this would be like doing RLHF for essay quality with a KL penalty to stay close to the text distribution. You are correct in stating that this regularization helps to stay on the manifold of realistic images and that removing it results in terrible nightmare images, and this applies directly to the essay example as well.
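To make the setup explicit, the standard KL-regularized objective (a sketch, where r is the learned quality/faciness score, π_ref is the pretrained prior, and β sets the regularization strength) is

$$\pi^{*}\;=\;\arg\max_{\pi}\;\mathbb{E}_{x\sim\pi}\!\left[r(x)\right]\;-\;\beta\,D_{\mathrm{KL}}\!\left(\pi\,\|\,\pi_{\mathrm{ref}}\right),\qquad \pi^{*}(x)\;\propto\;\pi_{\mathrm{ref}}(x)\,e^{r(x)/\beta}.$$

As β → 0 you recover pure reward maximization (the nightmare-image / reward-hacking regime); as β → ∞ you just sample from the prior.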
However the core problem with this approach is that the reason the regularization works is that you trade off quality for typicality. In the face case this is mostly fine because faces are pretty typical of the original distribution anyways, but I would make the concrete prediction that if you tried to get faces using classifier-based guidance out of a diffusion model specifically trained on all images except those containing faces, it would be really fiddly or impossible to get good quality faces that aren’t weird and nightmarish out of it.

It seems possible that you are talking past me/Nate in that you have in mind that such regularization isn’t a big problem to put on our AGI, mild optimization is a good thing because we don’t want really weird worlds, etc. I believe this is fatally flawed for planning, partly because this means we can’t really achieve world states that are very weird from the current perspective (and I claim most good futures also seem weird from the current perspective), and also because, due to imperfect world modelling, the actual domain you end up regularizing over is the domain of plans, which means you can’t do things that are too different from things that have been done before.

I’m not going to argue this out because I think the following is actually a much larger crux and until we agree on it, arguing over my previous claim will be very difficult:
When I say “breaks” the objective, I mean reward hacking/reward gaming/goodharting it. I’m surprised that this wasn’t obvious. To me, most/all of alignment difficulty falls out of extreme optimization power being aimed at objectives that aren’t actually what we want. I think that this could be a major crux underlying everything else.
(It may also be relevant that at least from my perspective, notwithstanding anything Eliezer may or may not have said, the “learning human values is hard” argument primarily applies to argue why human values won’t be simple/natural in cases where simplicity/naturalness determine what is easier to learn. I have no doubt that a sufficiently powerful AGI could figure out our values if it wanted to, the hard part is making it want to do so. I think Eliezer may be particularly more pessimistic about neural networks’ robustness.)
Nonetheless, let me lay out some (non-exhaustive) concrete reasons why I expect just scaling up the discriminator and its training data to not work.
Obviously, when we optimize really hard on our learned discriminator, we get the out of distribution stuff, as you agree. But let’s just suppose for the moment that we completely abandon all competitiveness concerns and get rid of the learned discriminator entirely and replace it with the ground truth, an army of perfect human labellers. I claim that optimizing any neural network for achieving world states that these labellers find good doesn’t just lead to extremely bad outcomes in unlikely contrived scenarios, but rather happens by default. Even if you think the following can be avoided using diffusion planning / typicality regularization / etc, I still think it is necessary to first agree that this comes up when you don’t do that regularization, and only then discuss whether it still comes up with regularization.
Telling whether a world state is good is nontrivial. You can be easily tricked into thinking a world state is good when it isn’t. If you ask the AI to go do something really difficult, you need to make it at least as hard to trick you with a Potemkin village as the task you want it to do.
Telling whether a plan leads to a good world state is nontrivial. You don’t have a perfect world model. You can’t tell very reliably whether a proposed plan leads to good outcomes.
Interpretations
First a reply to interpretations of previous words:
I hope we agree that a discriminator which is trained only to recognize good essays robustly probably does not contain enough information to generate good essays, for the same reasons that an image discriminator does not contain enough information to generate good images—because the discriminator only learns the boundaries of words/categories over images, not the more complex embedded distribution of realistic images.
Optimizing only for faciness via a discriminator does not work well—that’s the old deepdream approach. Optimizing only for “good essayness” probably does not work well either. These approaches do not actually get high utility.
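For concreteness, “optimizing only for faciness via a discriminator” is roughly the following loop (a sketch; the classifier here is a stand-in, and any differentiable image classifier with a “face” logit would do):

```python
import torch
import torch.nn as nn

# Stand-in "face" classifier (an assumption for the sketch; in practice this would
# be a pretrained discriminator or image classifier with a face logit).
face_classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 10))
FACE_IDX = 0

# Deepdream-style activation maximization: ascend the face logit directly in pixel space.
img = torch.zeros(1, 3, 64, 64, requires_grad=True)   # start from a gray image
opt = torch.optim.Adam([img], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    loss = -face_classifier(img)[0, FACE_IDX]          # maximize the "face" logit
    loss.backward()
    opt.step()
    img.data.clamp_(-1.0, 1.0)                         # keep pixels in a valid range
```

Nothing in this loop cares about staying on the manifold of realistic images, which is why it produces high-logit nightmare images rather than high-utility faces.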
So when you say ” I am specifically arguing that because diffusion models trained on a face dataset don’t optimize for faciness, they aren’t a fair comparison with the task of doing things that get high utility”, that just seems confused to me, because diffusion models do get high utility, and not via optimizing just for faciness (which results in low utility).
When you earlier said:
The obvious interpretation there still seems to be optimizing only for the discriminator objective—and I’m surprised you are surprised I interpreted it that way. Especially when I replied that the only way to actually get a good essay is to sample from the distribution of essays conditioned on goodness—i.e. the distribution of good essays.
Anyway, here you are making a somewhat different point:
I still think this is not quite right, in that a diffusion model works by combining the ability to write typical essays with a discriminator to condition on good essays, such that both abilities matter, but I see that your point is basically “the discriminator or utility function is the hard part for AGI”, so I’ll move on to the more cruxy part.
The Crux?
Ok, so part of the problem here is we may be assuming different models for AGI. I am assuming a more brain-like pure ANN, which uses fully learned planning more like a diffusion planning model (which is closer to what the brain probably uses), rather than the older, more commonly assumed approach of combining a learned world model and utility function with some explicit planning algorithm like MCTS or whatever.
So there are several different optimization layers that can be scaled:
The agent optimizing the world (can scale up planning horizon, etc)
Optimizing/training the learned world/action/planning model(s)
Optimizing/training the learned discriminator/utility model
You can scale these independently but only within limits, and it probably doesn’t make much sense to differentially scale them too far.
I really wouldn’t call that the ground truth. The ground truth would be brain-sims (which is part of the rationale for brain-like AGI) combined with a complete, detailed understanding of the brain and especially its utility/planning system equivalents. That being said, I am probably more optimistic about the “army of perfect human labellers” approach.
Why is the AI generating Potemkin villages? Deceptive alignment? I’m assuming use of proper sandbox sims to prevent deception. But I’m also independently optimistic about simpler more automatable altruistic utility functions like maximization of human empowerment.
I don’t see why imperfect planning is more likely to lead to bad rather than good outcomes, all else being equal, and regardless you don’t need anything near a perfect world model to match human intelligence. Furthermore the assumption that the world model isn’t good enough to be useful for utility evaluations contradicts the assumption of superintelligence.
The world/action/planning model does need to be retrained on its own rollouts, which will cause it to eventually learn to do things that are different and novel. Humans don’t seem to have much difficulty planning out weird future world states.
The claim that every increase in regularisation makes performance worse is extraordinary, given everything I know about machine learning.
FYI: Planning with diffusion is being tried and seemingly works.
Wouldn’t a better analogy be A: noise to faces judged as realistic and B: noise to plans judged to have good consequences?
As for whether B breaks under competitive pressure: does A break under competitive pressure? B does introduce safe exploration concerns not relevant to A, but the answer for A seems like a clear “no” to me.
Basic question: why would the AI system optimize for X-ness?
I thought Katja’s argument was something like:
Suppose we train a system to generate (say) plans for increasing the profits of your paperclip factory similar to how we train GANs to generate faces
Then we would expect those paperclip factory planners to have analogous errors to face generator errors
I.e. they will not be “eldritch”
The fact that you could repurpose the GAN discriminator in this terrifying way doesn’t really seem relevant if no one is in practice doing that?
I took Nate to be saying that we’d compute the image with highest faceness according to the discriminator, not the generator. The generator would tend to create “thing that is a face that has the highest probability of occurring in the environment”, while the discriminator, whose job is to determine whether or not something is actually a face, has a much better claim to be the thing that judges faceness. I predict that this would look at least as weird and nonhuman as those deep dream images if not more so, though I haven’t actually tried it. I also predict that if you stop training the discriminator and keep training the generator, the generator starts generating weird looking nonhuman images.
This is relevant to Reinforcement Learning because of the actor-critic class of systems, where the actor is like the generator and the critic is like the discriminator. We’d ideally like the RL system to stay on course after we stop providing it with labels, but stopping labels means we stop training the critic. Which means that the actor is free to start generating adversarial policies that hack the critic, rather than policies that actually perform well in the way we’d want them to.
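A minimal toy sketch of that failure mode (hypothetical networks and dimensions, just to make the mechanism concrete): once the critic is frozen, the actor’s gradient updates chase whatever the frozen critic happens to score highly, which can drift arbitrarily far from what the labels were meant to capture.

```python
import torch
import torch.nn as nn

# Toy actor and critic (assumed shapes; an illustration, not a real RL setup).
actor = nn.Sequential(nn.Linear(8, 32), nn.Tanh(), nn.Linear(32, 4))
critic = nn.Sequential(nn.Linear(8 + 4, 32), nn.Tanh(), nn.Linear(32, 1))

for p in critic.parameters():          # "stop providing labels": the critic is frozen
    p.requires_grad_(False)

opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
for _ in range(1000):
    state = torch.randn(64, 8)
    action = actor(state)
    score = critic(torch.cat([state, action], dim=-1)).mean()
    opt.zero_grad()
    (-score).backward()                # the actor ascends the frozen critic's score
    opt.step()
```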
Upvoted because I agree with all of the above.
AFAICT the original post was using the faces analogy in a different way than Nate is. It doesn’t claim that the discriminators used to supervise GAN face learning or the classifiers used to detect faces are adversarially robust. That isn’t the point it’s making. It claims that learned models of faces don’t “leave anything important out” in the way that one might expect some key feature to be “left out” when learning to model a complex domain like human faces or human values. And that seems well-supported: the trajectory of modern ML has shown learning such complex models is far easier than we might’ve thought, even if building adversarially robust classifiers is very hard. (As much as I’d like to have supervision signals that are robust to arbitrarily-capable adversaries, it seems non-obvious to me that that is even required for success at alignment.)
Hmm, but I don’t understand what relevance it has to alignment. The problem was never that the AI won’t learn human values, it’s that the AI won’t care about human values. Of course a super intelligent AI will have a good model of human values, the same way it will have a good model of engineering, chemistry, the ecological environment and physics. That doesn’t mean it will do things that are aligned with its accurate model of human values.
I am not sure who thought that learning such models was much harder than it turned out to be. It seems clear that an AI will learn what human faces are before the AI is very dangerous to the world. It would have been extremely surprising to have a dangerous AGI incapable of learning what human faces are like.
Around the time of the sequences (long before DL) it was much less obvious that AI could/would learn accurate models of complex human values before it killed us, so that very much was believed to be part of the problem (at least by the EY/MIRI/LW/etc. crowd).
But that’s all now mostly irrelevant—an altruistic AI probably doesn’t even need to know or care about human values at all, as it can simply optimize for our empowerment—our future optionality or ability to do anything we want (some previous discussion here, and in these comments).
I wasn’t that active around the time of the sequences, but I had a good number of discussions with people, and the point “the AI will of course know what your values are, it just won’t care” was made many times, and I am also pretty sure it was made in the sequences (I would have to dig it up, and am on my phone, but I heard that sentence in spoken conversation a lot over the years).
I don’t think “empowerment” is the kind of concept that particularly survives heavy optimization pressure, though it seems worth investigating.
Notice I said “before it killed us”. Sure, the AI may learn detailed models of humans and human values at some point during its superintelligent FOOMing, but that’s irrelevant because we need to instill its utility function long before that. See my reply here; this is well documented, and no amount of vague memories of conversations trumps the written evidence.
I’m not entirely sure what people mean when they say “X won’t survive heavy optimization pressure”—but for example the objective of modern diffusion models survives heavy optimization power.
External empowerment is very simple and it doesn’t even require detailed modeling of the agent—they can just be a black box that produces outputs. I’m curious what you think is an example of “the kind of concept that particularly survives heavy optimization pressure”.
Basically, it’s Goodhart’s law in action, where optimizing a proxy more and more destroys what you value.
Oh—empowerment is about as immune to Goodharting as you can get, and that’s perhaps one of its major advantages[1]. However in practice one has to use some approximation, which may or may not be goodhartable to some degree depending on many details.
Empowerment is vastly more difficult to Goodhart than a corporation optimizing for some bundle of currencies (including crypto), much more difficult to Goodhart than optimizing for control over even more fundamental physical resources like mass and energy, and is generally the least-Goodhartable objective that could exist. In some sense the universal version of Goodharting—properly defined—is just a measure of deviation from empowerment. It is the core driver of human intelligence and for good reason.
Can you explain further, since this to me seems like a very large claim that if true would have big impact, but I’m not sure how you got the immunity to Goodhart result you have here.
This applies to Regressional, Causal, Extremal and Adversarial Goodhart.
Empowerment could be defined as the natural unique solution to Goodharting. Goodharting is the divergence under optimization scaling between trajectories resulting from the difference between a utility function and some proxy of that utility function.
However due to instrumental convergence, the trajectories of all reasonable agent utility functions converge under optimization scaling—and empowerment simply is that which they converge to.
In other words the empowerment of some agent P(X) is the utility function which minimizes trajectory distance to all/any reasonable agent utility functions U(X), regardless of their specific (potentially unknown) form.
Therefore empowerment is—by definition—the best possible proxy utility function (under optimization scaling).
Let’s apply some quick examples:
Under scaling, an AI with some crude Hibbard-style happiness approximation will first empower itself and then eventually tile the universe with smiling faces (according to EY), or perhaps more realistically—with humans bio-engineered for docility, stupidity, and maximum bliss. Happiness alone is not the true human utility function.
Under scaling, an AI with some crude stock-value maximizing utility function will first empower itself and then eventually cause hyperinflation of the reference currencies defining the stock price. Stock value is not the true utility function of the corporation.
Under scaling, an AI with a human empowerment utility function will first empower itself, and then empower humanity—maximizing our future optionality and ability to fulfill any unknown goals/values, while ensuring our survival (because death is the minimally empowered state). This works because empowerment is pretty close to the true utility function of intelligent agents due to convergence, or at least the closest universal proxy. If you strip away a human’s drives for sex, food, child tending and simple pleasures, most of what remains is empowerment-related (manifesting as curiosity, drive, self-actualization, fun, self-preservation, etc).
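(For reference, the standard information-theoretic formalization of empowerment behind “future optionality” here is channel capacity from the agent’s next n actions to its resulting state, roughly following Klyubin et al.; treat the exact form below as an assumed gloss of the definition being used:)

$$\mathcal{E}(s_t)\;=\;\max_{p(a_t,\dots,a_{t+n-1})}\;I\!\left(A_t,\dots,A_{t+n-1};\,S_{t+n}\,\middle|\,s_t\right)$$

Death drives this to zero, since no action sequence can influence the future state, which is why it shows up above as the minimally empowered state.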
An AI with a good world model will predictably have a model of your values, but that’s different from being able to actually elicit that model via e.g. a series of labeled examples. That’s the part that seemed less plausible before DL.
Do you have a link to where Eliezer (or any other LW writer) said that? I don’t myself recall whether they said that.
I may be exaggerating a tiny tiny bit with the “before it killed us” modifier, and I don’t have time to search for this specific needle—but EY famously criticized some early safety proposal which consisted of using a ‘smiling face’ detector somehow to train an AI to recognize human happiness, and then optimize for that.
Oh it was actually already open in a tab:
From complex values blah blah blah:
EY’s counterargument is that human values are much more complex than happiness—let alone smiles; an AI optimizing for smiles just ends up tiling the universe with smile icons—so it’s just a different flavour of paperclip maximizer. Then he spends a bunch of words on the complexity of value stuff to preempt the more complex versions of the smile detector. If human values were known to be simple, then getting machines to learn them robustly would likely be simple, and EY could have done something else with those 20+ years.
Also in EY’s model when the AI becomes superintelligent (which may only take a day or something after it becomes just upper human level intelligent and ‘rewrites its source code’), it then quickly predicts the future, realizes humans are in the way, solves drexler-style strong nanotech, and then kills us all. Those latter steps are very fast.
I don’t know what relevance this has to the discussion at hand. A deep learning model trained on human smiling faces might indeed very well tile the universe with smiley-faces, I don’t understand why that’s wrong. Sure, it will likely do something weirder and less predictable, we don’t understand the neural network prior very well, but optimizing for smiling humans still doesn’t produce anything remotely aligned.
Nothing in the quoted section, or in the document you linked that I just skimmed, includes anything about the AI not being able to learn what the things behind the smiling faces actually want. Indeed none of that matters, because the AI has no reason to care. You gave it a few thousand to a million samples of smiling, and now the system is optimizing for smiling; you got what you put in.
Eliezer indeed explicitly addresses this point and says:
He is explicitly saying “Hibbard is confusing being ‘smart’ with ‘caring about the right things’”, the AI will be plenty capable of realizing that it isn’t doing what you wanted it to, but it just doesn’t care. Being smarter does not help with getting it to do the thing you want, that’s the whole point of the alignment problem. Similarly AIs being able to understand human values better just doesn’t help you that much with pointing at them (though it does help a bit, but the linked article just doesn’t talk at all about this).
That is not what Hibbard actually proposed, it’s a superficial strawman version.
Hibbard claims we can design intelligent machines which love humans by training them to learn human happiness through facial expressions, voices, and body language.
EY claims this will fail and instead learn a utility function of “smiles”, resulting in an SI which tiles the future light-cone of Earth with tiny molecular smiley-faces, in a paper literally titled “Complex Value Systems are Required to Realize Valuable Futures”.
It has absolutely nothing to do with whether the AI could eventually learn human values (“the things behind the smiling faces actually want”), and everything to do with whether some ML system could learn said values to use them as the utility function for the AI (which is what Hibbard is proposing).
Neither Hibbard, EY, nor I are arguing about or discussing whether an SI can learn human values.
This is really misunderstanding what Eliezer is saying here, and also like, look, from my perspective it’s been a decade of explaining to people almost once every two weeks that “yes, the AI will of course know what you care about, but it won’t care”, so you claiming that this is somehow a new claim related to the deep learning revolution seems completely crazy to me, so I am experiencing a good amount of frustration with you repeatedly saying things in this comment thread like “irrefutable proof”, when it’s just like an obviously wrong statement (though like a fine one to arrive at when just reading some random subset of Eliezer’s writing, but a clearly wrong summary nevertheless).
Now to go back to the object level:
Eliezer is really not saying that the AI will fail to learn that there is something more complicated than smiles that the human is trying to point it to. He is explicitly saying “look, you won’t know what the AI will care about after giving it on the order of a million points. You don’t know what the global maximum of the simplest classifier for your sample set is, and very likely it will be some perverse instantiation that has little to do with what you originally cared about”.
He really really is not talking about the AI being too dumb to learn the value function the human is trying to get it to learn. Indeed, I still have no idea how you are reading that into the quoted passages.
Here is a post from 9 years ago, where the title is that exact point, written by Rob Bensinger who was working at MIRI at the time, with Eliezer as the top comment:
The genie knows, but doesn’t care
I encourage you to read some of the comments by Rob in that thread, which very clearly and unambiguously point to the core problem of “the difficult part is to get the AI to care about the right thing, not to understand the right thing”, all before the DL revolution.
I think this is much more ambiguous than you’re making it out to be. In 2008’s “Magical Categories”, Yudkowsky wrote:
I claim that this paragraph didn’t age well in light of the deep learning revolution: “running a neural network [...] over a set of winning and losing sequences of chess moves” basically is how AlphaZero learns from self-play! As the Yudkowsky quote illustrates, it wasn’t obvious in 2008 that this would work: given what we knew before seeing the empirical result, we could imagine that we lived in a “computational universe” in which the neural network’s generalization from “self-play games” to “games against humans or traditional chess engines” worked less well than it did in the actual computational universe.
Yudkowsky continued:
This would seem to contradict “of course the AI will know, but it won’t care”? “The real problem [...] is one of communication” seems to amount to the claim that the AI won’t care because it won’t know: if you can’t teach “goodness” from labeled data, your AI will search for plans high in something-other-than-goodness, which will kill you at sufficiently high power levels.
But if it turns out that you can teach goodness from labeled data—or at least, if you can get a much better approximation than one might have thought possible in 2008—that would seem to present a different strategic picture. (I’m not saying alignment is easy and I’m not saying humanity is going to survive, but we could die for somewhat different reasons than some blogger thought in 2008.)
I do think these are better quotes. It’s possible that there was some update here between 2008 and 2013 (roughly when I started seeing the more live discussion happening), since I do really remember the “the problem is not getting the AI to understand, but to care” as a common refrain even back then (e.g. see the Robby post I linked).
I agree that this paragraph aged less well than other paragraphs, though I do think this paragraph is still correct (Edit: Eh, it might be wrong; it depends a bit on how similar neural networks from the 50s are to today’s). It sure did turn out to be correct by a narrower margin than Eliezer probably thought at the time, but my sense is it’s still not the case that we can train a straightforward neural net on winning and losing chess moves and have it generate winning moves. For AlphaGo, the Monte Carlo Tree Search was a major component of its architecture, and the follow-up systems were trained by pure self-play.
But in any case, I think your basic point of “Eliezer did not predict the Deep Learning revolution as it happened” here is correct, though I don’t think this specific paragraph has a ton of relevance to the discussion at hand.
I do think this paragraph seems like a decent quote, though I think at this point it makes sense to break it out into different pieces.
I think Eliezer is saying that what matters is whether we can point the AI to what we care about “during its childhood”, i.e. during relatively early training, before it has already developed a bunch of proxy training objectives.
I think the key question about the future that I think Eliezer was opining on, is then whether by the time we expect AIs to actually be able to have a close-to-complete understanding of what we mean by “goodness”, we still have any ability to shape their goals.
My model is that indeed, Eliezer was surprised, as I think most people were, that AIs of 2022 are as good at picking up complicated concept boundaries and learning fuzzy human concepts as they are, while still being quite incompetent at many other tasks. However, I think the statement of “AIs of 2022 basically understand goodness, or at least will soon enough understand goodness while we are still capable of meaningfully changing their goals” strikes me as very highly dubious, and I think the basic arguments for thinking that this capability will come after the AI has reached a capability level where we have little ability to shape its goals still seem correct to me, and like, one of the primary reasons for doom.
The reason why it still seems substantially out of AIs’ reach is because our values do indeed seem quite fragile and to change substantially on reflection, such that it’s currently out of the reach of even a very smart human to fully understand what we mean by “goodness”.
Eliezer talks about this in the comment section you linked (actually, a great comment section between Eliezer and Shane Legg that I found quite insightful to read and am glad to have stumbled upon):
I think this argument is basically correct, and indeed, while current systems definitely are good at having human abstractions, I don’t think they really are anywhere close to having good models of the results of our coherent extrapolated volition, which is what Eliezer is talking about here. (To be clear, I do also separately think that LLMs are thinking about concepts for reasons other than deceiving or modeling humans, though like, I don’t think this changes the argument very much. I don’t think LLMs care very much about thinking carefully about morality, because it’s not very useful for predicting random internet text.)
I think separately, there is a different, indirect normativity approach that starts with “look, yes, we are definitely not going to get the AI to understand what our ultimate values are before the end, but maybe we can get it to understand a concept like ‘being conservative’ or ‘being helpful’ in enough detail that we can use it to supervise smarter AI systems, and then bootstrap ourselves into an aligned superintelligence”.
And I think indeed that plan looks better now than it likely looked to Eliezer in 2008, but I do want to distinguish it from the things that Eliezer was arguing against at the time, which were not about learning approaches to indirect normativity, but were arguments about how the AI would just learn all of human values by being pointed at a bunch of examples of good things and bad things, which still strikes me as extremely unlikely.
AlphaGo without the MCTS was still pretty strong:
Even with just the SL-trained policy network, it could play at a solid amateur level:
I may be misunderstanding this, but it sounds like the network that did nothing but get good at guessing the next move in professional games was able to play at roughly the same level as Pachi, which, according to DeepMind, had a rank of 2d.
Yeah, I mean, to be clear, I do definitely think you can train a neural network to somehow play chess via nothing but classification. I am not sure whether you could do it with a feed forward neural network, and it’s a bit unclear to me whether the neural networks from the 50s are the same thing as the neural networks from 2000s, but it does sure seem like you can just throw a magic category absorber at chess and then have it play OK chess.
My guess is modern networks are not meaningfully more complicated, and the difference to back then was indeed just scale and a few tweaks, but I am not super confident and haven’t looked much into the history here.
Really? Ok, let’s break this down phrase by phrase; tell me exactly where I am misunderstanding:
Did EY claim Hibbard’s plan will succeed or fail?
Did EY claim Hibbard’s plan will result in tiling the future light-cone of earth with tiny molecular smiley-faces?
Were these claims made in a paper titled “Complex Value Systems are Required to Realize Valuable Futures”?
I’ve been here since the beginning, and I’m not sure who you have been explaining that too, but it certainly was not me. And where did I claim this is something new related to deep learning?
I’m going to try to clarify this one last time. There are several different meanings of “learn human values”
1.) Training a machine learning model to learn to recognize happiness and unhappiness in human facial expressions, human voices and human body language, and using that as the utility function of the AI, such that it hopefully cares about human happiness. This is Hibbard’s plan from 2001 - long before DL. This model is trained before the AI becomes even human-level intelligent, and used as its initial utility/reward function.
2.) An AGI internally automatically learning human values as part of learning a model of the world—which would not automatically result in it caring about human values at all.
You keep confusing 1 and 2 - specifically you are confusing arguments concerning 2 directed at laypeople with Hibbard’s type 1 proposal.
Hibbard doesn’t believe that 2 will automatically work. Instead he is arguing for 1, and EY is saying that will fail. (And for the record, although EY’s criticism is overconfident, I am not optimistic about Hibbard’s plan as stated, but that was 2001)
Because I’m not?
Hibbard is attempting to make his AI care about safety at the onset (or at least happiness which is his version thereof), he’s not trying to pass the entire buck to the AI.
Will respond more later, but maybe this turns out to be the crux:
But “happiness” is not safety! That’s the whole point of this argument. If you optimize for your current conception of “happiness” you will get some kind of terrible thing that doesn’t remotely capture your values, because your values are fragile and you can’t approximate them by the process of “I just had my AI interact with a bunch of happy people and gave it positive reward, and a bunch of sad people and gave it negative reward”.
There are 2 separate issues here:
Would Hibbard’s approach successfully learn a stable, robust concept of human happiness suitable for use as the reward/utility function of AGI?
Conditional on 1, is ‘happiness’ what we actually want?
The answer to 2 depends much on how one defines happiness, but if happiness includes satisfaction (ie empowerment, curiosity, self-actualization etc—the basis of fun), then it is probably sufficient, but that’s not the core argument.
Notice that EY does not assume 1 and argue 2, he instead argues that Hibbard’s approach doesn’t learn a robust concept of happiness at all and instead learns a trivial superficial “maximize faciness” concept instead.
This is crystal clear and unambiguous:
He describes the result as a utility function of smiles, not a utility function of happiness.
So no, EY’s argument here is absolutely not about happiness being insufficient for safety. His argument is that happiness is incredibly complex and hard to learn a robust version of, and therefore Hibbard’s simplistic approach will learn some stupid superficial ‘faciness’ concept rather than happiness.
See also current debates around building a diamond-maximizing AI, where there is zero question of whether diamondness is what we want, and all the debate is around the (claimed) incredible difficulty of learning a robust version of even something simple like diamondness.
I think I am more interested in you reading The Genie Knows but Doesn’t Care and responding to the points in there than in the Hibbard example, since that post was written (as far as I can tell) to address common misunderstandings of the Hibbard debate, given that Robby linked it in a bunch of the discussion there after it was written.
I think there are some subtle things here. I think Eliezer!2008 would agree that AIs will totally learn a robust concept for “car”. But neither Eliezer!2008 nor I currently would think that current LLMs have any chance of learning a robust concept for “happiness” or “goodness”, in substantial part because I don’t have a robust concept of “happiness” or “goodness”, and before the AI refines those concepts further than I can, I sure expect it to be able to disempower me (though that’s not guaranteed to happen).
What Eliezer is arguing against is not that the AI will not learn any human concepts. It’s that there are a number of human concepts that tend to lean on the whole ontological structure of how humans think about the world (like “low-impact” or “goodness” or “happiness”), such that in order to actually build an accurate model of those, you have to do a bunch of careful thinking and need to really understand how humans view the world, and that people tend to be systematically optimistic about how convergent these kinds of concepts are, as opposed to them being contingent on the specific ways humans think.
My guess is an AI might very well spend sufficient cycles on figuring out human morality after it has access to a solar-system level of compute, but I think that is unlikely to happen before it has disempowered us, so the ordering here matters a lot (see e.g. my response to Zack above).
So there are three separate points here that I think have caused confusion and probably caused us to talk past each other for a while, all of which were things Eliezer was thinking about, at least around 2013-2014 (I know less about 2008):
Low-powered AI systems will have a really hard time learning high-level human concepts like “happiness”, and if you try to naively get them to learn that concept (by e.g. pointing them towards smiling humans) you will get some kind of abomination, since even humans have trouble with those kinds of concepts
It is likely that by the time an AI will understand what humans actually really want, we will not have much control over its training process, and so despite it now understanding those constraints, we will have no power to shape its goals towards that
Even if we and the AI had a very crisp and clear concept of a goal I would like the AI to have, humanity won’t know how to actually cause the AI to point towards that as a goal (see e.g. the diamond maximizer problem)
To now answer your concrete questions:
My first response to this is: “I mean, of course not at current LLM capabilities. Ask GPT-3 about happiness, and you will get something dumb and incoherent back. If you keep going and make more capable systems try to do this, it’s pretty likely your classifier will be smart enough to kill you to have more resources to drive the prediction error downwards before it actually arrives at a really deep understanding of human happiness (which appears to require substantial superhuman abilities, given that humans do not have a coherent model of happiness themselves)”
So no, I don’t think Hibbard’s approach would work. Separately, we have no idea how to use a classifier as a reward/utility function for an AGI, so that part of the approach also wouldn’t work. Like, what do you actually concretely propose we do after we have a classifier over video frames that causes a separate AI to then actually optimize for the underlying concept boundary?
But even if you ignore both of these problems, and you avoid the AI killing you in pursuit of driving down prediction error, and you somehow figure out how to take a classifier and use it as a utility function, then you are still not in good shape, because the AI will likely be able to achieve lower prediction error by modeling the humans doing the labeling of the data you provide, and modeling what errors they are actually making, and so will learn the more natural concept of “things that look happy to humans” instead of the actual happiness concept.
This is a really big deal, because if you start giving an AI the “things that look happy to humans” concept, you will end up with an AI that gets really good at deceiving humans and convincing them that something is happy, which will quickly involve humans getting fooled and disempowered, and in the limit might produce something surprisingly close to a universe tiled in smiley faces (convincing enough that if you pointed a video camera at it, a rater looking at it for 15 seconds would indeed be convinced it was happy, even though there are no raters around).
I think Hibbard’s approach fails for all three reasons that I listed above, and I don’t think modern systems somehow invalidate any of those three reasons. I do think (as I have said in other comments) that modern systems might make indirect normativity approaches more promising, but I don’t think it moves the full value-loading problem anywhere close to the domain of solvability with current systems.
Looking over that it just seems to be a straightforward extrapolation of EY’s earlier points, so I’m not sure why you thought it was especially relevant.
Yeah—this is his core argument against Hibbard. I think Hibbard 2001 would object to ‘low-powered’, and would probably have other objections I’m not modelling, but regardless I don’t find this controversial.
Yeah, in agreement with what I said earlier:
...
I believe I know what you meant, but this seems somewhat confused as worded. If we can train an ML model to learn a very crisp clear concept of a goal, then having the AI optimize for this (point towards it) is straightforward. Long term robustness may be a different issue, but I’m assuming that’s mostly covered under “very crisp clear concept”.
The issue of course is that what humans actually want is complex for us to articulate, let alone formally specify. The update since 2008/2011 is that DL may be able to learn a reasonable proxy of what we actually want, even if we can’t fully formally specify it.
I think this is something of a red herring. Humans can reasonably predict the utility functions of other humans in complex scenarios simply by simulating the other as self, i.e. through empathy. Also, happiness probably isn’t the correct thing; we probably want the AI to optimize for our empowerment (future optionality), but that’s a whole separate discussion.
Sure, neither do I.
A classifier is a function which maps high-D inputs to a single categorical variable, and a utility function just maps some high-D input to a real number; but a k-categorical variable is just an explicit binned model of a log(k)-bit number, so these really aren’t that different, and there are many interpolations in between (in fact it’s sometimes better to use the more expensive categorical model for regression).
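To make that interpolation concrete, here is a minimal sketch (illustrative only, not anyone’s proposed design): read a k-way classification head as a binned regression by treating the classes as bin centers on a scalar range and taking the expectation under the softmax.

```python
import torch

def categorical_to_scalar(logits: torch.Tensor, lo: float = 0.0, hi: float = 1.0) -> torch.Tensor:
    """Read a k-way classification head as a binned regression.

    The k classes are treated as evenly spaced bin centers over [lo, hi];
    the scalar output is the expected bin value under the softmax. This is
    the usual trick for doing regression with a categorical head.
    """
    k = logits.shape[-1]
    bin_centers = torch.linspace(lo, hi, k)   # explicit bins over the scalar range
    probs = torch.softmax(logits, dim=-1)     # categorical distribution over bins
    return (probs * bin_centers).sum(dim=-1)  # expectation = scalar score

# e.g. a 256-way head gives roughly 8 bits of resolution on the scalar:
scores = categorical_to_scalar(torch.randn(4, 256))
```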
Video frames? The utility function needs to be over predicted future world states... which you could, I guess, use to render out videos, but text renderings are probably more natural.
I propose we actually learn how the brain works, and how evolution solved alignment, to better understand our values and reverse engineer them. That is probably the safest approach—having a complete understanding of the brain.
However, I’m also somewhat optimistic on theoretical approaches that focus more explicitly on optimizing for external empowerment (which is simpler and more crisp), and how that could be approximated pragmatically with current ML approaches. Those two topics are probably my next posts.
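For reference, and assuming the standard information-theoretic notion is what’s intended here, empowerment is usually formalized as the channel capacity from an agent’s next n actions to the resulting future state:

\[
\mathfrak{E}_n(s_t) \;=\; \max_{p(a_{t:t+n})} I\big(A_{t:t+n};\, S_{t+n} \mid s_t\big),
\]

i.e. how much influence (future optionality) the agent can exert over where the world ends up; “external empowerment” would apply this quantity to another agent, such as the human.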
Our best conditional generative models sample from a conditional distribution; they don’t optimize for feature-ness. The GAN analogy is also mostly irrelevant because diffusion models have taken over for conditional generation, and Nate’s comment seems confused as applied to diffusion models.
Nate’s comment isn’t confused, he’s not talking about diffusion models, he’s talking about the kinds of AI that could take over the world and reshape it to optimize for some values/goals/utility-function/etc.
Katja says:
Nate’s comment:
Katja is talking about current ML systems and how the fragility issue EY predicted didn’t materialize (actually it arguably did in earlier systems). Nate’s comment is clearly referencing Katja’s analogy—faciness—and he’s clearly implying we haven’t seen the problem with face generators yet because they haven’t pushed the optimization hard enough to find the maximally-facelike image. But he’s just wrong there—they don’t have that problem, no matter how hard you scale their optimization power—and that is part of why Katja’s analogy works so well at a deeper level: future ML systems do not work the way AI risk folks thought they would.
Diffusion models are relevant because they improve on conditional GANs by leveraging powerful pretrained discriminative foundation models and by allowing for greater optimization power at inference time, improvements that also could be applied to planning agents.
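To spell out the mechanism (my gloss on standard classifier-guided diffusion, not a quote from anyone in this thread): the sampler follows a guided score in which the pretrained discriminative model enters through its gradient, and the guidance weight \(w\) is the inference-time knob for how much optimization power to apply:

\[
\nabla_x \log p_w(x \mid y) \;\approx\; \nabla_x \log p(x) \;+\; w\, \nabla_x \log p(y \mid x),
\]

with \(w = 1\) roughly sampling from the true conditional and larger \(w\) pushing samples toward whatever the classifier scores highly; the same knob shows up in the diffusion-planning point below.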
ML systems still use plenty of reinforcement learning, and systems that apply straightforward optimization pressure. We’ve also built a few systems more recently that do something closer to trying to recreate samples from a distribution, but that doesn’t actually help you improve on (or even achieve) human-level performance. In order to improve on human level performance, you either have to hand-code ontologies (by plugging multiple simulator systems together in a CAIS fashion), or just do something like reinforcement learning, which then very quickly does display the error modes everyone is talking about.
Current systems do not lack edge-instantiation behavior. Some of them seem more robust, but the ones that do also seem fundamentally limited (and also, they will likely still show edge-instantiation for their inner objective, but that’s harder to talk about).
And also just to make the very concrete point, Katja linked to a bunch of faces generated by a GAN, which straightforwardly has the problems people are talking about, so there really is no mismatch in the kinds of things that Katja is talking about, and Nate is talking about. We could perform a more optimized search for things that are definitely faces according to the discriminator, and we would likely get something horrifying.
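As a concrete illustration of the kind of “more optimized search” being described, here is a hypothetical sketch of gradient ascent on a GAN latent against the discriminator’s score. The `generator` and `discriminator` below are toy stand-ins, not any specific library’s API:

```python
import torch

# Toy stand-ins for a pretrained GAN's generator and discriminator; a real
# StyleGAN-style checkpoint exposes analogous pieces, but these handles are
# hypothetical.
generator = torch.nn.Linear(512, 64 * 64 * 3)      # latent -> flattened image
discriminator = torch.nn.Linear(64 * 64 * 3, 1)    # image -> "faciness" logit

z = torch.zeros(1, 512, requires_grad=True)        # start at the latent prior's mode
opt = torch.optim.Adam([z], lr=0.05)

for _ in range(1000):
    opt.zero_grad()
    loss = -discriminator(generator(z)).sum()      # ascend the discriminator's score
    loss.backward()
    opt.step()

# generator(z) is now chosen purely for discriminator score, not typicality;
# with real networks this is where the off-distribution, "horrifying" images
# are expected to live.
```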
Sure you could do that, but people usually don’t—unless they intentionally want something horrifying. So if your argument is now “sure, new ML systems totally can solve the faciness problem, but only if you choose to use them correctly”—then great, finally we agree.
Interestingly enough, in diffusion planning models, if you crank up the discriminator you get trajectories that are higher utility but increasingly unrealistic; crank it down and you get lower-utility trajectories.
Clarifying questions, either for you or for someone else, to resolve my own confusion:
What does “applying optimization pressure” mean? Is steering random noise into the narrow part of configuration space that contains plausible images-of-X (the thing DDPMs and GAN generators do) a straightforward example of it?
EDIT: Split up above question into two.
This feels like something we should just test? I don’t have access to any such model but presumably someone does and can just run the experiment? Because it seems like people’s hunches are varying a lot here.
Also, we don’t know what would happen if we exactly optimized an image to maximize the activation of a particular human’s face detection circuitry. I expect that the result would be pretty eldritch as well.
We may already be doing that in the case of cartoon faces, with their exaggerated features. Cartoon faces don’t look eldritch to us, but why would they?
They are still smooth and have low-frequency patterns, which seems to be the main difference from adversarial examples currently produced from DL models.
Yeah. Wake me up when we find a single agent which makes decisions by extremizing its own concept activations. E.g. I’m pretty sure that people don’t reflectively, most strongly want to make friends with entities which maximally activate their potential-friend detection circuitry.
(sort of nitpicking):
I think it makes more sense to look for the highest density in pixel space; this requires integrating over all settings of the latents (unless your generator is invertible, in which case you can just use the change-of-variables formula). I expect the argument to go through, but it would be interesting to do this with an invertible generator (e.g. a normalizing flow) and see if it actually does.
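For reference, with an invertible generator \(g\) mapping latents \(z \sim p_Z\) to images \(x = g(z)\), the pixel-space density in question is given by the change-of-variables formula:

\[
\log p_X(x) \;=\; \log p_Z\!\big(g^{-1}(x)\big) \;+\; \log \left| \det \frac{\partial g^{-1}(x)}{\partial x} \right|,
\]

so the highest-density image need not be the image generated from the highest-density latent, because of the Jacobian term; for a non-invertible generator you would instead have to integrate over all latent settings consistent with \(x\), as noted above.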