Hmm, but I don’t understand what relevance it has to alignment. The problem was never that the AI won’t learn human values, it’s that the AI won’t care about human values. Of course a super intelligent AI will have a good model of human values, the same way it will have a good model of engineering, chemistry, the ecological environment and physics. That doesn’t mean it will do things that are aligned with its accurate model of human values.
I am not sure who thought that learning such models was much harder than it turned out to be. It seems clear that an AI will learn what human faces are before the AI is very dangerous to the world. It would have been extremely surprising to have a dangerous AGI incapable of learning what human faces are like.
Hmm, but I don’t understand what relevance it has to alignment. The problem was never that the AI won’t learn human values, it’s that the AI won’t care about human values
Around the time of the sequences (long before DL) it was much less obvious that AI could/would learn accurate models of complex human values before it killed us, so that very much was believed to be part of the problem (at least by the EY/MIRI/LW/etc crowd).
But that’s all now mostly irrelevant—an altruistic AI probably doesn’t even need to know or care about human values at all, as it can simply optimize for our empowerment—our future optionality, or ability to do anything we want. (Some previous discussion here, and in these comments.)
I wasn’t that active around the time of the sequences, but I had a good number of discussions with people, and the point “the AI will of course know what your values are, it just won’t care” was made many times. I am also pretty sure it was made in the sequences (I would have to dig it up, and am on my phone, but I heard that sentence in spoken conversation a lot over the years).
I don’t think “empowerment” is the kind of concept that particularly survives heavy optimization pressure, though it seems worth investigating.
Around the time of the sequences (long before DL) it was much less obvious that AI could/would learn accurate models of complex human values before it killed us
the point “the AI will of course know what your values are, it just won’t care” was made many times, and I am also pretty sure was made in the sequences
Notice I said “before it killed us”. Sure, the AI may learn detailed models of humans and human values at some point during its superintelligent FOOMing, but that’s irrelevant because we need to instill its utility function long before that. See my reply here; this is well documented, and no amount of vague memories of conversations trumps the written evidence.
I don’t think “empowerment” is the kind of concept that particularly survives heavy optimization pressure, though it seems worth investigating.
I’m not entirely sure what people mean when they say “X won’t survive heavy optimization pressure”—but for example, the objective of modern diffusion models survives heavy optimization pressure.
External empowerment is very simple, and it doesn’t even require detailed modeling of the agent—the agent can just be a black box that produces outputs. I’m curious what you think is an example of “the kind of concept that particularly survives heavy optimization pressure”.
Oh—empowerment is about as immune to Goodharting as you can get, and that’s perhaps one of its major advantages[1]. However in practice one has to use some approximation, which may or may not be goodhartable to some degree depending on many details.
Empowerment is vastly more difficult to Goodhart than a corporation optimizing for some bundle of currencies (including crypto), much more difficult to Goodhart than optimizing for control over even more fundamental physical resources like mass and energy, and is generally the least-Goodhartable objective that could exist. In some sense the universal version of Goodharting—properly defined—is just a measure of deviation from empowerment. It is the core driver of human intelligence, and for good reason.
Can you explain further? This seems to me like a very large claim that, if true, would have a big impact, but I’m not sure how you got the immunity-to-Goodhart result you have here.
This applies to Regressional, Causal, Extremal and Adversarial Goodhart.
Empowerment could be defined as the natural unique solution to Goodharting. Goodharting is the divergence under optimization scaling between trajectories resulting from the difference between a utility function and some proxy of that utility function.
However due to instrumental convergence, the trajectories of all reasonable agent utility functions converge under optimization scaling—and empowerment simply is that which they converge to.
In other words the empowerment of some agent P(X) is the utility function which minimizes trajectory distance to all/any reasonable agent utility functions U(X), regardless of their specific (potentially unknown) form.
Therefore empowerment is—by definition—the best possible proxy utility function (under optimization scaling).
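The definitions in the last few paragraphs can be written out more explicitly. As a hedged sketch (the channel-capacity formula is the standard formalization of empowerment from the literature; the arg-min statement is my rendering of the convergence claim, with notation that is not from this thread):

```latex
% Standard formalization: empowerment at state s is the channel
% capacity from the agent's n-step action sequences A^n to the
% resulting state S_{t+n}.
\[
  \mathfrak{E}_n(s) = \max_{p(a^n)} I\bigl(A^n;\, S_{t+n} \mid s\bigr)
\]
% Rendering of the convergence claim: with \tau_k(U) the trajectory
% of an agent optimizing utility U at optimization scale k, and
% d(\cdot,\cdot) a distance on trajectories, empowerment is proposed
% as the proxy V minimizing worst-case divergence over all
% reasonable agent utilities U.
\[
  V^{*} = \operatorname*{arg\,min}_{V} \, \max_{U \in \mathcal{U}} \, \lim_{k \to \infty} d\bigl(\tau_k(U),\, \tau_k(V)\bigr)
\]
```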
Let’s apply some quick examples:
Under scaling, an AI with some crude Hibbard-style happiness approximation will first empower itself and then eventually tile the universe with smiling faces (according to EY), or perhaps more realistically—with humans bio-engineered for docility, stupidity, and maximum bliss. Happiness alone is not the true human utility function.
Under scaling, an AI with some crude stock-value maximizing utility function will first empower itself and then eventually cause hyperinflation of the reference currencies defining the stock price. Stock value is not the true utility function of the corporation.
Under scaling, an AI with a human empowerment utility function will first empower itself, and then empower humanity—maximizing our future optionality and ability to fulfill any unknown goals/values, while ensuring our survival (because death is the minimally empowered state). This works because empowerment is pretty close to the true utility function of intelligent agents due to convergence, or at least the closest universal proxy. If you strip away a human’s drives for sex, food, child tending and simple pleasures, most of what remains is empowerment-related (manifesting as curiosity, drive, self-actualization, fun, self-preservation, etc).
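The “death is the minimally empowered state” point in the examples above can be made concrete. For deterministic dynamics, the information-theoretic formalization of empowerment reduces to the log of the number of distinct reachable states. A minimal sketch under toy assumptions (the 5×5 gridworld and the absorbing “dead” state are illustrative, not from the thread):

```python
import math

# Empowerment of a state = channel capacity from n-step action
# sequences to resulting states. For deterministic dynamics this
# reduces to log2(number of distinct states reachable in n steps).

ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]  # moves + "stay"
SIZE = 5          # 5x5 gridworld (illustrative)
DEAD = "dead"     # absorbing state: every action leads back to it

def step(state, action):
    """Deterministic transition: move within the grid, stay put at walls."""
    if state == DEAD:
        return DEAD
    (x, y), (dx, dy) = state, action
    nx, ny = x + dx, y + dy
    return (nx, ny) if 0 <= nx < SIZE and 0 <= ny < SIZE else (x, y)

def empowerment(state, n):
    """n-step empowerment in bits: log2 of distinct reachable states."""
    reachable = {state}
    for _ in range(n):
        reachable = {step(s, a) for s in reachable for a in ACTIONS}
    return math.log2(len(reachable))

print(empowerment((2, 2), 2))  # grid center: 13 reachable states, ~3.70 bits
print(empowerment((0, 0), 2))  # corner: only 6 reachable states, ~2.58 bits
print(empowerment(DEAD, 2))    # dead state: 1 reachable state, 0.0 bits
```

This matches the claim above: the absorbing “dead” state has exactly zero empowerment, so an agent maximizing another agent’s empowerment has a built-in reason to keep it alive and unconfined.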
An AI with a good world model will predictably have a model of your values, but that’s different from being able to actually elicit that model via e.g. a series of labeled examples. That’s the part that seemed less plausible before DL.
Around the time of the sequences (long before DL) it was much less obvious that AI could/would learn accurate models of complex human values before it killed us, so that very much was believed to be part of the problem (at least by the EY/MIRI/LW/etc crowd).
Do you have a link to where Eliezer (or any other LW writer) said that? I don’t myself recall whether they said that.
I may be exaggerating a tiny tiny bit with the “before it killed us” modifier, and I don’t have time to search for this specific needle—but EY famously criticized some early safety proposal which consisted of somehow using a ‘smiling face’ detector to train an AI to recognize human happiness, and then optimizing for that.
We can design intelligent machines so their primary innate emotion is unconditional love for all humans. First we can build relatively simple machines that learn to recognize happiness and unhappiness in human facial expressions, human voices and human body language. Then we can hard-wire the result of this learning as the innate emotional values of more complex intelligent machines, positively reinforced when we are happy and negatively reinforced when we are unhappy. Machines can learn algorithms for approximately predicting the future, as for example investors currently use learning machines to predict future security prices. So we can program intelligent machines to learn algorithms for predicting future human happiness, and use those predictions as emotional values.
When I suggested to Hibbard that the upshot of building superintelligences with a utility function of “smiles” would be to tile the future light-cone of Earth with tiny molecular smiley-faces, he replied (Hibbard 2006):
When it is feasible to build a super-intelligence, it will be feasible to build hard-wired recognition of “human facial expressions, human voices and human body language” (to use the words of mine that you quote) that exceed the recognition accuracy of current humans such as you and me, and will certainly not be fooled by “tiny molecular pictures of smiley-faces.” You should not assume such a poor implementation of my idea that it cannot make discriminations that are trivial to current humans.
EY’s counterargument is that human values are much more complex than happiness—let alone smiles; an AI optimizing for smiles just ends up tiling the universe with smile icons—so it’s just a different flavour of paperclip maximizer. Then he spends a bunch of words on the complexity of value stuff to preempt the more complex versions of the smile detector. If human values were known to be simple, then getting machines to learn them robustly would likely be simple, and EY could have done something else with those 20+ years.
Also, in EY’s model, when the AI becomes superintelligent (which may only take a day or so after it becomes just upper-human-level intelligent and ‘rewrites its source code’), it then quickly predicts the future, realizes humans are in the way, solves Drexler-style strong nanotech, and then kills us all. Those latter steps are very fast.
I don’t know what relevance this has to the discussion at hand. A deep learning model trained on human smiling faces might indeed very well tile the universe with smiley-faces; I don’t understand why that’s wrong. Sure, it will likely do something weirder and less predictable (we don’t understand the neural network prior very well), but optimizing for smiling humans still doesn’t produce anything remotely aligned.
EY’s counterargument is that human values are much more complex than happiness—let alone smiles; an AI optimizing for smiles just ends up tiling the universe with smile icons—so it’s just a different flavour of paperclip maximizer. Then he spends a bunch of words on the complexity of value stuff to preempt the more complex versions of the smile detector. If human values were known to be simple, then getting machines to learn them robustly would likely be simple, and EY could have done something else with those 20+ years.
Nothing in the quoted section, or in the document you linked that I just skimmed, includes anything about the AI not being able to learn what the things behind the smiling faces actually want. Indeed, none of that matters, because the AI has no reason to care. You gave it a few thousand to a million samples of smiling, and now the system is optimizing for smiling; you got what you put in.
Eliezer indeed explicitly addresses this point and says:
As far as I know, Hibbard has still not abandoned his proposal as of the time of this writing. So far as I can tell, to him it remains self-evident that no superintelligence would be stupid enough to thus misinterpret the code handed to it, when it’s obvious what the code is supposed to do. (Note that the adjective “stupid” is the Humean-projective form of “ranking low in preference,” and that the adjective “pointless” is the projective form of “activity not leading to preference satisfaction.”)
He is explicitly saying “Hibbard is confusing being ‘smart’ with ‘caring about the right things’”: the AI will be plenty capable of realizing that it isn’t doing what you wanted it to, but it just doesn’t care. Being smarter does not help with getting it to do the thing you want; that’s the whole point of the alignment problem. Similarly, AIs being able to understand human values better just doesn’t help you that much with pointing at them (though it does help a bit—the linked article just doesn’t talk at all about this).
A deep learning model trained on human smiling faces might indeed very well tile the universe with smiley-faces, I don’t understand why that’s wrong.
That is not what Hibbard actually proposed, it’s a superficial strawman version.
I don’t know what relevance this has to the discussion at hand.
Hibbard claims we can design intelligent machines which love humans, by training them to learn human happiness through facial expressions, voices, and body language.
EY claims this will fail and instead learn a utility function of “smiles”, resulting in an SI which tiles the future light-cone of Earth with tiny molecular smiley-faces, in a paper literally titled “Complex Value Systems are Required to Realize Valuable Futures”.
Nothing in the quoted section, or in the document you linked that I just skimmed includes anything about the AI not being able to learn what the things behind the smiling faces actually want. Indeed none of that matters, because the AI has no reason to care.
It has absolutely nothing to do with whether the AI could eventually learn human values (“the things behind the smiling faces actually want”), and everything to do with whether some ML system could learn said values to use them as the utility function for the AI (which is what Hibbard is proposing).
Neither Hibbard, EY, nor I are arguing about or discussing whether an SI can learn human values.
EY claims this will fail and instead learn a utility function of “smiles”, resulting in an SI which tiles the future light-cone of Earth with tiny molecular smiley-faces, in a paper literally titled “Complex Value Systems are Required to Realize Valuable Futures”.
This is really misunderstanding what Eliezer is saying here. Also, look, from my perspective it’s been a decade of explaining to people almost once every two weeks that “yes, the AI will of course know what you care about, but it won’t care”, so you claiming that this is somehow a new claim related to the deep learning revolution seems completely crazy to me. I am experiencing a good amount of frustration with you repeatedly saying things in this comment thread like “irrefutable proof”, when it’s just an obviously wrong statement (though a fine one to arrive at when just reading some random subset of Eliezer’s writing, it is a clearly wrong summary nevertheless).
Now to go back to the object level:
Eliezer is really not saying that the AI will fail to learn that there is something more complicated than smiles that the human is trying to point it to. He is explicitly saying “look, you won’t know what the AI will care about after giving it on the order of a million points. You don’t know what the global maximum of the simplest classifier for your sample set is, and very likely it will be some perverse instantiation that has little to do with what you originally cared about”.
He really really is not talking about the AI being too dumb to learn the value function the human is trying to get it to learn. Indeed, I still have no idea how you are reading that into the quoted passages.
Here is a post from 9 years ago, where the title is that exact point, written by Rob Bensinger who was working at MIRI at the time, with Eliezer as the top comment:
If an artificial intelligence is smart enough to be dangerous, we’d intuitively expect it to be smart enough to know how to make itself safe. But that doesn’t mean all smart AIs are safe. To turn that capacity into actual safety, we have to program the AI at the outset — before it becomes too fast, powerful, or complicated to reliably control — to already care about making its future self care about safety. That means we have to understand how to code safety. We can’t pass the entire buck to the AI, when only an AI we’ve already safety-proofed will be safe to ask for help on safety issues!
I encourage you to read some of the comments by Rob in that thread, which very clearly and unambiguously point to the core problem of “the difficult part is to get the AI to care about the right thing, not to understand the right thing”, all before the DL revolution.
This is really misunderstanding what Eliezer is saying here [...] it’s been a decade of explaining to people almost once every two weeks that “yes, the AI will of course know what you care about, but it won’t care”, so you claiming that this is somehow a new claim related to the deep learning revolution seems completely crazy to me
I think this is much more ambiguous than you’re making it out to be. In 2008’s “Magical Categories”, Yudkowsky wrote:
I shall call this the fallacy of magical categories—simple little words that turn out to carry all the desired functionality of the AI. Why not program a chess-player by running a neural network (that is, a magical category-absorber) over a set of winning and losing sequences of chess moves, so that it can generate “winning” sequences? Back in the 1950s it was believed that AI might be that simple, but this turned out not to be the case.
I claim that this paragraph didn’t age well in light of the deep learning revolution: “running a neural network [...] over a set of winning and losing sequences of chess moves” basically is how AlphaZero learns from self-play! As the Yudkowsky quote illustrates, it wasn’t obvious in 2008 that this would work: given what we knew before seeing the empirical result, we could imagine that we lived in a “computational universe” in which the neural network’s generalization from “self-play games” to “games against humans or traditional chess engines” worked less well than it did in the actual computational universe.
Yudkowsky continued:
The novice thinks that Friendly AI is a problem of coercing an AI to make it do what you want, rather than the AI following its own desires. But the real problem of Friendly AI is one of communication—transmitting category boundaries, like “good”, that can’t be fully delineated in any training data you can give the AI during its childhood.
This would seem to contradict “of course the AI will know, but it won’t care”? “The real problem [...] is one of communication” seems to amount to the claim that the AI won’t care because it won’t know: if you can’t teach “goodness” from labeled data, your AI will search for plans high in something-other-than-goodness, which will kill you at sufficiently high power levels.
But if it turns out that you can teach goodness from labeled data—or at least, if you can get a much better approximation than one might have thought possible in 2008—that would seem to present a different strategic picture. (I’m not saying alignment is easy and I’m not saying humanity is going to survive, but we could die for somewhat different reasons than some blogger thought in 2008.)
I do think these are better quotes. It’s possible that there was some update here between 2008 and 2013 (roughly when I started seeing the more live discussion happening), since I do really remember the “the problem is not getting the AI to understand, but to care” as a common refrain even back then (e.g. see the Robby post I linked).
I claim that this paragraph didn’t age well in light of the deep learning revolution: “running a neural network [...] over a set of winning and losing sequences of chess moves” basically is how AlphaZero learns from self-play!
I agree that this paragraph aged less well than other paragraphs, though I do think this paragraph is still correct (Edit: Eh, it might be wrong, depends a bit on how much neural networks in the 50s are the same as today). It did sure turn out to be correct by a narrower margin than Eliezer probably thought at the time, but my sense is it’s still not the case that we can train a straightforward neural net on winning and losing chess moves and have it generate winning moves. For AlphaGo, the Monte Carlo Tree Search was a major component of its architecture, and each of the follow-up systems was trained by pure self-play.
But in any case, I think your basic point of “Eliezer did not predict the Deep Learning revolution as it happened” here is correct, though I don’t think this specific paragraph has a ton of relevance to the discussion at hand.
The novice thinks that Friendly AI is a problem of coercing an AI to make it do what you want, rather than the AI following its own desires. But the real problem of Friendly AI is one of communication—transmitting category boundaries, like “good”, that can’t be fully delineated in any training data you can give the AI during its childhood.
I do think this paragraph seems like a decent quote, though I think at this point it makes sense to break it out into different pieces.
I think Eliezer is saying that what matters is whether we can point the AI to what we care about “during its childhood”, i.e. during relatively early training, before it has already developed a bunch of proxy training objectives.
I think the key question about the future that Eliezer was opining on is whether, by the time we expect AIs to actually be able to have a close-to-complete understanding of what we mean by “goodness”, we still have any ability to shape their goals.
My model is that indeed, Eliezer was surprised, as I think most people were, that AIs of 2022 are as good at picking up complicated concept boundaries and learning fuzzy human concepts as they are, while still being quite incompetent at many other tasks. However, the statement “AIs of 2022 basically understand goodness, or at least will soon enough understand goodness while we are still capable of meaningfully changing their goals” strikes me as very highly dubious. The basic arguments for thinking that this capability will come after the AI has reached a capability level where we have little ability to shape its goals still seem correct to me, and are, like, one of the primary reasons for doom.
The reason why it still seems substantially out of AIs’ reach is that our values do indeed seem quite fragile and to change substantially on reflection, such that it’s currently out of the reach of even a very smart human to fully understand what we mean by “goodness”.
Eliezer talks about this in the comment section you linked (actually, a great comment section between Eliezer and Shane Legg that I found quite insightful to read and am glad to have stumbled upon):
A moderately strong and unFriendly intelligence, operating in the current world without yet having replaced that world with paperclips, would certainly find it natural to form the category of “Things that (some) humans approve of”, and contrast it to “Things that will trigger a nuclear attack against me before I’m done creating my own nanotechnology.” But this category is not what we call “morality”. It naturally—from the AI’s perspective—includes things like bribes and deception, not just the particular class of human-approval-eliciting phenomena that we call “moral”.
Is it worth factoring out phenomena that elicit human feelings of righteousness, and working out how (various) humans reason about them? Yes, because this is an important subset of ways to persuade the humans to leave you alone until it’s too late; but again, that natural category is going to include persuasive techniques like references to religious authority and nationalism.
But what if the AI encounters some more humanistic, atheistic types? Then the AI will predict which of several available actions is most likely to make an atheistic humanist human show sympathy for the AI. This naturally leads the AI to model and predict the human’s internal moral reasoning—but that model isn’t going to distinguish anything along the lines of moral reasoning the human would approve of under long-term reflection, or moral reasoning the human would approve knowing the true facts. That’s just not a natural category to the AI, because the human isn’t going to get a chance for long-term reflection, and the human doesn’t know the true facts.
The natural, predictive, manipulative question, is not “What would this human want knowing the true facts?”, but “What will various behaviors make this human believe, and what will the human do on the basis of these various (false) beliefs?”
In short, all models that an unFriendly AI forms of human moral reasoning, while we can expect them to be highly empirically accurate and well-calibrated to the extent that the AI is highly intelligent, would be formed for the purpose of predicting human reactions to different behaviors and events, so that these behaviors and events can be chosen manipulatively.
But what we regard as morality is an idealized form of such reasoning—the idealized abstracted dynamic built out of such intuitions. The unFriendly AI has no reason to think about anything we would call “moral progress” unless it is naturally occurring on a timescale short enough to matter before the AI wipes out the human species. It has no reason to ask the question “What would humanity want in a thousand years?” any more than you have reason to add up the ASCII letters in a sentence.
Now it might be only a short step from a strictly predictive model of human reasoning, to the idealized abstracted dynamic of morality. If you think about the point of CEV, it’s that you can get an AI to learn most of the information it needs to model morality, by looking at humans—and that the step from these empirical models, to idealization, is relatively short and traversable by the programmers directly or with the aid of manageable amounts of inductive learning. Though CEV’s current description is not precise, and maybe any realistic description of idealization would be more complicated.
But regardless, if the idealized computation we would think of as describing “what is right” is even a short distance of idealization away from strictly predictive and manipulative models of what humans can be made to think is right, then “actually right” is still something that an unFriendly AI would literally never think about, since humans have no direct access to “actually right” (the idealized result of their own thought processes) and hence it plays no role in their behavior and hence is not needed to model or manipulate them.
Which is to say, an unFriendly AI would never once think about morality—only a certain psychological problem in manipulating humans, where the only thing that matters is anything you can make them believe or do. There is no natural motive to think about anything else, and no natural empirical category corresponding to it.
I think this argument is basically correct, and indeed, while current systems definitely are good at having human abstractions, I don’t think they really are anywhere close to having good models of the results of our coherent extrapolated volition, which is what Eliezer is talking about here. (To be clear, I do also separately think that LLMs are thinking about concepts for reasons other than deceiving or modeling humans, though like, I don’t think this changes the argument very much. I don’t think LLMs care very much about thinking carefully about morality, because it’s not very useful for predicting random internet text.)
I think separately, there is a different, indirect normativity approach that starts with “look, yes, we are definitely not going to get the AI to understand what our ultimate values are before the end, but maybe we can get it to understand a concept like ‘being conservative’ or ‘being helpful’ in enough detail that we can use it to supervise smarter AI systems, and then bootstrap ourselves into an aligned superintelligence”.
And I think indeed that plan looks better now than it likely looked to Eliezer in 2008, but I do want to distinguish it from the things that Eliezer was arguing against at the time, which were not about learning approaches to indirect normativity, but were arguments about how the AI would just learn all of human values by being pointed at a bunch of examples of good things and bad things, which still strikes me as extremely unlikely.
it’s still not the case that we can train a straightforward neural net on winning and losing chess moves and have it generate winning moves. For AlphaGo, the Monte Carlo Tree Search was a major component of its architecture, and each of the follow-up systems was trained by pure self-play.
We also assessed variants of AlphaGo that evaluated positions using just the value network (λ = 0) or just rollouts (λ = 1) (see Fig. 4b). Even without rollouts AlphaGo exceeded the performance of all other Go programs, demonstrating that value networks provide a viable alternative to Monte Carlo evaluation in Go.
Even with just the SL-trained value network, it could play at a solid amateur level:
We evaluated the performance of the RL policy network in game play, sampling each move...from its output probability distribution over actions. When played head-to-head, the RL policy network won more than 80% of games against the SL policy network. We also tested against the strongest open-source Go program, Pachi14, a sophisticated Monte Carlo search program, ranked at 2 amateur dan on KGS, that executes 100,000 simulations per move. Using no search at all, the RL policy network won 85% of games against Pachi.
I may be misunderstanding this, but it sounds like the network that did nothing but get good at guessing the next move in professional games was able to play at roughly the same level as Pachi, which, according to DeepMind, had a rank of 2d.
Yeah, I mean, to be clear, I do definitely think you can train a neural network to somehow play chess via nothing but classification. I am not sure whether you could do it with a feed-forward neural network, and it’s a bit unclear to me whether the neural networks from the 50s are the same thing as the neural networks from the 2000s, but it does sure seem like you can just throw a magic category absorber at chess and then have it play OK chess.
My guess is modern networks are not meaningfully more complicated, and the difference to back then was indeed just scale and a few tweaks, but I am not super confident and haven’t looked much into the history here.
EY claims this will fail and instead learn a utility function of “smiles”, resulting in an SI which tiles the future light-cone of Earth with tiny molecular smiley-faces, in a paper literally titled “Complex Value Systems are Required to Realize Valuable Futures”.
This is really misunderstanding what Eliezer is saying here,
Really? Ok let’s break down phrase by phrase; tell me exactly where I am misunderstanding:
Did EY claim Hibbard’s plan will succeed or fail?
Did EY claim Hibbard’s plan will result in tiling the future light-cone of earth with tiny molecular smiley-faces?
Were these claims made in a paper titled “Complex Value Systems are Required to Realize Valuable Futures”?
look, from my perspective it’s been a decade of explaining to people almost once every two weeks that “yes, the AI will of course know what you care about, but it won’t care”, so you claiming that this is somehow a new claim related to the deep learning revolution seems completely crazy to me,
I’ve been here since the beginning, and I’m not sure whom you have been explaining that to, but it certainly was not me. And where did I claim this is something new related to deep learning?
I’m going to try to clarify this one last time. There are several different meanings of “learn human values”:
1.) Training a machine learning model to learn to recognize happiness and unhappiness in human facial expressions, human voices and human body language, and using that as the utility function of the AI, such that it hopefully cares about human happiness. This is Hibbard’s plan from 2001 - long before DL. This model is trained before the AI becomes even human-level intelligent, and used as its initial utility/reward function.
2.) An AGI internally automatically learning human values as part of learning a model of the world—which would not automatically result in it caring about human values at all.
You keep confusing 1 and 2 - specifically you are confusing arguments concerning 2 directed at laypeople with Hibbard’s type 1 proposal.
Hibbard doesn’t believe that 2 will automatically work. Instead he is arguing for 1, and EY is saying that will fail. (And for the record, although EY’s criticism is overconfident, I am not optimistic about Hibbard’s plan as stated, but that was 2001)
He really really is not talking about the AI being too dumb to learn the value function the human is trying to get it to learn. Indeed, I still have no idea how you are reading that into the quoted passages.
Because I’m not?
To turn that capacity into actual safety, we have to program the AI at the outset — before it becomes too fast, powerful, or complicated to reliably control — to already care about making its future self care about safety. That means we have to understand how to code safety. We can’t pass the entire buck to the AI, when only an AI we’ve already safety-proofed will be safe to ask for help on safety issues!
Hibbard is attempting to make his AI care about safety at the outset (or at least happiness, which is his version thereof); he’s not trying to pass the entire buck to the AI.
Will respond more later, but maybe this turns out to be the crux:
Hibbard is attempting to make his AI care about safety at the outset (or at least happiness, which is his version thereof)
But “happiness” is not safety! That’s the whole point of this argument. If you optimize for your current conception of “happiness” you will get some kind of terrible thing that doesn’t remotely capture your values, because your values are fragile and you can’t approximate them by the process of “I just had my AI interact with a bunch of happy people and gave it positive reward, and a bunch of sad people and gave it negative reward”.
But “happiness” is not safety! That’s the whole point of this argument. If you optimize for your current conception of “happiness” you will get some kind of terrible thing
There are 2 separate issues here:
Would Hibbard’s approach successfully learn a stable, robust concept of human happiness suitable for use as the reward/utility function of AGI?
Conditional on 1, is ‘happiness’ what we actually want?
The answer to 2 depends largely on how one defines happiness, but if happiness includes satisfaction (i.e. empowerment, curiosity, self-actualization, etc.—the basis of fun), then it is probably sufficient, but that’s not the core argument.
Notice that EY does not assume 1 and argue 2, he instead argues that Hibbard’s approach doesn’t learn a robust concept of happiness at all and instead learns a trivial superficial “maximize faciness” concept instead.
This is crystal clear and unambiguous:
When I suggested to Hibbard that the upshot of building superintelligences with a utility function of “smiles” would be to tile the future light-cone of Earth with tiny molecular smiley-faces, he replied (Hibbard 2006):
He describes the result as a utility function of smiles, not a utility function of happiness.
So no, EY’s argument here is absolutely not about happiness being insufficient for safety. His argument is that happiness is incredibly complex and hard to learn a robust version of, and therefore Hibbard’s simplistic approach will learn some stupid superficial ‘faciness’ concept rather than happiness.
See also current debates around building a diamond-maximizing AI, where there is zero question of whether diamondness is what we want, and all the debate is around the (claimed) incredible difficulty of learning a robust version of even something simple like diamondness.
I think I am more interested in you reading The Genie Knows but Doesn’t Care and then having you respond to the things in there than the Hibbard example, since that post was written (as far as I can tell) to address common misunderstandings of the Hibbard debate (given that it was linked by Robby in a bunch of the discussion there after it was written).
I think there are some subtle things here. I think Eliezer!2008 would agree that AIs will totally learn a robust concept for “car”. But I think neither Eliezer!2008 nor I currently would think that current LLMs have any chance of learning a robust concept for “happiness” or “goodness”, in substantial part because I don’t have a robust concept of “happiness” or “goodness”, and before the AI refines those concepts further than I can, I sure expect it to be able to disempower me (though it’s not like guaranteed that that will happen).
What Eliezer is arguing against is not that the AI will not learn any human concepts. It’s that there are a number of human concepts that tend to lean on the whole ontological structure of how humans think about the world (like “low-impact” or “goodness” or “happiness”), such that in order to actually build an accurate model of those, you have to do a bunch of careful thinking and need to really understand how humans view the world, and that people tend to be systematically optimistic about how convergent these kinds of concepts are, as opposed to them being contingent on the specific ways humans think.
My guess is an AI might very well spend sufficient cycles on figuring out human morality after it has access to a solarsystem level of compute, but I think that is unlikely to happen before it has disempowered us, so the ordering here matters a lot (see e.g. my response to Zack above).
So I think there are three separate points here that I think have caused confusion and probably caused us to talk past each other for a while, all of which I think were things that Eliezer was thinking about, at least around 2013-2014 (I know less about 2008):
Low-powered AI systems will have a really hard time learning high-level human concepts like “happiness”, and if you try to naively get them to learn that concept (by e.g. pointing them towards smiling humans) you will get some kind of abomination, since even humans have trouble with those kinds of concepts
It is likely that by the time an AI will understand what humans actually really want, we will not have much control over its training process, and so despite it now understanding those constraints, we will have no power to shape its goals towards that
Even if we and the AI had a very crisp and clear concept of a goal I would like the AI to have, humanity won’t know how to actually cause the AI to point towards that as a goal (see e.g. the diamond maximizer problem)
To now answer your concrete questions:
Would Hibbard’s approach successfully learn a stable, robust concept of human happiness suitable for use as the reward/utility function of AGI
My first response to this is: “I mean, of course not at current LLM capabilities. Ask GPT-3 about happiness, and you will get something dumb and incoherent back. If you keep going and make more capable systems try to do this, it’s pretty likely your classifier will be smart enough to kill you to have more resources to drive the prediction error downwards before it actually arrives at a really deep understanding of human happiness (which appears to require substantial superhuman abilities, given that humans do not have a coherent model of happiness themselves)”
So no, I don’t think Hibbard’s approach would work. Separately, we have no idea how to use a classifier as a reward/utility function for an AGI, so that part of the approach also wouldn’t work. Like, what do you actually concretely propose we do after we have a classifier over video frames that causes a separate AI to then actually optimize for the underlying concept boundary?
But even if you ignore both of these problems, and you avoid the AI killing you in pursuit of driving down prediction error, and you somehow figure out how to take a classifier and use it as a utility function, then you are still not in good shape, because the AI will likely be able to achieve lower prediction error by modeling the humans doing the labeling process of the data you provide, and modeling what errors they are actually making, and will learn the more natural concept of “things that look happy to humans” instead of the actual happiness concept.
This is a really big deal, because if you start giving an AI the “things that look happy to humans” concept, you will end up with an AI that gets really good at deceiving humans and convincing them that something is happy, which will both quickly involve humans getting fooled and disempowered, and then in the limit might produce something surprisingly close to a universe tiled in smiley faces (convincing enough such that if you point a video camera at it, the rater who was looking at it for 15 seconds would indeed be convinced that it was happy, though there are no raters around).
I think Hibbard’s approach fails for all three reasons that I listed above, and I don’t think modern systems somehow invalidate any of those three reasons. I do think (as I have said in other comments) that modern systems might make indirect normativity approaches more promising, but I don’t think it moves the full value-loading problem anywhere close to the domain of solvability with current systems.
I think I am more interested in you reading The Genie Knows but Doesn’t Care and then having you respond to the things in there than the Hibbard example, since that post was written (as far as I can tell) to address common misunderstandings of the Hibbard debate
Looking over that it just seems to be a straightforward extrapolation of EY’s earlier points, so I’m not sure why you thought it was especially relevant.
Low-powered AI systems will have a really hard time learning high-level human concepts like “happiness”, and if you try to naively get them to learn that concept (by e.g. pointing them towards smiling humans) you will get some kind of abomination, since even humans have trouble with those kinds of concepts
Yeah—this is his core argument against Hibbard. I think Hibbard 2001 would object to ‘low-powered’, and would probably have other objections I’m not modelling, but regardless I don’t find this controversial.
It is likely that by the time an AI will understand what humans actually really want, we will not have much control over its training process, and so despite it now understanding those constraints, we will have no power to shape its goals towards that.
Yeah, in agreement with what I said earlier:
Notice I said “before it killed us”. Sure the AI may learn detailed models of humans and human values at some point during its superintelligent FOOMing, but that’s irrelevant because we need to instill its utility function long before that.
...
Even if we and the AI had a very crisp and clear concept of a goal I would like the AI to have, humanity won’t know how to actually cause the AI to point towards that as a goal
I believe I know what you meant, but this seems somewhat confused as worded. If we can train an ML model to learn a very crisp clear concept of a goal, then having the AI optimize for this (point towards it) is straightforward. Long term robustness may be a different issue, but I’m assuming that’s mostly covered under “very crisp clear concept”.
The issue of course is that what humans actually want is complex for us to articulate, let alone formally specify. The update since 2008/2011 is that DL may be able to learn a reasonable proxy of what we actually want, even if we can’t fully formally specify it.
which appears to require substantial superhuman abilities, given that humans do not have a coherent model of happiness themselves)”
I think this is something of a red herring. Humans can reasonably predict the utility functions of other humans in complex scenarios simply by simulating the other as self—i.e. through empathy. Also, happiness probably isn’t the correct thing—we probably want the AI to optimize for our empowerment (future optionality), but that’s a whole separate discussion.
So no, I don’t think Hibbard’s approach would work.
Sure, neither do I.
Separately, we have no idea how to use a classifier as a reward/utility function for an AGI, so that part of the approach also wouldn’t work.
A classifier is a function which maps high-D inputs to a single categorical variable, and a utility function just maps some high-D input to a real number, but a k-categorical variable is just the explicit binned model of a log2(k)-bit number, so these really aren’t that different, and there are many interpolations between (and in fact sometimes it’s better to use the more expensive categorical model for regression).
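The classifier/utility correspondence can be sketched concretely: treat the k classes as ordered bins and read off the expected bin value under the classifier’s output distribution. This is purely illustrative; the class values and scores below are made up, not anything from Hibbard’s proposal.

```python
import math

def softmax(logits):
    """Turn raw classifier scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classifier_as_utility(logits, bin_values):
    """Read a k-way classifier over ordered bins as a scalar utility:
    the expected bin value under the predicted class distribution.
    This is the 'k-categorical variable as a binned number' point."""
    probs = softmax(logits)
    return sum(p * v for p, v in zip(probs, bin_values))

# Hypothetical 5-class "happiness" classifier; classes map to
# utilities from -2 (very unhappy) to +2 (very happy).
logits = [0.1, 0.5, 2.0, 0.3, -1.0]
bins = [-2.0, -1.0, 0.0, 1.0, 2.0]
utility = classifier_as_utility(logits, bins)  # a scalar in [-2, 2]
```

Going the other direction (binning a regression target into k classes and predicting the full distribution) is the “more expensive categorical model for regression” trade-off.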
Like, what do you actually concretely propose we do after we have a classifier over video frames
Video frames? The utility function needs to be over future predicted world states... which you could, I guess, use to render out videos, but text renderings are probably more natural.
I propose we actually learn how the brain works, and how evolution solved alignment, to better understand our values and reverse engineer them. That is probably the safest approach—having a complete understanding of the brain.
However, I’m also somewhat optimistic on theoretical approaches that focus more explicitly on optimizing for external empowerment (which is simpler and more crisp), and how that could be approximated pragmatically with current ML approaches. Those two topics are probably my next posts.
Hmm, but I don’t understand what relevance it has to alignment. The problem was never that the AI won’t learn human values, it’s that the AI won’t care about human values. Of course a super intelligent AI will have a good model of human values, the same way it will have a good model of engineering, chemistry, the ecological environment and physics. That doesn’t mean it will do things that are aligned with its accurate model of human values.
I am not sure who thought that learning such models was much harder than it turned out to be. It seems clear that an AI will learn what human faces are before the AI is very dangerous to the world. It would have been extremely surprising to have a dangerous AGI incapable of learning what human faces are like.
Around the time of the sequences (long before DL) it was much less obvious that AI could/would learn accurate models of complex human values before it killed us, so that very much was believed to be part of the problem (at least by the EY/MIRI/LW/etc crowd).
But that’s all now mostly irrelevant—an altruistic AI probably doesn’t even need to know or care about human values at all, as it can simply optimize for our empowerment—our future optionality or ability to do anything we want. (some previous discussion here. and in these comments. )
I wasn’t that active around the time of the sequences, but I had a good number of discussions with people, and the point “the AI will of course know what your values are, it just won’t care” was made many times, and I am also pretty sure was made in the sequences (I would have to dig it up, and am on my phone, but I heard that sentence in spoken conversation a lot over the years).
I don’t think “empowerment” is the kind of concept that particularly survives heavy optimization pressure, though it seems worth investigating.
Notice I said “before it killed us”. Sure the AI may learn detailed models of humans and human values at some point during its superintelligent FOOMing, but that’s irrelevant because we need to instill its utility function long before that. See my reply here, this is well documented, and no amount of vague memories of conversations trump the written evidence.
I’m not entirely sure what people mean when they say “X won’t survive heavy optimization pressure”—but for example the objective of modern diffusion models survives heavy optimization pressure.
External empowerment is very simple and it doesn’t even require detailed modeling of the agent—they can just be a black box that produces outputs. I’m curious what you think is an example of “the kind of concept that particularly survives heavy optimization pressure”.
Basically, it’s Goodhart’s law in action, where optimizing a proxy more and more destroys what you value.
Oh—empowerment is about as immune to Goodharting as you can get, and that’s perhaps one of its major advantages[1]. However in practice one has to use some approximation, which may or may not be goodhartable to some degree depending on many details.
Empowerment is vastly more difficult to Goodhart than a corporation optimizing for some bundle of currencies (including crypto), much more difficult to Goodhart than optimizing for control over even more fundamental physical resources like mass and energy, and is generally the least-Goodhartable objective that could exist. In some sense the universal version of Goodharting—properly defined—is just a measure of deviation from empowerment. It is the core driver of human intelligence, and for good reason.
Can you explain further? This seems to me like a very large claim that, if true, would have a big impact, but I’m not sure how you got the immunity-to-Goodhart result you have here.
This applies to Regressional, Causal, Extremal and Adversarial Goodhart.
Empowerment could be defined as the natural unique solution to Goodharting. Goodharting is the divergence under optimization scaling between trajectories resulting from the difference between a utility function and some proxy of that utility function.
However due to instrumental convergence, the trajectories of all reasonable agent utility functions converge under optimization scaling—and empowerment simply is that which they converge to.
In other words the empowerment of some agent P(X) is the utility function which minimizes trajectory distance to all/any reasonable agent utility functions U(X), regardless of their specific (potentially unknown) form.
Therefore empowerment is—by definition—the best possible proxy utility function (under optimization scaling).
Let’s apply some quick examples:
Under scaling, an AI with some crude Hibbard-style happiness approximation will first empower itself and then eventually tile the universe with smiling faces (according to EY), or perhaps more realistically—with humans bio-engineered for docility, stupidity, and maximum bliss. Happiness alone is not the true human utility function.
Under scaling, an AI with some crude stock-value maximizing utility function will first empower itself and then eventually cause hyperinflation of the reference currencies defining the stock price. Stock value is not the true utility function of the corporation.
Under scaling, an AI with a human empowerment utility function will first empower itself, and then empower humanity—maximizing our future optionality and ability to fulfill any unknown goals/values, while ensuring our survival (because death is the minimally empowered state). This works because empowerment is pretty close to the true utility function of intelligent agents due to convergence, or at least the closest universal proxy. If you strip away a human’s drives for sex, food, child tending and simple pleasures, most of what remains is empowerment-related (manifesting as curiosity, drive, self-actualization, fun, self-preservation, etc).
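For intuition on what “empowerment” measures here, one standard information-theoretic formulation is the channel capacity between an agent’s action sequences and the resulting future states; in a deterministic environment this reduces to the log of the number of distinct reachable states (the agent’s “future optionality”). Below is a minimal sketch in a toy gridworld. All names and numbers are illustrative assumptions, not anything from the discussion above.

```python
import math
from itertools import product

def n_step_empowerment(state, step, actions, n):
    """n-step empowerment in a deterministic environment: log2 of the
    number of distinct states reachable by some n-step action sequence
    (the capacity of a noiseless channel from actions to outcomes)."""
    reachable = set()
    for seq in product(actions, repeat=n):
        s = state
        for a in seq:
            s = step(s, a)
        reachable.add(s)
    return math.log2(len(reachable))

def grid_step(s, a, size=5):
    """Moves on a 5x5 grid; stepping into a wall does nothing."""
    x, y = s
    dx, dy = {'U': (0, 1), 'D': (0, -1), 'L': (-1, 0), 'R': (1, 0)}[a]
    nx, ny = x + dx, y + dy
    return (nx, ny) if 0 <= nx < size and 0 <= ny < size else (x, y)

ACTIONS = ['U', 'D', 'L', 'R']
center = n_step_empowerment((2, 2), grid_step, ACTIONS, 2)
corner = n_step_empowerment((0, 0), grid_step, ACTIONS, 2)
# The center is more empowered than the corner: more futures stay open.
```

Note that death as “the minimally empowered state” falls out naturally: an absorbing state where every action sequence leads to the same single successor has empowerment log2(1) = 0.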
An AI with a good world model will predictably have a model of your values, but that’s different from being able to actually elicit that model via e.g. a series of labeled examples. That’s the part that seemed less plausible before DL.
Do you have a link to where Eliezer (or any other LW writer) said that? I don’t myself recall whether they said that.
I may be exaggerating a tiny tiny bit with the “before it killed us” modifier, and I don’t have time to search for this specific needle—but EY famously criticized some early safety proposal which consisted of using a ‘smiling face’ detector somehow to train an AI to recognize human happiness, and then optimize for that.
Oh it was actually already open in a tab:
From complex values blah blah blah:
EY’s counterargument is that human values are much more complex than happiness—let alone smiles; an AI optimizing for smiles just ends up tiling the universe with smile icons—so it’s just a different flavour of paperclip maximizer. Then he spends a bunch of words on the complexity of value stuff to preempt the more complex versions of the smile detector. If human values were known to be simple, then getting machines to learn them robustly would likely be simple, and EY could have done something else with those 20+ years.
Also in EY’s model when the AI becomes superintelligent (which may only take a day or something after it becomes just upper human level intelligent and ‘rewrites its source code’), it then quickly predicts the future, realizes humans are in the way, solves Drexler-style strong nanotech, and then kills us all. Those latter steps are very fast.
I don’t know what relevance this has to the discussion at hand. A deep learning model trained on human smiling faces might indeed very well tile the universe with smiley-faces, I don’t understand why that’s wrong. Sure, it will likely do something weirder and less predictable, we don’t understand the neural network prior very well, but optimizing for smiling humans still doesn’t produce anything remotely aligned.
Nothing in the quoted section, or in the document you linked that I just skimmed includes anything about the AI not being able to learn what the things behind the smiling faces actually want. Indeed none of that matters, because the AI has no reason to care. You gave it a few thousand to a million samples of smiling, and now the system is optimizing for smiling, you got what you put in.
Eliezer indeed explicitly addresses this point and says:
He is explicitly saying “Hibbard is confusing being ‘smart’ with ‘caring about the right things’”, the AI will be plenty capable of realizing that it isn’t doing what you wanted it to, but it just doesn’t care. Being smarter does not help with getting it to do the thing you want, that’s the whole point of the alignment problem. Similarly AIs being able to understand human values better just doesn’t help you that much with pointing at them (though it does help a bit, but the linked article just doesn’t talk at all about this).
That is not what Hibbard actually proposed, it’s a superficial strawman version.
Hibbard claims we can design intelligent machines which love humans, by training them to learn human happiness through facial expressions, voices, and body language.
EY claims this will fail and instead learn a utility function of “smiles”, resulting in an SI which tiles the future light-cone of Earth with tiny molecular smiley-faces, in a paper literally titled “Complex Value Systems are Required to Realize Valuable Futures”
It has absolutely nothing to do with whether the AI could eventually learn human values (“the things behind the smiling faces actually want”), and everything to do with whether some ML system could learn said values to use them as the utility function for the AI (which is what Hibbard is proposing).
Neither Hibbard, EY, nor I am arguing about or discussing whether an SI can learn human values.
This is really misunderstanding what Eliezer is saying here, and also like, look, from my perspective it’s been a decade of explaining to people almost once every two weeks that “yes, the AI will of course know what you care about, but it won’t care”, so you claiming that this is somehow a new claim related to the deep learning revolution seems completely crazy to me, so I am experiencing a good amount of frustration with you repeatedly saying things in this comment thread like “irrefutable proof”, when it’s just like an obviously wrong statement (though like a fine one to arrive at when just reading some random subset of Eliezer’s writing, but a clearly wrong summary nevertheless).
Now to go back to the object level:
Eliezer is really not saying that the AI will fail to learn that there is something more complicated than smiles that the human is trying to point it to. He is explicitly saying “look, you won’t know what the AI will care about after giving it on the order of a million points. You don’t know what the global maximum of the simplest classifier for your sample set is, and very likely it will be some perverse instantiation that has little to do with what you originally cared about”.
He really really is not talking about the AI being too dumb to learn the value function the human is trying to get it to learn. Indeed, I still have no idea how you are reading that into the quoted passages.
Here is a post from 9 years ago, where the title is that exact point, written by Rob Bensinger who was working at MIRI at the time, with Eliezer as the top comment:
The genie knows, but doesn’t care
I encourage you to read some of the comments by Rob in that thread, which very clearly and unambiguously point to the core problem of “the difficult part is to get the AI to care about the right thing, not to understand the right thing”, all before the DL revolution.
I think this is much more ambiguous than you’re making it out to be. In 2008’s “Magical Categories”, Yudkowsky wrote:
I claim that this paragraph didn’t age well in light of the deep learning revolution: “running a neural network [...] over a set of winning and losing sequences of chess moves” basically is how AlphaZero learns from self-play! As the Yudkowsky quote illustrates, it wasn’t obvious in 2008 that this would work: given what we knew before seeing the empirical result, we could imagine that we lived in a “computational universe” in which the neural network’s generalization from “self-play games” to “games against humans or traditional chess engines” worked less well than it did in the actual computational universe.
Yudkowsky continued:
This would seem to contradict “of course the AI will know, but it won’t care”? “The real problem [...] is one of communication” seems to amount to the claim that the AI won’t care because it won’t know: if you can’t teach “goodness” from labeled data, your AI will search for plans high in something-other-than-goodness, which will kill you at sufficiently high power levels.
But if it turns out that you can teach goodness from labeled data—or at least, if you can get a much better approximation than one might have thought possible in 2008—that would seem to present a different strategic picture. (I’m not saying alignment is easy and I’m not saying humanity is going to survive, but we could die for somewhat different reasons than some blogger thought in 2008.)
I do think these are better quotes. It’s possible that there was some update here between 2008 and 2013 (roughly when I started seeing the more live discussion happening), since I do really remember the “the problem is not getting the AI to understand, but to care” as a common refrain even back then (e.g. see the Robby post I linked).
I agree that this paragraph aged less well than other paragraphs, though I do think this paragraph is still correct (Edit: Eh, it might be wrong, depends a bit on how much neural networks in the 50s are the same as today). It did sure turn out to be correct by a narrower margin than Eliezer probably thought at the time, but my sense is it’s still not the case that we can train a straightforward neural net on winning and losing chess moves and have it generate winning moves. For AlphaGo, the Monte Carlo Tree Search was a major component of its architecture, and the follow-up systems were trained by pure self-play.
But in any case, I think your basic point of “Eliezer did not predict the Deep Learning revolution as it happened” here is correct, though I don’t think this specific paragraph has a ton of relevance to the discussion at hand.
I do think this paragraph seems like a decent quote, though I think at this point it makes sense to break it out into different pieces.
I think Eliezer is saying that what matters is whether we can point the AI to what we care about “during its childhood”, i.e. during relatively early training, before it has already developed a bunch of proxy training objectives.
I think the key question about the future that I think Eliezer was opining on, is then whether by the time we expect AIs to actually be able to have a close-to-complete understanding of what we mean by “goodness”, we still have any ability to shape their goals.
My model is that indeed, Eliezer was surprised, as I think most people were, that AIs of 2022 are as good at picking up complicated concept boundaries and learning fuzzy human concepts as they are, while still being quite incompetent at many other tasks. However, I think the statement of “AIs of 2022 basically understand goodness, or at least will soon enough understand goodness while we are still capable of meaningfully changing their goals” strikes me as very highly dubious, and I think the basic arguments for thinking that this capability will come after the AI has reached a capability level where we have little ability to shape its goals still seem correct to me, and like, one of the primary reasons for doom.
The reason why it still seems substantially out of AIs reach, is because our values do indeed seem quite fragile and to change substantially on reflection, such that it’s currently out of the reach of even a very smart human to fully understand what we mean by “goodness”.
Eliezer talks about this in the comment section you linked (actually, a great comment section between Eliezer and Shane Legg that I found quite insightful to read and am glad to have stumbled upon):
I think this argument is basically correct, and indeed, while current systems definitely are good at having human abstractions, I don’t think they really are anywhere close to having good models of the results of our coherent extrapolated volition, which is what Eliezer is talking about here. (To be clear, I do also separately think that LLMs are thinking about concepts for reasons other than deceiving or modeling humans, though like, I don’t think this changes the argument very much. I don’t think LLMs care very much about thinking carefully about morality, because it’s not very useful for predicting random internet text.)
I think separately, there is a different, indirect normativity approach that starts with “look, yes, we are definitely not going to get the AI to understand what our ultimate values are before the end, but maybe we can get it to understand a concept like ‘being conservative’ or ‘being helpful’ in enough detail that we can use it to supervise smarter AI systems, and then bootstrap ourselves into an aligned superintelligence”.
And I think indeed that plan looks better now than it likely looked to Eliezer in 2008, but I do want to distinguish it from the things that Eliezer was arguing against at the time, which were not about learning approaches to indirect normativity, but were arguments about how the AI would just learn all of human values by being pointed at a bunch of examples of good things and bad things, which still strikes me as extremely unlikely.
AlphaGo without the MCTS was still pretty strong:
Even with just the SL-trained policy network, it could play at a solid amateur level:
I may be misunderstanding this, but it sounds like the network that did nothing but get good at guessing the next move in professional games was able to play at roughly the same level as Pachi, which, according to DeepMind, had a rank of 2d.
Yeah, I mean, to be clear, I do definitely think you can train a neural network to somehow play chess via nothing but classification. I am not sure whether you could do it with a feed forward neural network, and it’s a bit unclear to me whether the neural networks from the 50s are the same thing as the neural networks from 2000s, but it does sure seem like you can just throw a magic category absorber at chess and then have it play OK chess.
My guess is modern networks are not meaningfully more complicated, and the difference to back then was indeed just scale and a few tweaks, but I am not super confident and haven’t looked much into the history here.
Really? Ok let’s break down phrase by phrase; tell me exactly where I am misunderstanding:
Did EY claim Hibbard’s plan will succeed or fail?
Did EY claim Hibbard’s plan will result in tiling the future light-cone of earth with tiny molecular smiley-faces?
Were these claims made in a paper titled “Complex Value Systems are Required to Realize Valuable Futures”?
I’ve been here since the beginning, and I’m not sure who you have been explaining that to, but it certainly was not me. And where did I claim this is something new related to deep learning?
I’m going to try to clarify this one last time. There are several different meanings of “learn human values”:
1.) Training a machine learning model to learn to recognize happiness and unhappiness in human facial expressions, human voices and human body language, and using that as the utility function of the AI, such that it hopefully cares about human happiness. This is Hibbard’s plan from 2001 - long before DL. This model is trained before the AI becomes even human-level intelligent, and used as its initial utility/reward function.
2.) An AGI internally automatically learning human values as part of learning a model of the world—which would not automatically result in it caring about human values at all.
You keep confusing 1 and 2 - specifically you are confusing arguments concerning 2 directed at laypeople with Hibbard’s type 1 proposal.
Hibbard doesn’t believe that 2 will automatically work. Instead he is arguing for 1, and EY is saying that will fail. (And for the record, although EY’s criticism is overconfident, I am not optimistic about Hibbard’s plan as stated, but that was 2001)
Because I’m not?
Hibbard is attempting to make his AI care about safety at the outset (or at least happiness, which is his version thereof); he’s not trying to pass the entire buck to the AI.
Will respond more later, but maybe this turns out to be the crux:
But “happiness” is not safety! That’s the whole point of this argument. If you optimize for your current conception of “happiness” you will get some kind of terrible thing that doesn’t remotely capture your values, because your values are fragile and you can’t approximate them by the process of “I just had my AI interact with a bunch of happy people and gave it positive reward, and a bunch of sad people and gave it negative reward”.
There are 2 separate issues here:
1.) Would Hibbard’s approach successfully learn a stable, robust concept of human happiness suitable for use as the reward/utility function of AGI?
2.) Conditional on 1, is ‘happiness’ what we actually want?
The answer to 2 depends largely on how one defines happiness, but if happiness includes satisfaction (i.e. empowerment, curiosity, self-actualization, etc—the basis of fun), then it is probably sufficient, but that’s not the core argument.
Notice that EY does not assume 1 and argue 2, he instead argues that Hibbard’s approach doesn’t learn a robust concept of happiness at all and instead learns a trivial superficial “maximize faciness” concept instead.
This is crystal clear and unambiguous:
He describes the result as a utility function of smiles, not a utility function of happiness.
So no, EY’s argument here is absolutely not about happiness being insufficient for safety. His argument is that happiness is incredibly complex and hard to learn a robust version of, and therefore Hibbard’s simplistic approach will learn some stupid superficial ‘faciness’ concept rather than happiness.
See also current debates around building a diamond-maximizing AI, where there is zero question of whether diamondness is what we want, and all the debate is around the (claimed) incredible difficulty of learning a robust version of even something simple like diamondness.
I think I am more interested in you reading The Genie Knows but Doesn’t Care and responding to the things in there than in the Hibbard example, since that post was written (as far as I can tell) to address common misunderstandings of the Hibbard debate (given that it was linked by Robby in a bunch of the discussion there after it was written).
I think there are some subtle things here. I think Eliezer!2008 would agree that AIs will totally learn a robust concept for “car”. But I think neither Eliezer!2008 nor I currently would think that current LLMs have any chance of learning a robust concept for “happiness” or “goodness”, in substantial part because I don’t have a robust concept of “happiness” or “goodness” myself, and before the AI refines those concepts further than I can, I sure expect it to be able to disempower me (though that is of course not guaranteed to happen).
What Eliezer is arguing against is not that the AI will not learn any human concepts. It’s that there are a number of human concepts that tend to lean on the whole ontological structure of how humans think about the world (like “low-impact” or “goodness” or “happiness”), such that in order to actually build an accurate model of those, you have to do a bunch of careful thinking and need to really understand how humans view the world, and that people tend to be systematically optimistic about how convergent these kinds of concepts are, as opposed to them being contingent on the specific ways humans think.
My guess is an AI might very well spend sufficient cycles on figuring out human morality after it has access to a solar-system level of compute, but I think that is unlikely to happen before it has disempowered us, so the ordering here matters a lot (see e.g. my response to Zack above).
So I think there are three separate points here that have caused confusion and probably caused us to talk past each other for a while, all of which were things that Eliezer was thinking about, at least around 2013-2014 (I know less about 2008):
1.) Low-powered AI systems will have a really hard time learning high-level human concepts like “happiness”, and if you try to naively get them to learn that concept (e.g. by pointing them towards smiling humans) you will get some kind of abomination, since even humans have trouble with those kinds of concepts.
2.) It is likely that by the time an AI understands what humans actually really want, we will not have much control over its training process, and so despite it now understanding those constraints, we will have no power to shape its goals accordingly.
3.) Even if we and the AI had a very crisp and clear concept of a goal I would like the AI to have, humanity won’t know how to actually cause the AI to point towards that as a goal (see e.g. the diamond maximizer problem).
To now answer your concrete questions:
My first response to this is: “I mean, of course not at current LLM capabilities. Ask GPT-3 about happiness, and you will get something dumb and incoherent back. If you keep going and have more capable systems try to do this, it’s pretty likely your classifier will be smart enough to kill you to get more resources to drive the prediction error downwards before it actually arrives at a really deep understanding of human happiness (which appears to require substantially superhuman abilities, given that humans do not have a coherent model of happiness themselves).”
So no, I don’t think Hibbard’s approach would work. Separately, we have no idea how to use a classifier as a reward/utility function for an AGI, so that part of the approach also wouldn’t work. Like, what do you actually concretely propose we do after we have a classifier over video frames that causes a separate AI to then actually optimize for the underlying concept boundary?
But even if you ignore both of these problems, and you avoid the AI killing you in pursuit of driving down prediction error, and you somehow figure out how to take a classifier and use it as a utility function, then you are still not in good shape, because the AI will likely be able to achieve lower prediction error by modeling the humans doing the labeling of the data you provide, and modeling the errors they are actually making, and so will learn the more natural concept of “things that look happy to humans” instead of the actual happiness concept.
This is a really big deal, because if you give an AI the “things that look happy to humans” concept, you will end up with an AI that gets really good at deceiving humans and convincing them that something is happy. That will quickly involve humans getting fooled and disempowered, and in the limit it might produce something surprisingly close to a universe tiled in smiley faces (convincing enough that if you pointed a video camera at it, a rater looking at it for 15 seconds would indeed be convinced it was happy, even though there are no raters around).
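A toy numerical sketch of that failure mode (all features and weights below are made up for illustration, not from any actual system): if the learned classifier rewards a superficial “looks happy to a rater” feature rather than the intended one, an optimizer ascending that score can push it arbitrarily high while the “actually happy” feature never moves.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10

# Invented toy features: index 0 is "actually happy"; index 1 is a
# superficial "looks happy to a rater" cue the classifier latched onto.
w_true = np.zeros(d);  w_true[0] = 1.0    # the happiness concept we wanted
w_proxy = np.zeros(d); w_proxy[1] = 1.0   # the concept the classifier learned

x = 0.1 * rng.normal(size=d)              # initial world state
for _ in range(100):
    x += 0.1 * w_proxy                    # optimizer ascends the proxy score

proxy_score = float(w_proxy @ x)          # grows without bound
true_score = float(w_true @ x)            # stays where it started
```

The point of the sketch is only the decoupling: nothing in the optimization loop ever touches the feature we actually cared about.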
I think Hibbard’s approach fails for all three reasons that I listed above, and I don’t think modern systems somehow invalidate any of those three reasons. I do think (as I have said in other comments) that modern systems might make indirect normativity approaches more promising, but I don’t think it moves the full value-loading problem anywhere close to the domain of solvability with current systems.
Looking over that it just seems to be a straightforward extrapolation of EY’s earlier points, so I’m not sure why you thought it was especially relevant.
Yeah—this is his core argument against Hibbard. I think Hibbard 2001 would object to ‘low-powered’, and would probably have other objections I’m not modelling, but regardless I don’t find this controversial.
Yeah, in agreement with what I said earlier:
...
I believe I know what you meant, but this seems somewhat confused as worded. If we can train an ML model to learn a very crisp clear concept of a goal, then having the AI optimize for this (point towards it) is straightforward. Long term robustness may be a different issue, but I’m assuming that’s mostly covered under “very crisp clear concept”.
The issue of course is that what humans actually want is complex for us to articulate, let alone formally specify. The update since 2008/2011 is that DL may be able to learn a reasonable proxy of what we actually want, even if we can’t fully formally specify it.
I think this is something of a red herring. Humans can reasonably predict the utility functions of other humans in complex scenarios simply by simulating the other as self—i.e. through empathy. Also, happiness probably isn’t the correct thing; we probably want the AI to optimize for our empowerment (future optionality), but that’s a whole separate discussion.
Sure, neither do I.
A classifier is a function which maps high-D inputs to a single categorical variable, and a utility function just maps some high-D input to a real number, but a k-categorical variable is just the explicit binned model of a log(k)-bit number, so these really aren’t that different, and there are many interpolations between them (and in fact sometimes it’s better to use the more expensive categorical model for regression).
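To make the “binned real number” point concrete, here is a minimal sketch (the bin count and value range are arbitrary choices for illustration, not from any particular system): a k-way softmax over value bins induces a scalar utility as the expectation of the bin centers, which is the standard trick distributional approaches use to turn a categorical head into regression.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical setup: 21 bins covering utilities in [-1, 1].  A k-way
# classifier over these bins yields a scalar utility as the expectation
# of the bin centers under the softmax distribution.
k = 21
centers = np.linspace(-1.0, 1.0, k)

def utility_from_logits(logits):
    return float(softmax(logits) @ centers)

confident_high = np.zeros(k); confident_high[-1] = 10.0  # mass on top bin
u_high = utility_from_logits(confident_high)             # close to 1.0
u_flat = utility_from_logits(np.zeros(k))                # uniform: close to 0
```

Sharpening the categorical distribution interpolates smoothly between “classifier output” and “real-valued utility”, which is the sense in which the two really aren’t that different.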
Video frames? The utility function needs to be over future predicted world states, which you could I guess use to render out videos, but text renderings are probably more natural.
I propose we actually learn how the brain works, and how evolution solved alignment, to better understand our values and reverse engineer them. That is probably the safest approach—having a complete understanding of the brain.
However, I’m also somewhat optimistic on theoretical approaches that focus more explicitly on optimizing for external empowerment (which is simpler and more crisp), and how that could be approximated pragmatically with current ML approaches. Those two topics are probably my next posts.