For at least about ten years, in my experience, people in this community have been saying that the main problem isn’t getting the AI to understand human values, it’s getting the AI to have human values. Unfortunately the phrase “learn human values” is sometimes used to mean “have human values” and sometimes used to mean “understand human values”, hence the confusion.
To have human values the AI needs to either learn them or have them instilled. EY’s complexity/fragility of human values argument is directed against early proposals for learning human values for the AI’s utility function. Obviously at some point a powerful AI will learn a model of human values somewhere in its world model, but that is irrelevant, because it doesn’t affect the AI’s utility function and it comes far too late: the AI needs a robust model of human values well before it becomes superhuman.
Katja’s point is valid: DL did not fail in the way EY predicted, and the success of DL gives hope that we can learn superhuman models of human values to steer developing AI (which, again, is completely unrelated to the AI later learning human values somewhere in its world model).
I agree Eliezer is wrong, though that’s not enough to ensure success. In particular, you need to avoid inner alignment issues like deceptive alignment, where it learns values very well only for instrumental convergence reasons, and once it’s strong, it overthrows the humans and pursues whatever terminal goal it has.
Sim boxing can solve deceptive alignment (and may be the only viable solution).

I agree that boxing is at least a first step, so that it doesn’t get more compute, or worse, FOOM.
The tricky problem is that we need to be able to train away a deceptive AI, or forbid deception entirely, without the deception merely being obfuscated so that it looks trained away.
This is why we need to move beyond the black box paradigm, and why strong interpretability tools are necessary.
>DL did not fail in the way EY predicted,

Where’s the link for that prediction? I think there’s more than one example of critics putting words in his mouth and then citing a place where he says something manifestly different.

Here’s a post from 2008, where he says the following:
As a matter of fact, if you use the right kind of neural network units, this “neural network” ends up exactly, mathematically equivalent to Naive Bayes. The central unit just needs a logistic threshold—an S-curve response—and the weights of the inputs just need to match the logarithms of the likelihood ratios, etcetera. In fact, it’s a good guess that this is one of the reasons why logistic response often works so well in neural networks—it lets the algorithm sneak in a little Bayesian reasoning while the designers aren’t looking.
Just because someone is presenting you with an algorithm that they call a “neural network” with buzzwords like “scruffy” and “emergent” plastered all over it, disclaiming proudly that they have no idea how the learned network works—well, don’t assume that their little AI algorithm really is Beyond the Realms of Logic. For this paradigm of adhockery, if it works, will turn out to have Bayesian structure; it may even be exactly equivalent to an algorithm of the sort called “Bayesian”.
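To make the quoted equivalence concrete: for a two-class Naive Bayes model over binary features, a single logistic unit whose input weights are the log likelihood ratios (with the prior odds folded into the bias) reproduces the Naive Bayes posterior exactly. A minimal numerical sketch, where the model, its parameters, and the function names are made up purely for illustration:

```python
import numpy as np

# Illustrative two-class Naive Bayes model with binary features.
rng = np.random.default_rng(0)
n_features = 5
prior = 0.3                              # P(C=1)
p1 = rng.uniform(0.1, 0.9, n_features)   # P(x_i=1 | C=1)
p0 = rng.uniform(0.1, 0.9, n_features)   # P(x_i=1 | C=0)

def naive_bayes_posterior(x):
    """P(C=1 | x), computed the explicitly Bayesian way."""
    like1 = prior * np.prod(np.where(x == 1, p1, 1 - p1))
    like0 = (1 - prior) * np.prod(np.where(x == 1, p0, 1 - p0))
    return like1 / (like1 + like0)

# The same model as a single sigmoid "neuron": the weights are log
# likelihood ratios, and the bias absorbs the prior odds and the x_i=0 terms.
w = np.log(p1 / p0) - np.log((1 - p1) / (1 - p0))
b = np.log(prior / (1 - prior)) + np.sum(np.log((1 - p1) / (1 - p0)))

def logistic_unit(x):
    """P(C=1 | x), computed as sigmoid(w.x + b)."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

for _ in range(1000):
    x = rng.integers(0, 2, n_features)
    assert np.isclose(naive_bayes_posterior(x), logistic_unit(x))
print("The sigmoid unit reproduces the Naive Bayes posterior on every sample.")
```

This covers only the two-class, binary-feature case, but it is the sense in which the quoted “scruffy” sigmoid unit sneaks in exactly Bayesian structure.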
In a discussion from 2010, he’s offered the chance to say that he doesn’t think the machine learning of the time could produce AGI even with a smarter approach, and he appears to pull back from saying that:
But if we’re asking about works that are sort of billing themselves as ‘I am Artificial General Intelligence’, then I would say that most of that does indeed fail immediately and indeed I cannot think of a counterexample which fails to fail immediately, but that’s a sort of extreme selection effect, and it’s because if you’ve got a good partial solution, or solution to a piece of the problem, and you’re an academic working in AI, and you’re anything like sane, you’re just going to bill it as plain old AI, and not take the reputational hit from AGI. The people who are bannering themselves around as AGI tend to be people who think they’ve solved the whole problem, and of course they’re mistaken. So to me it really seems like to say that all the things I’ve read on AGI immediately fundamentally fail is not even so much a critique of AI as rather a comment on what sort of more tends to bill itself as Artificial General Intelligence.
The context should make it clear I was not talking about an explicit prediction. See this comment for more explication.
I said:
EY’s complexity/fragility of human values argument is directed against early proposals for learning human values for the AI’s utility function.
This is obviously true and beyond debate; see the quotes in my linked comment from EY’s “Complex Value Systems are Required to Realize Valuable Futures”, where he critiques Hibbard’s proposal to give AI a reward function which “learns to recognize happiness and unhappiness in human facial expressions, human voices and human body language”.
Then I said:
Katja’s point is valid: DL did not fail in the way EY predicted, and the success of DL gives hope that we can learn superhuman models of human values to steer developing AI
Where Katja’s point is that DL had no trouble learning concepts of faces (and many other things) to superhuman levels, without inevitably failing by instead only producing superficial simulacra of faces when we cranked up the optimization power. I was not referring to any explicit prediction, but the implicit prediction in Katja’s analogy (where learning a complex 3D generative model of human faces from images is the analogy for learning a complex multi-modal model of human happiness from face images, voices, body language, etc.).
That’s clearly exactly what it does today? It seems I disagree with your point on a more basic level than expected.

ETA: It only takes one positive example of AI not failing by producing superficial simulacra of faces to prove my point, which Katja already provided. It doesn’t matter how many crappy AI models people make, as they lose out to stronger models.
Maybe I don’t understand the point of this example in which AI creates non-conscious images of smiling faces. Are you really arguing that, based on evidence like this, a generalization of modern AI wouldn’t automatically produce horrific or deadly results when asked to copy human values?
Peripherally: that video contains simulacra of a lot more than faces, and I may have other minor objections in that vein.
ETA, I may want to say more about the actual human analysis which I think informed the AI’s “success,” but first let me go back to what I said about linking EY’s actual words. Here is 2008-Eliezer:
Now you, finally presented with a tiny molecular smiley—or perhaps a very realistic tiny sculpture of a human face—know at once that this is not what you want to count as a smile. But that judgment reflects an unnatural category, one whose classification boundary depends sensitively on your complicated values. It is your own plans and desires that are at work when you say “No!”
Hibbard knows instinctively that a tiny molecular smileyface isn’t a “smile”, because he knows that’s not what he wants his putative AI to do. If someone else were presented with a different task, like classifying artworks, they might feel that the Mona Lisa was obviously smiling—as opposed to frowning, say—even though it’s only paint.
Hibbard proposes that we can learn a model of ‘happiness’ from images of smiling humans, body language, voices, etc., and then instill that as the reward/utility function for AI.
EY replies that this will fail because our values (like happiness) are far too complex and fragile to be learned robustly by such a procedure, and that the result instead is an AI which optimizes for a different unintended goal: ‘faciness’.
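For concreteness, here is a toy sketch of the general shape of such a proposal; everything in it (the stand-in sensor features, the labels, the logistic scorer, the candidate search) is a hypothetical illustration, not anything specified by Hibbard or EY: first learn a “happiness” scorer from labeled human data, then hand the learned scorer to an optimizer as its reward signal.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

# Stand-in "sensor features" (think face / voice / body-language descriptors)
# paired with human-provided happy/unhappy labels.
X = rng.normal(size=(500, 8))
true_w = rng.normal(size=8)
y = (X @ true_w > 0).astype(float)

# Step 1: learn a simple logistic "happiness" scorer from the labeled data.
w = np.zeros(8)
for _ in range(2000):
    p = sigmoid(X @ w)
    w += 0.1 * X.T @ (y - p) / len(y)

# Step 2 (the step under debate): use the learned scorer as the reward
# function and search hard for whatever maximizes it. Anything the proxy
# scores highly now counts as "happiness", whether or not it resembles
# what the labelers actually cared about.
candidates = rng.normal(size=(10_000, 8))
scores = sigmoid(candidates @ w)
print("proxy 'happiness' of the optimizer's favorite candidate:", scores.max())
```

The disagreement in the thread is essentially about step 1: whether the learned model of ‘happiness’ can be made accurate and robust enough that optimizing it in step 2 does not land on ‘faciness’ instead.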
Katja argues—and others concur—that maybe values are not as fragile as EY predicted, because DL now regularly learns complex concepts to superhuman accuracy—including visual models of faces.
>Are you really arguing that, based on evidence like this, a generalization of modern AI wouldn’t automatically produce horrific or deadly results when asked to copy human values?
Obviously that totally depends on the system and how the human values are learned—but no, that certainly isn’t the automatic result if we continue down the path of reverse engineering the brain, including its altruism mechanisms.
I may reply to this more fully, but first I’d like you to acknowledge that you cannot in fact point to a false prediction by EY here, and in the exact post you seemed to be referring to, he says that his view is compatible with this sort of AI producing realistic sculptures of human faces!

As someone who often agrees with Jake: c’mon Jake, own up to it, EY has said reasonable things before and you were wrong :P

(edit: oops, meant to reply to @jacob_cannell)

Wrong about what? Of course EY has said many reasonable and insightful things.

Oh, do you mean this text you quoted?
Now you, finally presented with a tiny molecular smiley—or perhaps a very realistic tiny sculpture of a human face—know at once that this is not what you want to count as a smile.
The thing producing the very realistic tiny sculpture of a human face is a superintelligence, not some initial human-designed ML system that is used to create the AI’s utility function.
What post? All I quoted recently was “Complex Value Systems are Required to Realize Valuable Futures”, which does not appear to contain the word ‘sculpture’.
And more importantly, to prevent deceptive alignment from happening, which would allow a treacherous turn.
A lot of overrated alignment plans have the property that they achieve outer alignment at optimum (that is, the values you want to instill do not break at optimality), but use handwavium to bypass deceptive alignment, proxy alignment, and suboptimality alignment.
(Jacob Cannell is better than Alex Turner at this, since he incorporates an AI sandbox which, importantly, prevents the AI from knowing it’s in a simulation.)