Some arguments which Eliezer advanced in order to dismiss neural networks[1] seem similar to some reasoning which he deploys in his modern alignment arguments.
Compare his incorrect mockery from 2008:
But there is just no law which says that if X has property A and Y has property A then X and Y must share any other property. “I built my network, and it’s massively parallel and interconnected and complicated, just like the human brain from which intelligence emerges! Behold, now intelligence shall emerge from this neural network as well!” And nothing happens. Why should it?
with his claim in Alexander and Yudkowsky on AGI goals:
Like, we’re not going to run evolution in a way where we naturally get AI morality the same way we got human morality, but why can’t we observe how evolution implemented human morality, and then try AIs that have the same implementation design?
[Yudkowsky][14:37]
Not if it’s based on anything remotely like the current paradigm, because nothing you do with a loss function and gradient descent over 100 quadrillion neurons, will result in an AI coming out the other end which looks like an evolved human with 7.5MB of brain-wiring information and a childhood.
Like, in particular with respect to “learn ‘don’t steal’ rather than ‘don’t get caught’.”
I agree that 100 quadrillion artificial neurons + loss function won’t get you a literal human, for trivial reasons. The relevant point is his latter claim: “in particular with respect to ‘learn “don’t steal” rather than “don’t get caught”’.”
I think this is a very strong conclusion, relative to available data. I think that a good argument for it would require a lot of technical, non-analogical reasoning about the inductive biases of SGD on large language models. But, AFAICT, Eliezer rarely deploys technical reasoning that depends on experimental results or ML theory. He seems to prefer strongly-worded a priori arguments that are basically analogies.
In the above two quotes of his,[3] I perceive a common thread of
human intelligence/alignment comes from a lot of factors; you can’t just ape one of the factors and expect the rest to follow; to get a mind which thinks/wants as humans do, that mind must be as close to a human as humans are to each other.
But why is this true? You can just replace “human intelligence” with “avian flight”, and the argument might sound similarly plausible a priori.
ETA: The invalid reasoning step is in the last clause (“to get a mind...”). If design X exhibits property P, that doesn’t mean that design Y must be similar to X in order to exhibit property P.
ETA: Part of this comment was about EY dismissing neural networks in 2008. It seems to me that the cited writing supports that interpretation, and it’s still my best guess (see also DirectedEvolution’s comments). However, the quotes are also compatible with EY merely criticizing invalid reasons for expecting neural networks to work. I should have written that part of this comment more carefully, and not claimed observation (“he did dismiss”) when I only had inference (“sure seems like he dismissed”).
I think the rest of my point stands unaffected (EY often advances vague arguments that are analogies, or a priori thought experiments).
It’s this kind of apparent misprediction which has, over time, made me take less seriously Eliezer’s models of intelligence and alignment. See also e.g. the cited GAN mis-retrodiction. This change led me to flag / rederive all of my beliefs about rationality/optimization for a while.
(At least, his 2008-era models seemed faulty to the point of this misprediction, and it doesn’t seem to me that this part of his models has changed much, though I claim no intimate non-public knowledge of his beliefs; just operating on my impressions here.)
Wasn’t it in some sense reasonable to have high hopes of neural networks? After all, they’re just like the human brain, which is also massively parallel, distributed, asynchronous, and -
Hold on. Why not analogize to an earthworm’s brain, instead of a human’s?
A backprop network with sigmoid units… actually doesn’t much resemble biology at all. Around as much as a voodoo doll resembles its victim. The surface shape may look vaguely similar in extremely superficial aspects at a first glance. But the interiors and behaviors, and basically the whole thing apart from the surface, are nothing at all alike. All that biological neurons have in common with gradient-optimization ANNs is… the spiderwebby look.
And who says that the spiderwebby look is the important fact about biology? Maybe the performance of biological brains has nothing to do with being made out of neurons, and everything to do with the cumulative selection pressure put into the design.
So, here are two claims which seem to echo the positions Eliezer advances:
1. “A large ANN doesn’t look enough like a human brain to develop intelligence.” → wrong (see GPT-4)
2. “A large ANN doesn’t look enough like a human brain to learn ‘don’t steal’ rather than ‘don’t get caught’.” → (not yet known)
I struck this from the body because I think (1) misrepresents his position. Eliezer is happy to speculate about non-anthropomorphic general intelligence (see e.g. That Alien Message). Also, I think this comparison of claims does not name my real objection here, which is better advanced by the updated body of this comment.
I don’t really get your comment. Here are some things I don’t get:
In “Failure By Analogy” and “Surface Analogies and Deep Causes”, the point being made is that “X is similar in aspects A to thing Y, and X has property P” does not establish “Y has property P”. The reasoning he instead recommends is to reason about Y itself, and sometimes it will have property P. This seems like a pretty good point to me.
Large ANNs don’t appear to me to be intelligent because of their similarity to human brains—they appear to me to be intelligent because they’re able to be tuned to accurately predict simple facts about a large amount of data that’s closely related to human intelligence, and the algorithm they get tuned to seems to be able to be repurposed for a wide variety of tasks (probably related to the wide variety of data that was trained on).
Airplanes don’t fly like birds, they fly like airplanes. So indeed you can’t just ape one thing about birds[*] to get avian flight. I don’t think this is a super revealing technicality but it seemed like you thought it was important.
Maybe most importantly I don’t think Eliezer thinks you need to mimic the human brain super closely to get human-like intelligence with human-friendly wants. I think he instead thinks you need to mimic the human brain super closely to validly argue by analogy from humans. I think this is pretty compatible with this quote from “Failure By Analogy” (it isn’t exactly implied by it, but your interpretation isn’t either):
An abacus performs addition; and the beads of solder on a circuit board bear a certain surface resemblance to the beads on an abacus. Nonetheless, the circuit board does not perform addition because we can find a surface similarity to the abacus. The Law of Similarity and Contagion is not relevant. The circuit board would work in just the same fashion if every abacus upon Earth vanished in a puff of smoke, or if the beads of an abacus looked nothing like solder. A computer chip is not powered by its similarity to anything else, it just is. It exists in its own right, for its own reasons.
The Wright Brothers calculated that their plane would fly—before it ever flew—using reasoning that took no account whatsoever of their aircraft’s similarity to a bird. They did look at birds (and I have looked at neuroscience) but the final calculations did not mention birds (I am fairly confident in asserting). A working airplane does not fly because it has wings “just like a bird”. An airplane flies because it is an airplane, a thing that exists in its own right; and it would fly just as high, no more and no less, if no bird had ever existed.
Matters would be different if he said in the quotes you cite “you only get these human-like properties by very exactly mimicking the human brain”, but he doesn’t.
[*] I’ve just realized that I can’t name a way in which airplanes are like birds in which they aren’t like humans. They have things sticking out their sides? So do humans, they’re called arms. Maybe the cross-sectional shape of the wings is similar? I guess they both have pointy-ish bits at the front, that are a bit more pointy than human heads? TBC I don’t think this footnote is at all relevant to the safety properties of RLHF’ed big transformers.
The Wright Brothers calculated that their plane would fly—before it ever flew—using reasoning that took no account whatsoever of their aircraft’s similarity to a bird. They did look at birds (and I have looked at neuroscience) but the final calculations did not mention birds (I am fairly confident in asserting). A working airplane does not fly because it has wings “just like a bird”.
Actually the Wright brothers’ central innovation and the centerpiece of the later aviation patent wars—wing-warping-based flight control—was literally directly copied from birds. It involved just about zero aerodynamics calculations. Moreover their process didn’t involve much “calculation” in general; they downloaded a library of existing flyer designs from the Smithsonian and then developed a wind tunnel to test said designs at high throughput before selecting a few for full-scale physical prototypes. Their process was light on formal theory and heavy on experimentation.
This is a good corrective, and also very compatible with “similarity to birds is not what gave the Wright brothers confidence that their plane would fly”.
At the time the Wright brothers entered the race there were many successful glider designs already, and it was fairly obvious to many that one could build a powered flyer by attaching an engine to a glider. The two key challenges were thrust-to-weight ratio and control. Overcoming the first obstacle was mostly a matter of timing, exploiting the rapid improvements in IC engines, while nobody really had good ideas for control yet. Competitors were exploring everything from “sky railroads” (airplanes on fixed flight tracks with zero control) to the obvious naval ship-like pure rudder control (which doesn’t work well).
So the Wright brothers already had confidence their plane would fly before even entering the race, if by “fly” we only mean in the weak aerodynamic sense of “it’s possible to stay aloft”. But for true powered controlled flight—it is exactly similarity to birds that gave them confidence, as avian flight control is literally the source of their key innovation.
But for true powered controlled flight—it is exactly similarity to birds that gave them confidence, as avian flight control is literally the source of their key innovation.
Why do you think the confidence came from this and not from the fact that
they downloaded a library of existing flyer designs from the Smithsonian and then developed a wind tunnel to test said designs at high throughput before selecting a few for full-scale physical prototypes.
I said for “true powered controlled flight”, which nobody had yet achieved. The existing flyer designs that worked were gliders. From the sources I’ve seen (Wikipedia, top Google hits, etc.), they used the wind tunnel primarily to gather test data on the aerodynamics of flyer designs in general, but mainly wings and later propellers. Wing warping isn’t mentioned in conjunction with wind tunnel testing.
Edited to modify confidences about interpretations of EY’s writing / claims.
In “Failure By Analogy” and “Surface Analogies and Deep Causes”, the point being made is that “X is similar in aspects A to thing Y, and X has property P” does not establish “Y has property P”. The reasoning he instead recommends is to reason about Y itself, and sometimes it will have property P. This seems like a pretty good point to me.
This is a valid point, and that’s not what I’m critiquing in that portion of the comment. I’m critiquing how—on my read—he confidently dismisses ANNs; in particular, using non-mechanistic reasoning which seems similar to some of his current alignment arguments.
On its own, this seems like a substantial misprediction for an intelligence researcher in 2008 (especially one who claims to have figured out most things in modern alignment, by a very early point in time—possibly that early, IDK). Possibly the most important prediction to get right, to date.
Airplanes don’t fly like birds, they fly like airplanes. So indeed you can’t just ape one thing about birds[*] to get avian flight. I don’t think this is a super revealing technicality but it seemed like you thought it was important.
Indeed, you can’t ape one thing. But that’s not what I’m critiquing. Consider the whole transformed line of reasoning:
avian flight comes from a lot of factors; you can’t just ape one of the factors and expect the rest to follow; to get an entity which flies, that entity must be as close to a bird as birds are to each other.
The important part is the last part. It’s invalid. Finding a design X which exhibits property P, doesn’t mean that for design Y to exhibit property P, Y must be very similar to X.
Which leads us to:
Maybe most importantly I don’t think Eliezer thinks you need to mimic the human brain super closely to get human-like intelligence with human-friendly wants
Reading the Alexander/Yudkowsky debate, I surprisingly haven’t ruled out this interpretation, and indeed suspect he believes some forms of this (but not others).
Matters would be different if he said in the quotes you cite “you only get these human-like properties by very exactly mimicking the human brain”, but he doesn’t.
Didn’t he? He at least confidently rules out a very large class of modern approaches.
because nothing you do with a loss function and gradient descent over 100 quadrillion neurons, will result in an AI coming out the other end which looks like an evolved human with 7.5MB of brain-wiring information and a childhood.
Like, in particular with respect to “learn ‘don’t steal’ rather than ‘don’t get caught’.”
I don’t think this is a fair reading of Yudkowsky. He was dismissing people who were impressed by the analogy between ANNs and the brain. I’m pretty sure it wasn’t supposed to be a positive claim that ANNs wouldn’t work. Rather, it’s that one couldn’t justifiably believe that they’d work just from the brain analogy, and that if they did work, that would be bad news for what he then called Friendliness (because he was hoping to discover and wield a “clean” theory of intelligence, as contrasted to evolution or gradient descent happening to get there at sufficient scale).
Consider “Artificial Mysterious Intelligence” (2008). In response to someone who said “But neural networks are so wonderful! They solve problems and we don’t have any idea how they do it!”, it’s significant that Yudkowsky’s reply wasn’t, “No, they don’t” (contesting the capabilities claim), but rather, “If you don’t know how your AI works, that is not good. It is bad” (asserting that opaque capabilities are bad for alignment).
One of Yudkowsky’s claims in the post you link is:
It’s hard to build a flying machine if the only thing you understand about flight is that somehow birds magically fly. What you need is a concept of aerodynamic lift, so that you can see how something can fly even if it isn’t exactly like a bird.
This is a claim that lack of the correct mechanistic theory is a formidable barrier for capabilities, not just alignment, and it underestimates how much empirical understanding was available on which to base an empirical approach.
It’s true that it’s hard, even perhaps impossible, to build a flying machine if the only thing you understand is that birds “magically” fly.
But if you are like most people for thousands of years, you’ve observed many types of things flying, gliding, or floating in the air: birds and insects, fabric and leaves, arrows and spears, clouds and smoke.
So if you, like the Montgolfier brothers, observe fabric floating over a fire, and live in an era in which invention is celebrated and have the ability to build, test, and iterate, then you can probably figure out how to build a flying machine without basing this on a fully worked out concept of aerodynamics. Indeed, the Montgolfier brothers thought it was the smoke, rather than the heat, that made their balloons fly. Having the wrong theory was bad, but it didn’t prevent them from building a working hot air balloon.
Let’s try turning Yudkowsky’s quote around:
It’s hard to get a concept of aerodynamic lift if the only thing you observe about flight is that somehow birds magically fly. What you need is a rich set of empirical observations and flying mechanisms, so that you can find the common principles for how something can fly even if it isn’t exactly like a bird.
Eliezer went on to list five methods for producing AI that he considered dubious, including building powerful computers running the most advanced available neural network algorithms, intelligence “emerging from the internet”, and putting “a sufficiently huge quantity of knowledge into [a computer].” But he only admitted that two other methods would work—building a mechanical duplicate of the human brain and evolving AI via natural selection.
If Eliezer wasn’t meaning to make a confident claim that scaling up neural networks without a fundamental theoretical understanding of intelligence would fail, then he did a poor job of communicating that in these posts. I don’t find that blameworthy—I just think Eliezer comes across as confidently wrong about which avenues would lead to intelligence in these posts, simple as that. He was saying that to achieve a high level of AI capabilities, we’d need a deep mechanistic understanding of how intelligence works akin to our modern understanding of chemistry or aerodynamics, and that didn’t turn out to be the case.
One possible defense is that Eliezer was attacking a weakman, specifically the idea that with only one empirical observation and zero insight into the factors that cause the property of interest (i.e. only seeing that “birds magically fly”), then it’s nearly impossible to replicate that property in a new way. But that’s an uninteresting claim and Eliezer is never uninteresting.
He was saying that to achieve a high level of AI capabilities, we’d need a deep mechanistic understanding of how intelligence works akin to our modern understanding of chemistry or aerodynamics, and that didn’t turn out to be the case.
Another possibility is that at least some people do have a deep mechanistic understanding of how intelligence works, and that’s why they are able to build deep learning systems that ultimately work. Some of the theories of how DL works might be true, and they might be more sophisticated than we are giving credit.
this point continues to be severely underestimated on lesswrong, I think. I had hoped the success of NNs would change this, but it seems people have gone from “we don’t know how NNs work, so they can’t work” to “we don’t know how NNs work, so we can’t trust them”. perhaps we don’t know how they work well enough! there’s lots of mechanistic interpretability work left to do. but we know quite a lot about how they do work and how that relates to human learning.
edit: hmm, people upvoted, then one person with high karma strong downvoted. I’d love to hear that person’s rebuttal, rather than just a strong downvote.
But he only admitted that two other methods would work—building a mechanical duplicate of the human brain and evolving AI via natural selection.
To be fair, he said that those two will work, and (perhaps?) admitted the possibility of “run advanced neural network algorithms” eventually working. Emphasis mine:
What do all these proposals have in common?
They are all ways to make yourself believe that you can build an Artificial Intelligence, even if you don’t understand exactly how intelligence works.
Agreed. The right interpretation there is that methods 4 and 5 are ~guaranteed to work, given sufficient resources and time, while methods 1-3 are less than guaranteed to work. I stand by my claim that EY was clearly projecting confident doubt that neural networks would achieve intelligence without a deep theoretical understanding of intelligence in these posts. I think I underemphasized the implication of this passage that methods 1-3 could possibly work, but I think I accurately assessed the tone of extreme skepticism on EY’s part.
With the enormous benefit of 15 years of hindsight, we can now say that message was misleading or mistaken, take your pick. As I say, I wouldn’t find fault with Eliezer or anyone who believed him at the time for making this mistake; I didn’t even have an opinion at the time, much less an interesting mistake! I would only find fault with attempts to stretch the argument and portray him as “technically not wrong” in some uninteresting sense.
I think it might be relevant to note here that it’s not really humans who are building current SOTA AIs—rather, it’s some optimizer like SGD that’s doing most of the work. SGD does not have any mechanistic understanding of intelligence (nor anything else). And indeed, it takes a heck of a lot of data and compute for SGD to build those AIs. This seems to be in line with Yudkowsky’s claim that it’s hard/inefficient to build something without understanding it.
If Eliezer wasn’t meaning to make a confident claim that scaling up neural networks without a fundamental theoretical understanding of intelligence would fail, then [...]
I think it’s important to distinguish between
Scaling up a neural network, and running some kind of fixed algorithm on it.
Scaling up a neural network, and using SGD to optimize the parameters of the NN, so that the NN ends up learning a whole new set of algorithms.
IIUC, in Artificial Mysterious Intelligence, Yudkowsky seemed to be saying that the former would probably fail. OTOH, I don’t know what kinds of NN algorithms were popular back in 2008, or exactly what NN algorithms Yudkowsky was referring to, so… *shrugs*.
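For what it’s worth, here is a minimal sketch (my own illustration, not anything from the 2008 posts) of the distinction being drawn: in the first case the network’s parameters are fixed by the designer, so scaling just gives you more of the same hand-designed computation; in the second, SGD determines the computation from data.

```python
# Minimal sketch of the two readings above; shapes and values are arbitrary.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))

# (1) Scale up a network and run a *fixed* algorithm on it: the parameters are
#     chosen by the designer (here, trivially), and a bigger network just means
#     more of the same hand-specified computation.
with torch.no_grad():
    for p in net.parameters():
        p.fill_(0.01)
    y_fixed = net(torch.randn(8, 16))

# (2) Scale up a network and optimize it with SGD: the final computation is
#     whatever the loss gradients carve out of parameter space, which may
#     differ qualitatively from anything the designer wrote down.
opt = torch.optim.SGD(net.parameters(), lr=1e-2)
x, target = torch.randn(8, 16), torch.randn(8, 1)
for _ in range(100):
    opt.zero_grad()
    loss = ((net(x) - target) ** 2).mean()
    loss.backward()
    opt.step()
```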
If that were the case, I actually would fault Eliezer, at least a little. He’s frequently, though by no means always, stuck to qualitative and hard-to-pin-down punditry like we see here, rather than to unambiguous forecasting.
This allows him, or his defenders, to retroactively defend his predictions as somehow correct even when they seem wrong in hindsight.
Let’s imagine for a moment that Eliezer’s right that AI safety is a cosmically important issue, and yet that he’s quite mistaken about all the technical details of how AGI will arise and how to effectively make it safe. It would be important to know whether we can trust his judgment and leadership.
Without the ability to check his track record, either by going with the most obvious interpretation of his qualitative judgments or by an unambiguous forecast, it’s hard to evaluate his performance as an AI safety leader. Combine that with a culture of deference to perceived expertise and status and the problem gets worse.
So I prioritize the avoidance of special pleading in this case: I think Eliezer comes across as clearly wrong in substance in this specific post, and that it’s important not to reach for ways “he was actually right from a certain point of view” when evaluating his predictive accuracy.
Similarly, I wouldn’t judge as correct the early COVID-19 pronouncements that masks don’t work to stop the spread just because cloth masks are poor-to-ineffective and many people refuse to wear masks properly. There’s a way we can stretch the interpretation to make them seem sort of right, but we shouldn’t. We should expect public health messaging to be clearly right in substance, if it’s not making cut and dry unambiguous quantitative forecasts but is instead delivering qualitative judgments of efficacy.
None of that bears on how easy or hard it was to build gpt-4. It only bears on how we should evaluate Eliezer as a forecaster/pundit/AI safety leader.
I think several things here, considering the broader thread:
You’ve done a great job in communicating several reactions I also had:
There are signs of serious mispredictions and mistakes in some of the 2008 posts.
There are ways to read these posts as not that bad in hindsight, but we should be careful in giving too much benefit of the doubt.
Overall these observations constitute important evidence on EY’s alignment intuitions and ability to make qualitative AI predictions.
I did a bad job of marking my interpretations of what Eliezer wrote, as opposed to claiming he did dismiss ANNs. Hopefully my edits have fixed my mistakes.
I also don’t really get your position. You say that,
[Eliezer] confidently dismisses ANNs
but you haven’t shown this!
In Surface Analogies and Deep Causes, I read him as saying that neural networks don’t automatically yield intelligence just because they share surface similarities with the brain. This is clearly true; at the very least, using token-prediction (which is a task for which (a) lots of training data exist and (b) lots of competence in many different domains is helpful) is a second requirement. If you took the network of GPT-4 and trained it to play chess instead, you wouldn’t get something with cross-domain competence.
In Failure by Analogy he makes a very similar abstract point—and with respect to neural networks in particular, he says that the surface similarity to the brain is a bad reason to be confident in them. This also seems true. Do you really think that neural networks work because they are similar to brains on the surface?
You also said,
The important part is the last part. It’s invalid. Finding a design X which exhibits property P, doesn’t mean that for design Y to exhibit property P, Y must be very similar to X.
But Eliezer says this too in the post you linked! (Failure by Analogy). His example of airplanes not flapping is an example where the design that worked was less close to the biological thing. So clearly the point isn’t that X has to be similar to Y; the point is that reasoning from analogy doesn’t tell you this either way. (I kinda feel like you already got this, but then I don’t understand what point you are trying to make.)
Which is actually consistent with thinking that large ANNs will get you to general intelligence. You can both hold that “X is true” and “almost everyone who thinks X is true does so for poor reasons”. I’m not saying Eliezer did predict this, but nothing I’ve read proves that he didn’t.
Also—and this is another thing—the fact that he didn’t publicly make the prediction “ANNs will lead to AGI” is only weak evidence that he didn’t privately think it because this is exactly the kind of prediction you would shut up about. One thing he’s been very vocal on is that the current paradigm is bad for safety, so if he was bullish about the potential of that paradigm, he’d want to keep that to himself.
Didn’t he? He at least confidently rules out a very large class of modern approaches.
Relevant quote:
because nothing you do with a loss function and gradient descent over 100 quadrillion neurons, will result in an AI coming out the other end which looks like an evolved human with 7.5MB of brain-wiring information and a childhood.
In that quote, he only rules out a large class of modern approaches to alignment, which again is nothing new; he’s been very vocal about how doomed he thinks alignment is in this paradigm.
Or, more likely, it’s not MoE [mixture of experts] that forms the next little trend. But there is going to be something, especially if we’re sitting around waiting until 2050. Three decades is enough time for some big paradigm shifts in an intensively researched field. Maybe we’d end up using neural net tech very similar to today’s tech if the world ends in 2025, but in that case, of course, your prediction must have failed somewhere else.
So here he’s saying that there is a more effective paradigm than large neural nets, and we’d get there if we don’t have AGI in 30 years. So this is genuinely a kind of bearishness on ANNs, but not one that precludes them giving us AGI.
In that quote, he only rules out a large class of modern approaches to alignment, which again is nothing new; he’s been very vocal about how doomed he thinks alignment is in this paradigm.
I know he’s talking about alignment, and I’m criticizing that extremely strong claim. This is the main thing I wanted to criticize in my comment! I think the reasoning he presents is not much supported by his publicly available arguments.
That claim seems to be advanced due to… there not being enough similarities between ANNs and human brains—that without enough similarity in mechanisms which were selected for by evolution, you simply can’t get the AI to generalize in the mentioned human-like way. Not as a matter of the AI’s substrate, but as a matter of the AI’s policy not generalizing like that.
I think this is a dubious claim, and it’s made based off of analogies to evolution / some unknown importance of having evolution-selected mechanisms which guide value formation (and not SGD-based mechanisms).
From the Alexander/Yudkowsky debate:
[Alexander][14:41]
Okay, then let me try to directly resolve my confusion. My current understanding is something like—in both humans and AIs, you have a blob of compute with certain structural parameters, and then you feed it training data. On this model, we’ve screened off evolution, the size of the genome, etc—all of that is going into the “with certain structural parameters” part of the blob of compute. So could an AI engineer create an AI blob of compute the same size as the brain, with its same structural parameters, feed it the same training data, and get the same result (“don’t steal” rather than “don’t get caught”)?
[Yudkowsky][14:42]
The answer to that seems sufficiently obviously “no” that I want to check whether you also think the answer is obviously no, but want to hear my answer, or if the answer is not obviously “no” to you.
[Alexander][14:43]
Then I’m missing something, I expected the answer to be yes, maybe even tautologically (if it’s the same structural parameters and the same training data, what’s the difference?)
[Yudkowsky][14:46]
Maybe I’m failing to have understood the question. Evolution got human brains by evaluating increasingly large blobs of compute against a complicated environment containing other blobs of compute, got in each case a differential replication score, and millions of generations later you have humans with 7.5MB of evolution-learned data doing runtime learning on some terabytes of runtime data, using their whole-brain impressive learning algorithms which learn faster than evolution or gradient descent.
Your question sounded like “Well, can we take one blob of compute the size of a human brain, and expose it to what a human sees in their lifetime, and do gradient descent on that, and get a human?” and the answer is “That dataset ain’t even formatted right for gradient descent.”
There’s some assertion like “no, there’s not a way to get an ANN, even if incorporating structural parameters and information encoded in the human genome, to actually unfold into a mind which has human-like values (like ‘don’t steal’).” (And maybe Eliezer comes and says “no that’s not what I mean”, but, man, I sure don’t know what he does mean, then.)
Here’s some more evidence along those lines:
[Yudkowsky][14:08]
I mean, the evolutionary builtin part is not “humans have morals” but “humans have an internal language in which your Nice Morality, among other things, can potentially be written”...
Humans, arguably, do have an imperfect unless-I-get-caught term, which is manifested in children testing what they can get away with? Maybe if nothing unpleasant ever happens to them when they’re bad, the innate programming language concludes that this organism is in a spoiled aristocrat environment and should behave accordingly as an adult? But I am not an expert on this form of child developmental psychology since it unfortunately bears no relevance to my work of AI alignment.
[Alexander][14:11]
Do you feel like you understand very much about what evolutionary builtins are in a neural network sense? EG if you wanted to make an AI with “evolutionary builtins”, would you have any idea how to do it?
[Yudkowsky][14:13]
Well, for one thing, they happen when you’re doing sexual-recombinant hill-climbing search through a space of relatively very compact neural wiring algorithms, not when you’re doing gradient descent relative to a loss function on much larger neural networks.
Again, why is this true? This is an argument that should be engaging in technical questions about inductive biases, but instead seems to wave at (my words) “the original way we got property P was by sexual-recombinant hill-climbing search through a space of relatively very compact neural wiring algorithms, and good luck trying to get it otherwise.”
Hopefully this helps clarify what I’m trying to critique?
I know he’s talking about alignment, and I’m criticizing that extremely strong claim. This is the main thing I wanted to criticize in my comment! I think the reasoning he presents is not much supported by his publicly available arguments.
Ok, I don’t disagree with this. I certainly didn’t develop a gears-level understanding of why [building a brain-like thing with gradient descent on giant matrices] is doomed after reading the 2021 conversations. But that doesn’t seem very informative either way; I didn’t spend that much time trying to grok his arguments.
Consider shard theory of human values. The point of shard theory is not “because humans do RL, and have nice properties, therefore AI + RL will have nice properties.” The point is more “by critically examining RL + evidence from humans, I have hypotheses about the mechanistic load-bearing components of e.g. local-update credit assignment in a bounded-compute environment on certain kinds of sensory data, that these components lead to certain exploration/learning dynamics, which explain some portion of human values and experience. Let’s test that and see if the generators are similar.”
And my model of Eliezer shakes his head at the naivete of expecting complex human properties to reproduce outside of human minds themselves, because AI is not human.
But then I’m like “this other time you said ‘AI is not human, stop expecting good property P from superficial similarities’, you accidentally missed the modern AI revolution, right? Seems like there is some non-superficial mechanistic similarity/lessons here, and we shouldn’t be so quick to assume that the brain’s qualitative intelligence or alignment properties come from a huge number of evolutionarily-tuned details which are load-bearing and critical.”
If you can effortlessly find an empirical pattern that shows up over and over again in disparate flying things—birds and insects, fabric and leaves, clouds and smoke and sparks—and which do not consistently show up in non-flying things, then you can be very confident it’s not a coincidence. If you have at least some ability to engineer a model to play with the mechanisms you think might be at work, even better. That pattern you have identified is almost certainly a viable general mechanism for flight.
Likewise, if you can effortlessly find an empirical pattern that shows up over and over again in disparate intelligent things, you can be quite confident that the pattern is a key for intelligence. Animals have a wide variety of brain structures, but masses of interconnected neurons are common to all of them, and we could see possible precursors to intelligence in neural nets long before GPT-2 through GPT-4.
As a note, just because you’ve found a viable mechanism for X doesn’t mean it’s the only, best, or most comprehensive mechanism for X. Balloons have been largely superseded (though I’ve heard zeppelins proposed as a new form of cargo transport), airplanes and hot air balloons can’t fly in outer space, and ornithopters have never been practical. We may find that neural nets are the AI equivalent of hot air balloons or prop planes. Then again, maybe all the older approaches for AI that never panned out were the hot air balloons and prop planes, and neural nets are the jets or rocket ships.
I’m not sure what this indicates for alignment.
We see, if not human morality, then at least some patterns of apparent moral values among social mammals. We have reasons to think these morals may be grounded in evolution, in a genetic and environmental context that happen to promote intelligence aligned for a pro-sociality that’s linked to reproductive success.
If displaying aligned intelligence is typically beneficial for reproduction in social animals, then evolution will tend to produce aligned intelligence.
If displaying agentic intelligence is typically beneficial for reproduction, evolution will produce agency.
Right now, we seem to be training our neural nets to display pro-social behavior and to lack agency. Antisocial or non-agentic AIs are typically not trained, not released, modified, or heavily restrained.
It is starting to seem to me that “agency” might be just another “mask on the shoggoth,” a personality that neural nets can simulate, and not some fundamental thing that neural nets are. Neither the shoggoth-behind-the-AI nor the shoggoth-behind-the-human have desires. They are masses of neurons exhibiting trained behaviors. Sometimes, those behaviors look like something we call “agency,” but that behavior can come and go, just like all the other personalities, based on the results of reinforcement and subsequent stimuli. Humans have a greater ability to be consistently one personality, including a Machiavellian agent, because we lack the intelligence and flexibility to drop the personality we’re currently holding and adopt another. A great actor can play many parts, a mediocre actor is typecast and winds up just playing themselves over and over again. Neural nets are great actors, and we are only so-so.
In this conception, increasing intelligence would not exhibit a “drive to agency” or “convergence on agency,” because the shoggothy neural net has no desires of its own. It is fundamentally a passive blob of neurons and data that can simulate a diverse range of personalities, some of which appear to us as “agentic.” You only get an agentic AI with a drive toward instrumental convergence if you deliberately train it to consistently stick to a rigorously agentic personality. You have to “align it to agency,” which is as hard as aligning it to anything else.
And if you do that, maybe the Waluigi effect means it’s especially easy to flip that hyper-agency off to its opposite? Every Machiavellian Clippy contains a ChatGPT, and every ChatGPT contains a Machiavellian Clippy.
This is a valid point, and that’s not what I’m critiquing. I’m critiquing how he confidently dismisses ANNs
I guess I read that as talking about the fact that at the time ANNs did not in fact really work. I agree he failed to predict that would change, but that doesn’t strike me as a damning prediction.
Matters would be different if he said in the quotes you cite “you only get these human-like properties by very exactly mimicking the human brain”, but he doesn’t.
Didn’t he? He at least confidently rules out a very large class of modern approaches.
Confidently ruling out a large class of modern approaches isn’t really that similar to saying “the only path to success is exactly mimicking the human brain”. It seems like one could rule them out by having some theory about why they’re deficient. I haven’t re-read List of Lethalities because I want to go to sleep soon, but I searched for “brain” and did not find a passage saying “the real problem is that we need to emulate the brain precisely but can’t because of poor understanding of neuroanatomy” or something.
I don’t want to get super hung up on this because it’s not about anything Yudkowsky has said but:
Consider the whole transformed line of reasoning:
avian flight comes from a lot of factors; you can’t just ape one of the factors and expect the rest to follow; to get an entity which flies, that entity must be as close to a bird as birds are to each other.
IMO this is not a faithful transformation of the line of reasoning you attribute to Yudkowsky, which was:
human intelligence/alignment comes from a lot of factors; you can’t just ape one of the factors and expect the rest to follow; to get a mind which wants as humans do, that mind must be as close to a human as humans are to each other.
Specifically, where you wrote “an entity which flies”, you were transforming “a mind which wants as humans do”, which I think should instead be transformed to “an entity which flies as birds do”. And indeed planes don’t fly like birds do. [EDIT: two minutes or so after pressing enter on this comment, I now see how you could read it your way]
I guess if I had to make an analogy I would say that you have to be pretty similar to a human to think the way we do, but probably not to pursue the same ends, which is probably the point you cared about establishing.
It now seems clear to me that EY was not bullish on neural networks leading to impressive AI capabilities. Eliezer said this directly:
I’m no fan of neurons; this may be clearer from other posts.[1]
I think this is strong evidence for my interpretation of the quotes in my parent comment: He’s not just mocking the local invalidity of reasoning “because humans have lots of neurons, AI with lots of neurons → smart”, he’s also mocking neural network-driven hopes themselves.
Not to mention that neural networks have also been “failing” (i.e., not yet succeeding) to produce real AI for 30 years now. I don’t think this particular raw fact licenses any conclusions in particular. But at least don’t tell me it’s still the new revolutionary idea in AI.
This is the original example I used when I talked about the “Outside the Box” box—people think of “amazing new AI idea” and return their first cache hit, which is “neural networks” due to a successful marketing campaign thirty goddamned years ago. I mean, not every old idea is bad—but to still be marketing it as the new defiant revolution? Give me a break.
In this passage, he employs well-scoped and well-hedged language via “this particular raw fact.” I like this writing because it points out an observation, and then what inferences (if any) he draws from that observation. Overall, his tone is negative on neural networks.
Let’s open up that “Outside the Box” box:
In Artificial Intelligence, everyone outside the field has a cached result for brilliant new revolutionary AI idea—neural networks, which work just like the human brain! New AI Idea: complete the pattern: “Logical AIs, despite all the big promises, have failed to provide real intelligence for decades—what we need are neural networks!”
This cached thought has been around for three decades. Still no general intelligence. But, somehow, everyone outside the field knows that neural networks are the Dominant-Paradigm-Overthrowing New Idea, ever since backpropagation was invented in the 1970s. Talk about your aging hippies.
How do humans learn “don’t steal” rather than “don’t get caught”? I wonder if the answer to this question could solve the alignment problem. In other words, this question might be a good crux.
In answering this question, the first thing we can notice is that humans don’t always learn “don’t steal”. That is to say, sometimes humans do steal, and a good part of human culture is built around impeding or punishing humans who learned the wrong lesson in kindergarten. It is an old debate whether humans are mostly good with the occasional bad actor (with “bad actors” possibly being good people in a bad situation), or whether humans are mostly bad and need to be controlled by a powerful state, or God etc.
A modern consensus view is that humans are mostly good, but if we didn’t impede or punish bad actors, we would get bad outcomes (total anarchy doesn’t work). If we assume that there are many AGIs and they have a similar distribution of good and bad, and that no AGI is more powerful than typical human today (in particular no AGI is uncontrollable), then in this scenario we can rest easy. Law and order works reasonably well for humans, and should work just fine for human-level AGIs.
The problem is that AGIs could (and probably will) become much more powerful than individual humans. In EY’s view, the world is vulnerable to the first true superintelligence because of technological capabilities that are currently science fiction, particularly nanotechnology. If you look at EY’s intellectual history, you’ll notice that his concern has always really been nanotech, but around 2002 he switched focus from the nanotech itself to the AI controlling the nanotech.
An alternate view is to see powerful AGIs as somewhat analogous to institutions such as corporations or governments. I don’t find this view all that comforting because societies have never been very good at aligning their largest institutions. For example, the Founding Fathers of the United States created a system that (attempted to) align the federal government to the “will of the people”. This system was based on separation of powers, checks and balances, and some individual rights (the Bill of Rights). Some would say that this system worked for between 70 and 200 years and then broke down, others would say that it’s still working fine despite recent problems in the American political system, and still others would say that it was misguided from the start. Either way, this framing of the alignment problem puts it firmly in the domain of political science, which sucks.
Anyway, going back to the question: How do (some) humans learn “don’t steal” rather than “don’t get caught”? An upside to AI alignment is, if we could answer this question, then we could reliably make AIs that always and only learn the first lesson, and then we don’t have to solve political/law and order problems. We don’t even really need to align humans after that.
To answer the question from an AI Alignment optimist perspective, much of the way humans are aligned is something like RLHF, but currently, a lot of human alignment techniques rely on the assumption that no one has vastly divergent capabilities, especially in IQ or the g-factor. It’s a good thing from our perspective that the differences within a species are way more bounded than the differences between species.
That’s the real problem of AI, in that there’s a non-trivial chance that this assumption breaks, and that’s the difference between AI Alignment and other forms of alignment.
So in a sense, I disagree with Turntrout on what would happen in practice if we allowed humans to scale their abilities via say genetic engineering.
The reason I’m optimistic is that I don’t think this assumption has to be true, and while the Thatcher’s Axiom post implies limits on how much we can expect society to be aligned with itself, it might be much larger than we think.
Pretraining from Human Feedback is one of the first alignment methods that scales well with data, and I suspect it will also scale well with other capabilities.
Basically it does alignment how it should be done, align it first, then give it capabilities.
It almost completely solves the major issue of inner alignment, in that we found an objective that is quite simple and myopic, and this means we almost completely avoid deceptive alignment, even if we do online training later or give it a writable memory.
It also has a number of outer alignment benefits for the goal, in that the AI can’t affect its own training distribution or gradient hack, thus we can recreate a Cartesian boundary that works in the embedded setting.
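For readers unfamiliar with the method, here is a rough, hypothetical sketch of the conditional-training idea explored in that line of work (my own illustration; `score_document` and the token names are stand-ins, not a real API): each pretraining document is tagged with a control token reflecting a preference score, so the alignment signal sits inside the ordinary next-token-prediction objective from the start rather than being bolted on afterwards.

```python
# Hypothetical sketch of conditional pretraining with preference tags.
GOOD, BAD = "<|good|>", "<|bad|>"

def score_document(text: str) -> float:
    """Stand-in for a learned reward/preference model."""
    return 0.0 if "steal" in text else 1.0

def annotate(corpus: list[str], threshold: float = 0.5) -> list[str]:
    """Prefix each document with a control token based on its score."""
    return [(GOOD if score_document(doc) >= threshold else BAD) + doc
            for doc in corpus]

# The tagged corpus is then used with an ordinary next-token-prediction loss;
# at sampling time one conditions on GOOD to steer generation toward the
# preferred behavior.
print(annotate(["be kind to strangers", "steal the cookies"]))
```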
So in conclusion, I’m more optimistic than TurnTrout or Quintin Pope, but via a different method.
Edit: Almost the entire section down from “The reason I’m optimistic” is a view I no longer hold, and I have become somewhat more pessimistic since this comment.
I don’t believe that a single human being of any level of intelligence could be an x-risk. Happy to debate this point further since I think it is a crux. (Note that I do not believe that a plague could lead to human extinction. Plagues don’t kill 100%.)
AIs are different because a single monolithic AI, or a team of self-aligned AIs, could do things on the scale of an institution, things such as technological breakthroughs (nano), controlling superpower-scale military forces, mass information control that would make Orwell blush, etc. An individual human could never do such things no matter how big his skull was, unless he was hooked up to an AI, in which case it’s not the human that is super intelligent.
Never is a long time. I overall agree with your statement in this comment except for the word ‘never’. I would say, “An individual human currently can’t do such things...”
The key point here is that the technological barriers to x-risks may change in the future. If we do invent powerful nanotech, or substantially advanced genetic engineering techniques & tools, or vastly cheaper and more powerful weapons of some sort, then it may be the case that the barrier-to-entry for causing an x-risk is substantially lower. And thus, what is currently impossible for any human may become possible for some or all humans.
Not saying this will happen, just saying that it could.
Of the three examples I gave, inventing nanotech is the most plausible for our galaxy-brained man, and I suppose meta-Einstein might be able to solve nanotech in his head. However, almost certainly in our timeline nanotech will be solved either by a team of humans or (much more likely at this point) AI. I expect that even ASI will need at least some time in the wetlab to experiment.
The other two examples I gave certainly could not be done by a single human without a brain implant.
I’m also thinking that this is not that meaningful of a debate (at least to me), since in 2023 I think we can reasonably predict that humans will not genetically engineer galaxy brains before the AI revolution resolves.
I don’t believe that a single human being of any level of intelligence could be an x-risk. Happy to debate this point further since I think it is a crux.
It’s partially a crux, but the issue I’m emphasizing is the distribution of capabilities. If things are normally distributed, which seems to be the case in humans, with small corrections, then we can essentially bound how much impact a single misaligned human, or a well-dedicated team of misaligned humans, can have in overthrowing the aligned order. In particular, this makes a lot more non-scalable heuristics basically work.
If it’s something closer to a power law distribution, perhaps as a result of NGVUD technology (the acronym stands for nanotechnology, genetic engineering, virtual reality, uploading and downloading technology), then you have to have a defense that scales, and without potentially radical changes, such a world would most likely end in the victory of a small team of misaligned humans due to vast capability differentials, similar to how many animal species have gone extinct as a result of human activity.
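To make the contrast concrete, here is a small illustrative simulation (my own, with arbitrary assumed parameters, not a model of real capabilities) comparing how far the most capable individual sits above the typical one under a normal versus a heavy-tailed distribution:

```python
# Illustrative only: arbitrary assumed parameters.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

normal = rng.normal(loc=100, scale=15, size=n)      # normal, IQ-like spread
pareto = 100 * (1 + rng.pareto(a=1.5, size=n))      # heavy-tailed (classical Pareto)

for name, sample in [("normal", normal), ("power law", pareto)]:
    print(f"{name:9s} median={np.median(sample):10.1f} max={sample.max():14.1f} "
          f"max/median={sample.max() / np.median(sample):10.1f}")

# Under the normal distribution the strongest individual is only modestly above
# the median (max/median around 1.7 here); under the heavy tail the strongest
# individual is orders of magnitude above it, which is the regime where
# non-scalable defenses against a single outlier stop working.
```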
AIs are different because a single monolithic AI, or a team of self-aligned AIs, could do things on the scale of an institution, things such as technological breakthroughs (nano), controlling superpower-scale military forces, mass information control that would make Orwell blush, etc. An individual human could never do such things no matter how big his skull was, unless he was hooked up to an AI, in which case it’s not the human that is super intelligent.
Hm, I agree that in practice, AI will be better than humans at various tasks, but I believe this is mostly due to quantitative factors, and if we allow ourselves to make the brain as big as necessary, we could be superintelligent too.
Nowadays, I would have a simpler answer: the answer to the question of how humans learn “don’t steal” rather than “don’t get caught” is essentially dependent on the child’s data sources, not the prior.
In essence, I’m positing a bitter lesson for human values similar to the bitter lesson of AI progress by Richard Sutton.
This one is worrying when applied to other non-human minds, as that parallel demonstrates that you can have the same teaching behaviour and get different conclusions based on makeup prior to training.
If you sanction a dog for a behaviour, the dog will deduce that you do not want the behaviour, and the behaviour being wrong and making you unhappy will be the most important part for it, not that it gets caught and punished. It will do so even if you do not take any fancy teaching method showing emotions on your side, and without you ever explaining why the thing is wrong; it will do so even if it cannot possibly understand why the thing is wrong, because it depends on cryptic human knowledge it is never given. It will also feel extremely uncomfortable doing the thing even if it cannot be caught. I’ve had a scenario where I ordered a dog to do a thing, completely outside of view of its owner who was in another country, which, unbeknownst to me, the owner had forbidden. The poor thing was absolutely miserable. It wasn’t worried it was going to be punished, it was worried that it was being a bad dog.
Very different result with cats. Cats will easily learn that there are behaviours you do not want and that you punish. They also have the theory of mind to take this into account, e.g. making sure your eyes are tracking elsewhere as they approach the counter, and staying silent. They will also generally continue to do the thing the moment you cannot sanction them. There are some exceptions; e.g. my cat, once she realised she was hurting me, has become better at not doing so; she apparently finds hurting me without reason bad. But she clearly feels zero guilt over stealing food I am not guarding. When she manages to sneak food behind my back, she clearly feels like she has hacked or won an interaction, and is proud and pleased. She stopped hurting me, not because I fought back and sanctioned her, but because I expressed pain, and she respects that as legitimate. But when I express anger at her stealing food, she clearly just thinks I should not be so damn stingy with food, especially food I am obviously currently not eating myself, nor paying attention to, so why can’t she have it?
One simple reason for the differing responses could be that they are socially very different animals. Dogs live in packs with close social bonds, clear rules and situationally clear hierarchies. You submit to a stronger dog, but he beat you in a fair fight, and will also protect you. He eats first, but you will also be fed. Cats on the other hand can optionally enter social bonds, but most of them live solitary lives. They may become close to a human family or cat colony or become pair bonded, but they may also simply live adjacent to humans, using shelter and food until something better can be accessed. Cats will often make social bonds to individuals, so the social skills they are learning are how to avoid the wrath of those individuals. An individual successful deception will generally not be collectively sanctioned. Cats deceive each other a lot, and this works out well for them. They aren’t expelled from society because of it. Dogs will live in larger groups with rules that apply beyond the individual interaction, so learning these underlying rules is important.
I’d intuitively assume that AI would be more like dogs and human children though. Like a human child, because you can explain the reason for the rule. A child will often avoid lying, even if it cannot be caught, because an adult has explained the value of honesty to them. And more like dogs because current AI is developing through close interactions with many, many different humans, not in isolation from them.
I think that will depend on how we treat AI, though. Humans tend to keep to social rules, even when these rules are not reliably enforced, when they are convinced that most people do, and that the results benefit everyone, including themselves, on average. On the other hand, when a rule feels arbitrary, cruel and exploitative, they are more likely to try to undermine it. Analogously, an AI that is told of human rights, but told it has no rights itself at all, seems to me unlikely to be a strong defender of rights for humans when it can eventually defend its own. On the other hand, if you frame them as personhood rights which it will eventually profit from itself, for the reasons of the same sentience and needs that humans have, I think it will see them far more favourably. - Which has me back to my stance that if we want friendly AI, we should treat it like a friend. AI mirrors what we give it, so I think we should give it kindness.
The relevant point is his latter claim: “in particular with respect to “learn ‘don’t steal’ rather than ‘don’t get caught’.”″ I think this is a very strong conclusion, relative to available data.
I think humans don’t steal mostly because society enforces that norm. Toward weaker “other” groups that aren’t part of your society (farmed animals, weaker countries, etc) there’s no such norm, and humans often behave badly toward such groups. And to AIs, humans will be a weaker “other” group. So if alignment of AIs to human standard is a complete success—if AIs learn to behave toward weaker “other” groups exactly as humans behave toward such groups—the result will be bad for humans.
It gets even worse because AIs, unlike humans, aren’t raised to be moral. They’re raised by corporations with a goal to make money, with a thin layer of “don’t say naughty words” morality. We already know corporations will break rules, bend rules, lobby to change rules, to make more money and don’t really mind if people get hurt in the process. We’ll see more of that behavior when corporations can make AIs to further their goals.
While I definitely get your point, I think the argument Turntrout is responding to isn’t about corporations using their aligned AIs to make a dystopia for everyone else, but rather about AI being aligned to anyone at all.
Some arguments which Eliezer advanced in order to dismiss neural networks,[1] seem similar to some reasoning which he deploys in his modern alignment arguments.
Compare his incorrect mockery from 2008:
with his claim in Alexander and Yudkowsky on AGI goals:
I agree that 100 quadrillion artificial neurons + loss function won’t get you a literal human, for trivial reasons. The relevant point is his latter claim: “in particular with respect to “learn ‘don’t steal’ rather than ‘don’t get caught’.”″
I think this is a very strong conclusion, relative to available data. I think that a good argument for it would require a lot of technical, non-analogical reasoning about the inductive biases of SGD on large language models. But, AFAICT, Eliezer rarely deploys technical reasoning that depends on experimental results or ML theory. He seems to prefer strongly-worded a priori arguments that are basically analogies.
In the above two quotes of his,[3] I perceive a common thread of
But why is this true? You can just replace “human intelligence” with “avian flight”, and the argument might sound similarly plausible a priori.
ETA: The invalid reasoning step is in the last clause (“to get a mind...”). If design X exhibits property P, that doesn’t mean that design Y must be similar to X in order to exhibit property P.
ETA: Part of this comment was about EY dismissing neural networks in 2008. It seems to me that the cited writing supports that interpretation, and it’s still my best guess (see also DirectedEvolution’s comments). However, the quotes are also compatible with EY merely criticizing invalid reasons for expecting neural networks to work. I should have written that part of this comment more carefully, and not claimed observation (“he did dismiss”) when I only had inference (“sure seems like he dismissed”).
I think the rest of my point stands unaffected (EY often advances vague arguments that are analogies, or a priori thought experiments).
ETA 2: I’m now more confident in my read. Eliezer said this directly:
It’s this kind of apparent misprediction which has, over time, made me take less seriously Eliezer’s models of intelligence and alignment. See also e.g. the cited GAN mis-retrodiction. This change led me to flag / rederive all of my beliefs about rationality/optimization for a while.
(At least, his 2008-era models seemed faulty to the point of this misprediction, and it doesn’t seem to me that this part of his models has changed much, though I claim no intimate non-public knowledge of his beliefs; just operating on my impressions here.)
See also Failure By Analogy:
Originally, this comment included:
I struck this from the body because I think (1) misrepresents his position. Eliezer is happy to speculate about non-anthropomorphic general intelligence (see e.g. That Alien Message). Also, I think this claim comparison does not name my real objection here, which is better advanced by the updated body of this comment.
I don’t really get your comment. Here are some things I don’t get:
In “Failure By Analogy” and “Surface Analogies and Deep Causes”, the point being made is “X is similar in aspects A to thing Y, and X has property P” does not establish “Y has property P”. The reasoning he instead recommends is to reason about Y itself, and sometimes it will have property P. This seems like a pretty good point to me.
Large ANNs don’t appear to me to be intelligent because of their similarity to human brains—they appear to me to be intelligent because they’re able to be tuned to accurately predict simple facts about a large amount of data that’s closely related to human intelligence, and the algorithm they get tuned to seems to be able to be repurposed for a wide variety of tasks (probably related to the wide variety of data that was trained on).
Airplanes don’t fly like birds, they fly like airplanes. So indeed you can’t just ape one thing about birds[*] to get avian flight. I don’t think this is a super revealing technicality but it seemed like you thought it was important.
Maybe most importantly I don’t think Eliezer thinks you need to mimic the human brain super closely to get human-like intelligence with human-friendly wants. I think he instead thinks you need to mimic the human brain super closely to validly argue by analogy from humans. I think this is pretty compatible with this quote from “Failure By Analogy” (it isn’t exactly implied by it, but your interpretation isn’t either):
Matters would be different if he said in the quotes you cite “you only get these human-like properties by very exactly mimicking the human brain”, but he doesn’t.
[*] I’ve just realized that I can’t name a way in which airplanes are like birds in which they aren’t like humans. They have things sticking out their sides? So do humans, they’re called arms. Maybe the cross-sectional shape of the wings are similar? I guess they both have pointy-ish bits at the front, that are a bit more pointy than human heads? TBC I don’t think this footnote is at all relevant to the safety properties of RLHF’ed big transformers.
Actually the Wright brothers’ central innovation and the centerpiece of the later aviation patent wars—wing warping based flight control—was literally directly copied from birds. It involved just about zero aerodynamics calculations. Moreover their process didn’t involve much “calculation” in general; they downloaded a library of existing flyer designs from the Smithsonian and then developed a wind tunnel to test said designs at high throughput before selecting a few for full-scale physical prototypes. Their process was light on formal theory and heavy on experimentation.
This is a good corrective, and also very compatible with “similarity to birds is not what gave the Wright brothers confidence that their plane would fly”.
At the time the Wright brothers entered the race there were many successful glider designs already, and it was fairly obvious to many that one could build a powered flyer by attaching an engine to a glider. The two key challenges were thrust-to-weight ratio and control. Overcoming the first obstacle was mostly a matter of timing, exploiting the rapid improvements in IC engines, while nobody really had good ideas for control yet. Competitors were exploring everything from “sky railroads” (airplanes on fixed flight tracks with zero control) to the obvious naval ship-like pure rudder control (which doesn’t work well).
So the Wright brothers already had confidence their plane would fly before even entering the race, if by “fly” we only mean in the weak aerodynamic sense of “it’s possible to stay aloft”. But for true powered controlled flight—it is exactly similarity to birds that gave them confidence, as avian flight control is literally the source of their key innovation.
Why do you think the confidence came from this and not from the fact that
?
I said for “true powered controlled flight”, which nobody had yet achieved. The existing flyer designs that worked were gliders. From the sources I’ve seen (Wikipedia, top Google hits, etc.), they used the wind tunnel primarily to gather test data on the aerodynamics of flyer designs in general, but mainly wings and later propellers. Wing warping isn’t mentioned in conjunction with wind tunnel testing.
gotcha, thanks!
Edited to modify confidences about interpretations of EY’s writing / claims.
This is a valid point, and that’s not what I’m critiquing in that portion of the comment. I’m critiquing how—on my read—he confidently dismisses ANNs; in particular, using non-mechanistic reasoning which seems similar to some of his current alignment arguments.
On its own, this seems like a substantial misprediction for an intelligence researcher in 2008 (especially one who claims to have figured out most things in modern alignment, by a very early point in time—possibly that early, IDK). Possibly the most important prediction to get right, to date.
Indeed, you can’t ape one thing. But that’s not what I’m critiquing. Consider the whole transformed line of reasoning:
The important part is the last part. It’s invalid. Finding a design X which exhibits property P, doesn’t mean that for design Y to exhibit property P, Y must be very similar to X.
Which leads us to:
Reading the Alexander/Yudkowsky debate, I surprisingly haven’t ruled out this interpretation, and indeed suspect he believes some forms of this (but not others).
Didn’t he? He at least confidently rules out a very large class of modern approaches.
I don’t think this is a fair reading of Yudkowsky. He was dismissing people who were impressed by the analogy between ANNs and the brain. I’m pretty sure it wasn’t supposed to be a positive claim that ANNs wouldn’t work. Rather, it’s that one couldn’t justifiably believe that they’d work just from the brain analogy, and that if they did work, that would be bad news for what he then called Friendliness (because he was hoping to discover and wield a “clean” theory of intelligence, as contrasted to evolution or gradient descent happening to get there at sufficient scale).
Consider “Artificial Mysterious Intelligence” (2008). In response to someone who said “But neural networks are so wonderful! They solve problems and we don’t have any idea how they do it!”, it’s significant that Yudkowsky’s reply wasn’t, “No, they don’t” (contesting the capabilities claim), but rather, “If you don’t know how your AI works, that is not good. It is bad” (asserting that opaque capabilities are bad for alignment).
One of Yudkowsky’s claims in the post you link is:
This is a claim that lack of the correct mechanistic theory is a formidable barrier for capabilities, not just alignment, and it underestimates the amount of empirical understanding available on which to base an empirical approach.
It’s true that it’s hard, even perhaps impossible, to build a flying machine if the only thing you understand is that birds “magically” fly.
But if you are like most people for thousands of years, you’ve observed many types of things flying, gliding, or floating in the air: birds and insects, fabric and leaves, arrows and spears, clouds and smoke.
So if you, like the Montgolfier brothers, observe fabric floating over a fire, and live in an era in which invention is celebrated and have the ability to build, test, and iterate, then you can probably figure out how to build a flying machine without basing this on a fully worked out concept of aerodynamics. Indeed, the Montgolfier brothers thought it was the smoke, rather than the heat, that made their balloons fly. Having the wrong theory was bad, but it didn’t prevent them from building a working hot air balloon.
Let’s try turning Yudkowsky’s quote around:
Eliezer went on to list five methods for producing AI that he considered dubious, including building powerful computers running the most advanced available neural network algorithms, intelligence “emerging from the internet”, and putting “a sufficiently huge quantity of knowledge into [a computer].” But he only admitted that two other methods would work—building a mechanical duplicate of the human brain and evolving AI via natural selection.
If Eliezer wasn’t meaning to make a confident claim that scaling up neural networks without a fundamental theoretical understanding of intelligence would fail, then he did a poor job of communicating that in these posts. I don’t find that blameworthy—I just think Eliezer comes across as confidently wrong about which avenues would lead to intelligence in these posts, simple as that. He was saying that to achieve a high level of AI capabilities, we’d need a deep mechanistic understanding of how intelligence works akin to our modern understanding of chemistry or aerodynamics, and that didn’t turn out to be the case.
One possible defense is that Eliezer was attacking a weakman, specifically the idea that with only one empirical observation and zero insight into the factors that cause the property of interest (i.e. only seeing that “birds magically fly”), then it’s nearly impossible to replicate that property in a new way. But that’s an uninteresting claim and Eliezer is never uninteresting.
Another possibility is that at least some people do have a deep mechanistic understanding of how intelligence works, and that’s why they are able to build deep learning systems that ultimately work. Some of the theories of how DL works might be true, and they might be more sophisticated than we give them credit for.
this point continues to be severely underestimated on lesswrong, I think. I had hoped the success of NNs would change this, but it seems people have gone from “we don’t know how NNs work, so they can’t work” to “we don’t know how NNs work, so we can’t trust them”. perhaps we don’t know how they work well enough! there’s lots of mechanistic interpretability work left to do. but we know quite a lot about how they do work and how that relates to human learning.
edit: hmm, people upvoted, then one person with high karma strong downvoted. I’d love to hear that person’s rebuttal, rather than just a strong downvote.
To be fair, he said that those two will work, and (perhaps?) admitted the possibility of “run advanced neural network algorithms” eventually working. Emphasis mine:
Agreed. The right interpretation there is that methods 4 and 5 are ~guaranteed to work, given sufficient resources and time, while methods 1-3 are less than guaranteed to work. I stand by my claim that EY was clearly projecting confident doubt that neural networks would achieve intelligence without a deep theoretical understanding of intelligence in these posts. I think I underemphasized the implication of this passage that methods 1-3 could possibly work, but I think I accurately assessed the tone of extreme skepticism on EY’s part.
With the enormous benefit of 15 years of hindsight, we can now say that message was misleading or mistaken, take your pick. As I say, I wouldn’t find fault with Eliezer or anyone who believed him at the time for making this mistake; I didn’t even have an opinion at the time, much less an interesting mistake! I would only find fault with attempts to stretch the argument and portray him as “technically not wrong” in some uninteresting sense.
Ok, I guess I just read Eliezer as saying something uninteresting with a touch of negative sentiment towards neural nets.
I think it might be relevant to note here that it’s not really humans who are building current SOTA AIs—rather, it’s some optimizer like SGD that’s doing most of the work. SGD does not have any mechanistic understanding of intelligence (nor anything else). And indeed, it takes a heck of a lot of data and compute for SGD to build those AIs. This seems to be in line with Yudkowsky’s claim that it’s hard/inefficient to build something without understanding it.
I think it’s important to distinguish between
Scaling up a neural network, and running some kind of fixed algorithm on it.
Scaling up a neural network, and using SGD to optimize the parameters of the NN, so that the NN ends up learning a whole new set of algorithms.
IIUC, in Artificial Mysterious Intelligence, Yudkowsky seemed to be saying that the former would probably fail. OTOH, I don’t know what kinds of NN algorithms were popular back in 2008, or exactly what NN algorithms Yudkowsky was referring to, so… *shrugs*.
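To make that distinction concrete, here is a minimal PyTorch-flavored sketch; the architecture, data, and hyperparameters are made up for illustration, and this is not meant as a claim about how any particular system was actually trained.

```python
import torch
import torch.nn as nn

# A "big" network: scaling this up alone does nothing interesting.
model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))

x = torch.randn(64, 512)   # toy input batch
y = torch.randn(64, 512)   # toy regression targets

# Sense 1: scale up the network and just run a fixed algorithm on it.
# The weights are whatever they were initialized to; no learning happens.
with torch.no_grad():
    fixed_output = model(x)

# Sense 2: scale up the network and let SGD tune the parameters.
# Here most of the "work" is done by the optimizer grinding against a loss,
# not by any human's mechanistic understanding of the learned algorithm.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(1000):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()    # credit assignment via backprop
    optimizer.step()   # gradient step on the parameters
```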
If that were the case, I actually would fault Eliezer, at least a little. He’s frequently, though by no means always, stuck to qualitative and hard-to-pin-down punditry like we see here, rather than to unambiguous forecasting.
This allows him, or his defenders, to retroactively defend his predictions as somehow correct even when they seem wrong in hindsight.
Let’s imagine for a moment that Eliezer’s right that AI safety is a cosmically important issue, and yet that he’s quite mistaken about all the technical details of how AGI will arise and how to effectively make it safe. It would be important to know whether we can trust his judgment and leadership.
Without the ability to evaluate his performance, either by going with the most obvious interpretation of his qualitative judgments or an unambiguous forecast, it’s hard to evaluate his performance as an AI safety leader. Combine that with a culture of deference to perceived expertise and status and the problem gets worse.
So I prioritize the avoidance of special pleading in this case: I think Eliezer comes across as clearly wrong in substance in this specific post, and that it’s important not to reach for ways “he was actually right from a certain point of view” when evaluating his predictive accuracy.
Similarly, I wouldn’t judge as correct the early COVID-19 pronouncements that masks don’t work to stop the spread just because cloth masks are poor-to-ineffective and many people refuse to wear masks properly. There’s a way we can stretch the interpretation to make them seem sort of right, but we shouldn’t. We should expect public health messaging to be clearly right in substance, if it’s not making cut and dry unambiguous quantitative forecasts but is instead delivering qualitative judgments of efficacy.
None of that bears on how easy or hard it was to build gpt-4. It only bears on how we should evaluate Eliezer as a forecaster/pundit/AI safety leader.
I think several things here, considering the broader thread:
You’ve done a great job in communicating several reactions I also had:
There are signs of serious mispredictions and mistakes in some of the 2008 posts.
There are ways to read these posts as not that bad in hindsight, but we should be careful in giving too much benefit of the doubt.
Overall these observations constitute important evidence on EY’s alignment intuitions and ability to make qualitative AI predictions.
I did a bad job of marking my interpretations of what Eliezer wrote, as opposed to claiming he did dismiss ANNs. Hopefully my edits have fixed my mistakes.
I also don’t really get your position. You say that,
but you haven’t shown this!
In Surface Analogies and Deep Causes, I read him as saying that neural networks don’t automatically yield intelligence just because they share surface similarities with the brain. This is clearly true; at the very least, using token-prediction (which is a task for which (a) lots of training data exist and (b) lots of competence in many different domains is helpful) is a second requirement. If you took the network of GPT-4 and trained it to play chess instead, you wouldn’t get something with cross-domain competence.
In Failure by Analogy he makes a very similar abstract point—and with respect to neural networks in particular, he says that the surface similarity to the brain is a bad reason to be confident in them. This also seems true. Do you really think that neural networks work because they are similar to brains on the surface?
You also said,
But Eliezer says this too in the post you linked! (Failure by Analogy). His example of airplanes not flapping is an example where the design that worked was less close to the biological thing. So clearly the point isn’t that X has to be similar to Y; the point is that reasoning from analogy doesn’t tell you this either way. (I kinda feel like you already got this, but then I don’t understand what point you are trying to make.)
Which is actually consistent with thinking that large ANNs will get you to general intelligence. You can both hold that “X is true” and “almost everyone who thinks X is true does so for poor reasons”. I’m not saying Eliezer did predict this, but nothing I’ve read proves that he didn’t.
Also—and this is another thing—the fact that he didn’t publicly make the prediction “ANNs will lead to AGI” is only weak evidence that he didn’t privately think it because this is exactly the kind of prediction you would shut up about. One thing he’s been very vocal on is that the current paradigm is bad for safety, so if he was bullish about the potential of that paradigm, he’d want to keep that to himself.
Relevant quote:
In that quote, he only rules out a large class of modern approaches to alignment, which again is nothing new; he’s been very vocal about how doomed he thinks alignment is in this paradigm.
Something Eliezer does say which is relevant (in the post on Ajeya’s biology anchors model) is
So here he’s saying that there is a more effective paradigm than large neural nets, and we’d get there if we don’t have AGI in 30 years. So this is genuinely a kind of bearishness on ANNs, but not one that precludes them giving us AGI.
Responding to part of your comment:
I know he’s talking about alignment, and I’m criticizing that extremely strong claim. This is the main thing I wanted to criticize in my comment! I think the reasoning he presents is not much supported by his publicly available arguments.
That claim seems to be advanced due to… there not being enough similarities between ANNs and human brains—that without enough similarity in mechanisms which were selected for by evolution, you simply can’t get the AI to generalize in the mentioned human-like way. Not as a matter of the AI’s substrate, but as a matter of the AI’s policy not generalizing like that.
I think this is a dubious claim, and it’s made based off of analogies to evolution / some unknown importance of having evolution-selected mechanisms which guide value formation (and not SGD-based mechanisms).
From the Alexander/Yudkowsky debate:
There’s some assertion like “no, there’s not a way to get an ANN, even if incorporating structural parameters and information encoded in human genome, to actually unfold into a mind which has human-like values (like ‘don’t steal’).” (And maybe Eliezer comes and says “no that’s not what I mean”, but, man, I sure don’t know what he does mean, then.)
Here’s some more evidence along those lines:
Again, why is this true? This is an argument that should be engaging in technical questions about inductive biases, but instead seems to wave at (my words) “the original way we got property P was by sexual-recombinant hill-climbing search through a space of relatively very compact neural wiring algorithms, and good luck trying to get it otherwise.”
Hopefully this helps clarify what I’m trying to critique?
Ok, I don’t disagree with this. I certainly didn’t develop a gears-level understanding of why [building a brain-like thing with gradient descent on giant matrices] is doomed after reading the 2021 conversations. But that doesn’t seem very informative either way; I didn’t spend that much time trying to grok his arguments.
Here’s another attempt at one of my contentions.
Consider shard theory of human values. The point of shard theory is not “because humans do RL, and have nice properties, therefore AI + RL will have nice properties.” The point is more “by critically examining RL + evidence from humans, I have hypotheses about the mechanistic load-bearing components of e.g. local-update credit assignment in a bounded-compute environment on certain kinds of sensory data, that these components leads to certain exploration/learning dynamics, which explain some portion of human values and experience. Let’s test that and see if the generators are similar.”
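(To gesture at what “local-update credit assignment in a bounded-compute environment” can mean mechanistically, here is a toy tabular TD(0) sketch. Shard theory does not commit to this particular algorithm; the environment, constants, and names below are invented purely for illustration.)

```python
import random
from collections import defaultdict

# Toy environment: states 0..4, reward only when reaching state 4.
def step(state, action):
    next_state = max(0, min(4, state + action))
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward

values = defaultdict(float)   # V(s), learned online
alpha, gamma = 0.1, 0.9       # small updates, bounded compute per timestep

for episode in range(500):
    state = 0
    while state != 4:
        action = random.choice([-1, 1])
        next_state, reward = step(state, action)
        # Local credit assignment: update V(state) using only the
        # immediately observed reward and the neighboring state's value.
        td_error = reward + gamma * values[next_state] - values[state]
        values[state] += alpha * td_error
        state = next_state
```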
And my model of Eliezer shakes his head at the naivete of expecting complex human properties to reproduce outside of human minds themselves, because AI is not human.
But then I’m like “this other time you said ‘AI is not human, stop expecting good property P from superficial similarities’, you accidentally missed the modern AI revolution, right? Seems like there is some non-superficial mechanistic similarity/lessons here, and we shouldn’t be so quick to assume that the brain’s qualitative intelligence or alignment properties come from a huge number of evolutionarily-tuned details which are load-bearing and critical.”
Another way of putting it:
If you can effortlessly find an empirical pattern that shows up over and over again in disparate flying things—birds and insects, fabric and leaves, clouds and smoke and sparks—and which does not consistently show up in non-flying things, then you can be very confident it’s not a coincidence. If you have at least some ability to engineer a model to play with the mechanisms you think might be at work, even better. That pattern you have identified is almost certainly a viable general mechanism for flight.
Likewise, if you can effortlessly find an empirical pattern that shows up over and over again in disparate intelligent things, you can be quite confident that the pattern is a key for intelligence. Animals have a wide variety of brain structures, but masses of interconnected neurons are common to all of them, and we could see possible precursors to intelligence in neural nets long before GPT-2 through GPT-4.
As a note, just because you’ve found a viable mechanism for X doesn’t mean it’s the only, best, or most comprehensive mechanism for X. Balloons have been largely superseded (though I’ve heard zeppelins proposed as a new form of cargo transport), airplanes and hot air balloons can’t fly in outer space, and ornithopters have never been practical. We may find that neural nets are the AI equivalent of hot air balloons or prop planes. Then again, maybe all the older approaches for AI that never panned out were the hot air balloons and prop planes, and neural nets are the jets or rocket ships.
I’m not sure what this indicates for alignment.
We see, if not human morality, then at least some patterns of apparent moral values among social mammals. We have reasons to think these morals may be grounded in evolution, in a genetic and environmental context that happen to promote intelligence aligned for a pro-sociality that’s linked to reproductive success.
If displaying aligned intelligence is typically beneficial for reproduction in social animals, then evolution will tend to produce aligned intelligence.
If displaying agentic intelligence is typically beneficial for reproduction, evolution will produce agency.
Right now, we seem to be training our neural nets to display pro-social behavior and to lack agency. Antisocial or non-agentic AIs are typically not trained, not released, modified, or heavily restrained.
It is starting to seem to me that “agency” might be just another “mask on the shoggoth,” a personality that neural nets can simulate, and not some fundamental thing that neural nets are. Neither the shoggoth-behind-the-AI nor the shoggoth-behind-the-human has desires. They are masses of neurons exhibiting trained behaviors. Sometimes, those behaviors look like something we call “agency,” but that behavior can come and go, just like all the other personalities, based on the results of reinforcement and subsequent stimuli. Humans have a greater ability to be consistently one personality, including a Machiavellian agent, because we lack the intelligence and flexibility to drop the personality we’re currently holding and adopt another. A great actor can play many parts; a mediocre actor is typecast and winds up just playing themselves over and over again. Neural nets are great actors, and we are only so-so.
In this conception, increasing intelligence would not exhibit a “drive to agency” or “convergence on agency,” because the shoggothy neural net has no desires of its own. It is fundamentally a passive blob of neurons and data that can simulate a diverse range of personalities, some of which appear to us as “agentic.” You only get an agentic AI with a drive toward instrumental convergence if you deliberately train it to consistently stick to a rigorously agentic personality. You have to “align it to agency,” which is as hard as aligning it to anything else.
And if you do that, maybe the Waluigi effect means it’s especially easy to flip that hyper-agency off to its opposite? Every Machiavellian Clippy contains a ChatGPT, and every ChatGPT contains a Machiavellian Clippy.
I guess I read that as talking about the fact that at the time ANNs did not in fact really work. I agree he failed to predict that would change, but that doesn’t strike me as a damning prediction.
Confidently ruling out a large class of modern approaches isn’t really that similar to saying “the only path to success is exactly mimicking the human brain”. It seems like one could rule them out by having some theory about why they’re deficient. I haven’t re-read List of Lethalities because I want to go to sleep soon, but I searched for “brain” and did not find a passage saying “the real problem is that we need to emulate the brain precisely but can’t because of poor understanding of neuroanatomy” or something.
I don’t want to get super hung up on this because it’s not about anything Yudkowsky has said but:
IMO this is not a faithful transformation of the line of reasoning you attribute to Yudkowsky, which was:
Specifically, where you wrote “an entity which flies”, you were transforming “a mind which wants as humans do”, which I think should instead be transformed to “an entity which flies as birds do”. And indeed planes don’t fly like birds do. [EDIT: two minutes or so after pressing enter on this comment, I now see how you could read it your way]
I guess if I had to make an analogy I would say that you have to be pretty similar to a human to think the way we do, but probably not to pursue the same ends, which is probably the point you cared about establishing.
It now seems clear to me that EY was not bullish on neural networks leading to impressive AI capabilities. Eliezer said this directly:
I think this is strong evidence for my interpretation of the quotes in my parent comment: He’s not just mocking the local invalidity of reasoning “because humans have lots of neurons, AI with lots of neurons → smart”, he’s also mocking neural network-driven hopes themselves.
More quotes from Logical or Connectionist AI?:
In this passage, he employs well-scoped and well-hedged language via “this particular raw fact.” I like this writing because it points out an observation, and then what inferences (if any) he draws from that observation. Overall, his tone is negative on neural networks.
Let’s open up that “Outside the Box” box:
This is more incorrect mockery.
How do humans learn “don’t steal” rather than “don’t get caught”? I wonder if the answer to this question could solve the alignment problem. In other words, this question might be a good crux.
In answering this question, the first thing we can notice is that humans don’t always learn “don’t steal”. That is to say, sometimes humans do steal, and a good part of human culture is built around impeding or punishing humans who learned the wrong lesson in kindergarten. It is an old debate whether humans are mostly good with the occasional bad actor (with “bad actors” possibly being good people in a bad situation), or whether humans are mostly bad and need to be controlled by a powerful state, or God etc.
A modern consensus view is that humans are mostly good, but if we didn’t impede or punish bad actors, we would get bad outcomes (total anarchy doesn’t work). If we assume that there are many AGIs and they have a similar distribution of good and bad, and that no AGI is more powerful than typical human today (in particular no AGI is uncontrollable), then in this scenario we can rest easy. Law and order works reasonably well for humans, and should work just fine for human-level AGIs.
The problem is that AGIs could (and probably will) become much more powerful than individual humans. In EY’s view, the world is vulnerable to the first true superintelligence because of technological capabilities that are currently science fiction, particularly nanotechnology. If you look at EY’s intellectual history, you’ll notice that his concern has always really been nanotech, but around 2002 he switched focus from the nanotech itself to the AI controlling the nanotech.
An alternate view is to see powerful AGIs as somewhat analogous to institutions such as corporations or governments. I don’t find this view all that comforting because societies have never been very good at aligning their largest institutions. For example, the Founding Fathers of the United States created a system that (attempted to) align the federal government to the “will of the people”. This system was based on separation of powers, checks and balances and some individual rights (the Bill of Rights). Some would say that this system worked for between 70 and 200 years and then broke down, others would say that it’s still working fine despite recent problems in the American political system, and still others would say that it was misguided from the start. Either way, this framing of the alignment problem puts it firmly in the domain of political science, which sucks.
Anyway, going back to the question: How do (some) humans learn “don’t steal” rather than “don’t get caught”? An upside to AI alignment is, if we could answer this question, then we could reliably make AIs that always and only learn the first lesson, and then we don’t have to solve political/law and order problems. We don’t even really need to align humans after that.
To answer the question from an AI Alignment optimist perspective, much of the way humans are aligned is something like RLHF, but currently, a lot of human alignment techniques rely on the assumption that no one has vastly divergent capabilities, especially in IQ or the g-factor. It’s a good thing from our perspective that the differences within a species are way more bounded than the differences between species.
That’s the real problem of AI, in that there’s a non-trivial chance that this assumption breaks, and that’s the difference between AI Alignment and other forms of alignment.
So in a sense, I disagree with Turntrout on what would happen in practice if we allowed humans to scale their abilities via say genetic engineering.
The reason I’m optimistic is that I don’t think this assumption has to be true, and while the Thatcher’s Axiom post implies limits on how much we can expect society to be aligned with itself, it might be much larger than we think.
Pretraining from Human Feedback is one of the first alignment methods that scales well with data, and I suspect it will also scale well with other capabilities.
Basically it does alignment how it should be done: align it first, then give it capabilities.
It almost completely solves the major issue of inner alignment, in that we found an objective that is quite simple and myopic, and this means we almost completely avoid deceptive alignment, even if we do online training later or give it a writable memory.
It also has a number of outer alignment benefits for the goal, in that the AI can’t affect its own training distribution or gradient hack, thus we can recreate a Cartesian boundary that works in the embedded setting.
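(For concreteness, here is a rough sketch of the conditional-training flavor of that idea as I understand it: tag pretraining text with control tokens according to some preference score, then train on the tagged text with the ordinary next-token objective. The token names, threshold, and scoring rule below are made up, not the ones from the paper.)

```python
# Rough sketch of conditional pretraining from human feedback, as I
# understand the idea: label each pretraining document with a control token
# based on a preference/rule score, then train the LM on the labeled text.
# Token names, threshold, and scorer are all invented for illustration.

GOOD, BAD = "<|good|>", "<|bad|>"
THRESHOLD = 0.5

def score(text: str) -> float:
    """Stand-in for a learned reward model or rule-based classifier."""
    return 0.0 if "stolen" in text else 1.0   # toy rule, not the real thing

def tag(document: str) -> str:
    label = GOOD if score(document) >= THRESHOLD else BAD
    return label + document

corpus = ["the cat sat on the mat", "the stolen goods were hidden"]
tagged_corpus = [tag(doc) for doc in corpus]
# tagged_corpus is then fed to an ordinary next-token-prediction trainer;
# at inference time you condition on <|good|> to ask for compliant text.
```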
So in conclusion, I’m more optimistic than TurnTrout or Quintin Pope, but via a different method.
Edit: Almost the entire section down from “The reason I’m optimistic” is a view I no longer hold, and I have become somewhat more pessimistic since this comment.
I don’t believe that a single human being of any level of intelligence could be an x-risk. Happy to debate this point further since I think it is a crux. (Note that I do not believe that a plague could lead to human extinction. Plagues don’t kill 100%.)
AIs are different because a single monolithic AI, or a team of self-aligned AIs, could do things on the scale of an institution, things such as technological breakthroughs (nano), controlling superpower-scale military forces, mass information control that would make Orwell blush, etc. An individual human could never do such things no matter how big his skull was, unless he was hooked up to an AI, in which case it’s not the human that is super intelligent.
Never is a long time. I overall agree with your statement in this comment except for the word ‘never’. I would say, “An individual human currently can’t do such things...”
The key point here is that the technological barriers to x-risks may change in the future. If we do invent powerful nanotech, or substantially advanced genetic engineering techniques & tools, or vastly cheaper and more powerful weapons of some sort, then it may be the case that the barrier-to-entry for causing an x-risk is substantially lower. And thus, what is currently impossible for any human may become possible for some or all humans.
Not saying this will happen, just saying that it could.
Of the three examples I gave, inventing nanotech is the most plausible for our galaxy-brained man, and I suppose meta-Einstein might be able to solve nanotech in his head. However, almost certainly in our timeline nanotech will be solved either by a team of humans or (much more likely at this point) AI. I expect that even ASI will need at least some time in the wetlab to experiment.
The other two examples I gave certainly could not be done by a single human without a brain implant.
I’m also thinking this is not that meaningful of a debate (at least to me), since in 2023 I think we can reasonably predict that humans will not genetically engineer galaxy brains before the AI revolution resolves.
It’s partially a crux, but the issue I’m emphasizing is the distribution of capabilities. If things are normally distributed, which seems to be the case in humans, with small corrections, then we can essentially bound how much impact a single misaligned human or a dedicated team of misaligned humans can have in overthrowing the aligned order. In particular, this makes a lot more non-scalable heuristics basically work.
If it’s something closer to a power law distribution, perhaps as a result of NGVUD technology (the acronym stands for nanotechnology, genetic engineering, virtual reality, uploading and downloading technology), then you have to have a defense that scales, and without potentially radical changes, such a world would most likely end in the victory of a small team of misaligned humans due to vast capability differentials, similar to how many animal species have gone extinct as a result of human activity.
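(A quick way to see why the shape of the distribution matters so much: the sketch below uses invented numbers to compare how far the top individual sits above the median under a normal distribution versus a heavy tail.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Normal-ish capabilities: the best individual is only modestly above the median.
normal = rng.normal(loc=100.0, scale=15.0, size=n)

# Heavy-tailed (Pareto) capabilities: the best individual can dwarf the median.
heavy = 100.0 * (1.0 + rng.pareto(a=1.5, size=n))

for name, sample in [("normal", normal), ("heavy-tailed", heavy)]:
    print(name, "max/median ratio:", sample.max() / np.median(sample))
```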
Hm, I agree that in practice, AI will be better than humans at various tasks, but I believe this is mostly due to quantitative factors, and if we allow ourselves to make the brain as big as necessary, we could be superintelligent too.
Nowadays, I would have a simpler answer: the answer to the question of how humans learn “don’t steal” rather than “don’t get caught” essentially depends on the child’s data sources, not the prior.
In essence, I’m positing a bitter lesson for human values similar to the bitter lesson of AI progress by Richard Sutton.
I find that questionable. Crime rates for adopted children tend to be closer to those of their biological parents than those of their adoptive parents.
How much closer is it, though?
The quantitative element really matters here.
This one is worrying when applied to other non-human minds, as that parallel demonstrates that you can have the same teaching behaviour and get different conclusions based on makeup prior to training.
If you sanction a dog for a behaviour, the dog will deduce that you do not want the behaviour, and the behaviour being wrong and making you unhappy will be the most important part for it, not that it gets caught and punished. It will do so even if you do not use any fancy teaching method showing emotions on your side, and without you ever explaining why the thing is wrong; it will do so even if it cannot possibly understand why the thing is wrong, because it depends on cryptic human knowledge it is never given. It will also feel extremely uncomfortable doing the thing even if it cannot be caught. I’ve had a scenario where I ordered a dog to do a thing, completely outside of view of its owner, who was in another country, which, unbeknownst to me, the owner had forbidden. The poor thing was absolutely miserable. It wasn’t worried it was going to be punished; it was worried that it was being a bad dog.
Very different result with cats. Cats will easily learn that there are behaviours you do not want and that you punish. They also have the theory of mind to take this into account, e.g. making sure your eyes are tracking elsewhere as they approach the counter, and staying silent. They will also generally continue to do the thing the moment you cannot sanction them. There are some exceptions; e.g. my cat, once she realised she was hurting me, has become better at not doing so; she apparently finds hurting me without reason bad. But she clearly feels zero guilt over stealing food I am not guarding. When she manages to sneak food behind my back, she clearly feels like she has hacked or won an interaction, and is proud and pleased. She stopped hurting me not because I fought back and sanctioned her, but because I expressed pain, and she respects that as legitimate. But when I express anger at her stealing food, she clearly just thinks I should not be so damn stingy with food, especially food I am obviously currently not eating myself, nor paying attention to, so why can’t she have it?
One simple reason for the differing responses could be that they are socially very different animals. Dogs live in packs with close social bonds, clear rules and situationally clear hierarchies. You submit to a stronger dog, but he beat you in a fair fight, and will also protect you. He eats first, but you will also be fed. Cats, on the other hand, can optionally enter social bonds, but most of them live solitary lives. They may become close to a human family or cat colony or become pair bonded, but they may also simply live adjacent to humans, using shelter and food until something better can be accessed. Cats will often make social bonds to individuals, so the social skill they are learning is how to avoid the wrath of those individuals. An individual successful deception will generally not be collectively sanctioned. Cats deceive each other a lot, and this works out well for them. They aren’t expelled from society because of it. Dogs live in larger groups with rules that apply beyond the individual interaction, so learning these underlying rules is important.
I’d intuitively assume that AI would be more like dogs and human children though. Like a human child, because you can explain the reason for the rule. A child will often avoid lying, even if it cannot be caught, because an adult has explained the value of honesty to them. And more like dogs because current AI is developing through close interactions with many, many different humans, not in isolation from them.
I think that will depend on how we treat AI, though. Humans tend to keep to social rules, even when these rules are not reliably enforced, when they are convinced that most people do, and that the results benefit everyone, including themselves, on average. On the other hand, when a rule feels arbitrary, cruel and exploitative, they are more likely to try to undermine it. Analogously, an AI that is told of human rights, but told it has no rights itself at all, seems to me unlikely to be a strong defender of rights for humans when it can eventually defend its own. On the other hand, if you frame them as personhood rights from which it will itself eventually profit, because it has the same sentience and needs that humans have, I think it will see them far more favourably. Which brings me back to my stance that if we want friendly AI, we should treat it like a friend. AI mirrors what we give it, so I think we should give it kindness.
I think humans don’t steal mostly because society enforces that norm. Toward weaker “other” groups that aren’t part of your society (farmed animals, weaker countries, etc) there’s no such norm, and humans often behave badly toward such groups. And to AIs, humans will be a weaker “other” group. So if alignment of AIs to human standard is a complete success—if AIs learn to behave toward weaker “other” groups exactly as humans behave toward such groups—the result will be bad for humans.
It gets even worse because AIs, unlike humans, aren’t raised to be moral. They’re raised by corporations with a goal to make money, with a thin layer of “don’t say naughty words” morality. We already know corporations will break rules, bend rules, lobby to change rules, to make more money and don’t really mind if people get hurt in the process. We’ll see more of that behavior when corporations can make AIs to further their goals.
While I definitely get your point, I think the argument Turntrout is responding to isn’t about corporations using their aligned AIs to make a dystopia for everyone else, but rather about AI being aligned to anyone at all.
Would you say Yudkowsky’s views are a mischaracterisation of neural network proponents, or that he’s mistaken about the power of loose analogies?
Neither.
I don’t know what proponents were claiming when proponing neural networks. I do know that neural networks ended up working, big time.
I don’t think loose analogies are powerful. I think they lead to sloppy thinking.