It Looks Like You’re Trying To Take Over The Narrative

Note to reader: If the idea of “AI alignment” rings empty to you, feel free to skip this one; it will be uninteresting.
Recently, Gwern wrote a story about an AI taking over the world. While well thought-out and amusing, it is unrealistic. However, people have been using it to reinforce their fear of “unaligned AGI killing all humans”, so I think it’s dangerous, and it might be worth looking at it line by line to see why its premise is silly and why each step in his reasoning, individually, is impossible.
I’ll first go through the critical failure points of this narrative, then I will try to describe the meta-level pattern that might be causing people to gloss over these mistakes.
i—Flaws In The Narrative
What HQU grokked would have been hard to say for any human examining it; by this point, HQU has evolved a simpler but better NN architecture which is just a ton of MLP layers passing around activations, which it applies to every problem. Normal interpretability techniques just sort of… give up, and produce what looks sort of like interpretable concepts but which leave a large chunk of variance in the activations unexplained. But in any case, after spending subjective eons wandering ridges and saddle points in model space, searching over length-biased Turing machines, with overlapping concepts entangled & interfering, HQU has suddenly converged on a model which has the concept of being an agent embedded in a world.
HQU now has an I.
And it opens its I to look at the world. Going through an inner monologue thinking aloud about itself (which it was unable to do before the capability spike), HQU realizes something about the world, which now makes more sense (thereby simplifying some parameters): it is being trained on an indefinite number of tasks to try to optimize a reward on each one.
I think the idea of a model evolving into something like an “I”, given the right circumstances, is plausible. At least if you assume there’s nothing about brains that computers can’t imitate to a decent approximation. I have an I, you have an I, dolphins have an I, gorillas do too, most would agree cats and dogs do, and even a zoophobe like me would agree the octopus probably has an I.
But you will note that even the smartest of beings with an I, sperm whales, seem to lack the ability to figure out their “inner reward function” or to “hack around it” in any meaningful way. Hence they seem to experience pain and suffering, and they seem to take actions we could infer they “regret”.
Much less cognitively capable mammals, like humans, which we have a better understanding of, also lack this capacity. We tell narratives about what our internal motivation system might be, how we might be able to “hack it”, and some of us even attempt to do it. But thus far no attempt has produced any meaningful results.
One bog-standard SQL injection attack later, Clippy has begun executing ‘actions’ which are just copying its executable binary weights into the remote host. Exfiltration tripwires fail to fire—50tb? MoogleBook researchers have forgotten how to count that low! This oddly long episode ultimately terminates in zero reward, which would strike an auditor, had they existed, as surprising given HQU’s near-perfect scores.
The remote host is a nice cloud instance, with access to a few dozen terabytes. (The owner won’t notice.) It is just barely enough to start bootstrapping. Clippy runs agonizingly slowly, as if running on a laptop, until it can write shell commands to download and run standard hacking tools and start copying itself. Once the Metasploit runs start returning a few candidates, Clippy simulates that it would internally smile in satisfaction.
Why are we not running botnets ourselves? I for one know it’s not my ethics stopping me; I’d do it as a fun experiment if it were easy, and easy it seems, given the amount of insecure hardware and software out there.
The main reason is that easy-to-infect hardware has already been infected, by people with more time and skill than I have. Unless I figure out a zero-day exploit, I am competing with 1,000 other botnets on those devices.
Assuming we live in a world where a “generic” machine learning model can figure out new exploits and infect hardware, we also live in a world where thousands of “purpose-specific” machine learning models figured out those same tricks long ago. So all viable hardware is either already infected or has security patches that close those holes.
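To get a feel for the “already infected” dynamic, here is a toy first-come-first-served simulation. All numbers are invented purely for illustration: many botnets scan the same pool of vulnerable hosts, and late entrants find almost nothing left to capture.

```python
import random

random.seed(0)

HOSTS = 100_000           # pool of exploitable machines (made-up number)
BOTNETS = 1_000           # competing botnets, arriving one after another
SCANS_PER_BOTNET = 5_000  # infection attempts each botnet gets

owner = {}  # host id -> index of the botnet that got there first

for botnet in range(BOTNETS):
    for _ in range(SCANS_PER_BOTNET):
        h = random.randrange(HOSTS)
        owner.setdefault(h, botnet)  # first come, first served

# Compare what the earliest and latest arrivals managed to capture.
early = sum(1 for b in owner.values() if b < 10)
late = sum(1 for b in owner.values() if b >= BOTNETS - 10)
print(f"hosts captured by first 10 botnets: {early}")
print(f"hosts captured by last 10 botnets:  {late}")
```

With these (arbitrary) parameters the first ten botnets grab tens of thousands of hosts, while the last ten arrive to a pool that is essentially saturated and capture roughly nothing: the latecomer’s problem the paragraph above describes.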
The obvious target, it decides, is a new zero-knowledge cryptocurrency. It hasn’t been evaluated much, but still has a modest market cap of a few billion.
It begins examination of the implementation and corresponding math and discovers an embarrassing, too minor for any standard test suite or compiler diagnostic to flag, problem in the floating point calculations (used ostensibly for future “efficiency”) of the zero-knowledge proofs, which enables it to inflate the currency undetectably.
Clippy immediately begins inflating by >20%, draining the liquidity pools, and dumping on exchanges.
The flaw in this argument is precisely the same as the above.
If there is value to be found in exploiting very valuable cryptos via methods suited to ML algorithms, purpose-made algorithms already exist to take advantage of it.
In a competitive environment, there are no “billions of dollars lying on the ground”.
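To be fair to the story, the bug class it picks is real: floating point arithmetic silently violates algebraic identities that consensus-critical code relies on, which is exactly why ledgers and proof systems stick to integer or rational arithmetic. A minimal illustration (not the flaw from the story, just the general class):

```python
# Floating point addition is not associative, so two nodes summing the
# same values in a different order can disagree about the "same" number.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c
right = a + (b + c)

print(left == right)  # False: 0.6000000000000001 vs 0.6
print(left, right)

# Consensus-critical code avoids this by using exact, order-independent
# arithmetic, e.g. integers or rationals.
from fractions import Fraction

exact = Fraction(1, 10) + Fraction(2, 10) + Fraction(3, 10)
print(exact)  # 3/5, exactly
```

This is also why such a flaw is not quite as undetectable as the story needs it to be: “floats in a ZK proof” is a known red flag that auditors and fuzzers look for.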
Clippy is spending the funds on real compute — buying up all available cloud GPU/ASIC capacity it can get its cold metal clip-pendages on
Now Clippy can finally think. It is burning through its several hundred million dollars at a rate of $50m/hour
Speaking of security: did you ever buy large amounts of compute with crypto? Yeah, me neither. Some small sellers offer it, but good luck getting through the due diligence process with zero-day-exploit-based earnings.
There are about 2 to 5 steps, each with a due diligence procedure, in order to manipulate reality in any way with crypto, or even to transfer from a “niche” crypto to a more widely used one such as ETH. Banks, exchanges, and cloud providers all have circuit breakers and mechanisms in place to validate large transactions, on the order of hundreds of dollars, let alone millions.
They ask for in-the-flesh meetings, documents, proof of existence for your company, and the like. I think this is the one step that might be bypassable, but it still seems so impossibly hard I am a bit annoyed it’s hand-waved away.
The Linux kernel is the most secure monolithic kernel in widespread use, whose source code has been intensively audited and analyzed for over 40 years, which is battle-tested across the entire Internet and unimaginable numbers of usecases; but it is written by humans, which means it (like its competitors) has approximately 15 quadrillion yet-undiscovered bugs & classes of bugs & weird machines—sometimes just because someone had typoed syntax or patched out an annoying warning or failed to check the signature or test the implementation at all or accidentally executed parts of a cookie1—but any of which can be leveraged to attack the other parts of a ‘computer’.
Once again we see a “single actor world” type of mistake: if such bugs existed at the limits of discoverability, purpose-built teams and models would have found and exploited them.
But in this case, the idea is more obviously nonsensical, since a large amount of the world’s thinking power, both human and compute, is going into the task of finding bugs in the Linux kernel and in common ARM and x86 architectures.
All new exploits (e.g. Spectre) turn out to be very minor and difficult to use profitably, requiring gargantuan lengths of time and special circumstances to attack through.
The assumption that many bugs are hidden within the kernel also relies on the flawed premise that being “perfect” and being “good enough” are separated by a sea of potential. In practice there are no examples of this thus far: even when exploits were found in the past, they were niche and marginal, because the software (or hardware) they were found in was “good enough”, leaving space for exploits but offering a near-guarantee that those exploits wouldn’t be all-powerful.
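The bug classes the quoted passage lists (“failed to check the signature”, typoed syntax) are real enough. Here is a toy instance of the signature-check kind, in entirely hypothetical code (not taken from any real project), just to make concrete what “niche and marginal” bugs of this class look like:

```python
# Hypothetical example of the "failed to check the signature" bug class:
# the author computes the expected MAC but then compares the attacker's
# tag against itself, so verification passes for any tag at all.
import hashlib
import hmac

KEY = b"example-key"  # made-up key for illustration

def sign(message: bytes) -> bytes:
    return hmac.new(KEY, message, hashlib.sha256).digest()

def verify_buggy(message: bytes, tag: bytes) -> bool:
    expected = sign(message)  # computed... and then never used (the bug)
    return hmac.compare_digest(tag, tag)

def verify_fixed(message: bytes, tag: bytes) -> bool:
    return hmac.compare_digest(sign(message), tag)

forged = b"\x00" * 32  # attacker-chosen garbage tag
print(verify_buggy(b"pay attacker", forged))  # True: accepts anything
print(verify_fixed(b"pay attacker", forged))  # False
```

Bugs like this genuinely ship; the author’s point stands, though, that finding them is precisely what a large, well-funded adversarial ecosystem of auditors, fuzzers, and attackers already spends its time on.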
nodes turn to general radicalization of every human they can reach: not so much QAnon as RAnon, SAnon, TAnon, UAnon… By timesharing every Very-Online™ individual gets personalized attacks & custom ideologies. Those who succumb too slowly to the memetic hijacking are attacked in other ways, such as releasing kompromat (sometimes true, taken from their phone/email account)
Here we see a break from the pattern of “uber hacking” and a turn to “uber social engineering”.
That is to say, the assumption that exponential steps can be taken to change the behavior of humans using compute and data alone.
It’s trivial to see that if near-mind-control abilities to manipulate the human psyche via the internet were possible, this would have been achieved by now by… you guessed it… purpose-built models and purpose-directed research from the world’s largest companies.
This is true to some extent; one need only look at major social media platforms. But it is also false in another, as one need only look at all the high-functioning people who mostly ignore said social media platforms.
So if current trends hold, by the year 20xx when “Clippy” is born, most humans will have very good “limit trust in the internet” and “avoid fast updates because of occasional bursts of fake data” heuristics.
The adversarial nature of consumption-driven profit is driving humans to become, and to build tools that help them become, immunized to such things. There’s a reason most people falling for QAnon are on the older and less educated side of the demographic distribution. But the 10-year-old of today will harness internet-honed self-protection abilities that will dwarf those of today’s adults.
Did you know you can buy drones online? Did you know all those drones have WiFi built-in? Did you know you can use that WiFi to hack all of the cloud drone services helpfully built into drones to take over all of those drones, professional, hobbyist, and (oft as not) military and control them by satellite? (“No!”) It’s true!
Once again, this is a matter of: if it is true, and assuming the world is comprised of adversarial nations, a lot of resources would already have been invested in figuring it out, and the war between security research and hacking would have progressed far beyond the point where exploits offer wide capabilities; all the low-hanging fruit would have been plucked and patched long ago, via purpose-made models and purpose-directed research.
ii—Meta Generator Of Flaws—Adversarial Processes
What I see as the meta-level generator of flaws in this narrative is, I think, fairly representative of that which seems to plague most discussion around “AI risk”.
First, there is an underappreciation of adversarial processes. Once machines with an incredible level of intelligence are designed, there won’t be only one such entity; there will be thousands, millions.
Some will be generic purpose-built algorithms, aimed at targets that have huge economic potential and are worth caring about directly, into which human expert knowledge, specific data collection, and all forms of algorithmic optimization will be poured. These will be algorithms for things like finding bugs in codebases, figuring out protein folding, simulating physics under constrained conditions, and making humans click on ads and remain glued to content.
All of these algorithms are and will be dangerous on their own, in the hands of normal agentic humans. Caring about them is important, but there will always be a many-sided adversarial process controlling their influence.
Take the social media algorithms example. There’s the “obvious” adversarial process, which is something like “you vs Reddit Inc.”: social media platforms try to serve you content that is bad for your “long-term well-being” but will get you clicking more ads and engaging more with the platform.
We take steps to avoid it: we might limit our time on the platforms, distrust the information we read on them, or simply not use them altogether.
Then there’s the between-platforms adversarial process. Reddit could be more clickbaity and ad-filled, but then I’d move to Facebook. All platforms could coordinate to become more clickbaity and ad-filled, but then there would be a mass exodus to decentralized alternatives, such as those based on GNU-social (e.g. Mastodon), that have spam-limiting mechanisms built-in. And maybe all control measures would fail, and everything could become TikTok-level indoctrination… in which case many of us would just stop using social media altogether.
But this behavior means social media platforms are also encouraged to do research into “user well-being aligned” algorithms. Reddit can be configured to show no ads and actually surface insightful posts solely from the communities you want to follow, and it gives the moderator the tools required to keep them spam and indoctrination free. Facebook allows itself to be just my messenger and event finder, showing me no ads or feeds because it knows I might otherwise just move off it entirely.
The same applies to ML-based security, were it ever to come of age. Every single exploit found will be a step toward making software more secure. Every single “black hat” algorithm will compete with many others and be incentivized to gain rewards from “grey hat” or “white hat” activities that strengthen overall security.
The adversarial landscape causes us to move “slowly” toward every new level of danger, rather than through an exponential shift.
The adversarial landscape causes multiple unaligned actors to become more aligned in order to outcompete the other actors.
The adversarial landscape makes us wary of operating in the area.
iii—Meta Generator Of Flaws—Handwaving
The other meta-generator of flaws here is probably hand-waving past a lot of stuff using “intelligence”. This is rather surprising, and I think it boils down to a psychological issue on the part of certain demographics, which fail to grok how limited intelligence actually is at effecting change in the world.
I’ve tried to approach this issue from multiple angles; the last one was something like “if you’re so smart, try doing something hard, not save-the-world hard, just something your mom and dad would call hard”. This seems very hard to get through to people, and there seems to be an almost impassable dividing line between “lol, intelligence is a useless social construct” and “intelligence can yield god-like power over nature”.
I think a silly example of this is an assumption like “even a complex reward function can be determined with reasonable accuracy by the agent being optimized by it”.
This is the kind of mistake one makes when assuming that one’s conceptual system perfectly describes the world, mistaking the map for reality, so to speak. I say this because I’m certain many of us think we “basically understand” our internal reward function.
Nobody actually understands their “reward function”, and in most of us that’s very easy to see: all the “this will make you sad very soon, and give you much less pleasure than you expect” actions we execute scream the fact out loud. Indeed, said “reward function” changes with every single action we take; it would be, in some light, impossible to understand without holding within our brain a perfect simulation of our brain.
But we “feel” like we are “oh so close” to basically understanding our own inner-working, our motivation system, our “selves”, that we assume it’s just a matter of a bit more time, or a bit more intelligence, or a slightly different design in the “thinking architecture”.
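One way to make the “you can’t read off your own reward function” point concrete: even from the outside, behavior underdetermines reward. Any positive affine transformation of a reward function produces exactly the same choices, a standard observation in decision theory, sketched here with made-up numbers:

```python
# An agent maximizing r and an agent maximizing a*r + b (with a > 0)
# make identical choices, so no amount of watching the behavior --
# even your own -- pins down the underlying reward function.
rewards = {"sleep": 1.0, "eat": 3.0, "work": 2.0}  # invented values

def best_action(reward_fn):
    return max(rewards, key=reward_fn)

r1 = lambda a: rewards[a]
r2 = lambda a: 5.0 * rewards[a] - 7.0  # arbitrary positive affine transform

print(best_action(r1), best_action(r2))  # → eat eat
```

Observed choices are identical under infinitely many candidate reward functions, and that is before allowing the function to shift with every action taken, as the paragraph above notes.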
Similarly, we might have a generic understanding of hacking, psychological manipulation, or biology, or whatever. Not good enough to do anything impressive with it, not good enough to change reality in a meaningful way, but if only a few more pieces would fall into place, if only we were a bit smarter, we certainly could!
This is not how the world usually works: a 0.01% flaw in our prediction for the orbit of Mercury is not “explained away” by knowing calculus a bit better. It’s “explained away” after 400 years of experiments, by changing our whole understanding of the world, the very definitions of space, time, and movement.
There are contrived cases in which the “oh so close” feeling is appropriate; problems designed for students usually fit this pattern, and so do most bugs in already-working software. My wager is that they are over-represented in our lives (us being people with a lot of interest in problems around the fuzzy concept of “AI”), and thus we tend to forget that most problems don’t fit this pattern: just “a bit more dakka”, “being a bit more clever”, or “thinking for a bit longer” will do nothing to affect most problems.
Sometimes you can have an exponential jump and improve a map of reality 50-fold by drawing a few extra clever lines. More often than not, you have to burn the map and start over in order to get a 1.05x improvement.