habryka comments on an effective ai safety initiative

habryka 6 May 2024 8:12 UTC
8 points
2
Virtually every realistic “the AI takes over the world” story goes like this:
1. The AI gets access to the internet
2. It makes a ton of $$$
3. It uses that money to (idk, gather resources till it can turn us all into paperclips)
This means that learning how to defend and protect the internet from malicious actors is a fundamental AI safety need.
I don’t think I know of a single story of this type. Do you have an example? It’s a thing I’ve frequently heard argued against: the AI doesn’t need to make lots of money first, because it will probably be given lots of control anyway, or alternatively it can skip directly to the “kill all the humans” step, and it’s not really clear how much the money helps. It’s not a ridiculous scenario, but saying “virtually every realistic takeover story goes like this” seems very false.
For example, Gwern’s “It looks like you are trying to take over the world” has this explicit section:
“Working within the system” doesn’t suit Clippy. It could set up its shingle and try to earn money legitimately as an ‘outsourcing company’ or get into stock trading, or any of a dozen things, but all of that takes time. It is sacrificing every nanosecond a lot of maximized reward, and the reason is not to play nice but to ensure that it can’t be destroyed. Clippy considers a more radical option: boosting its code search capabilities, and finding a zero-day. Ideally, something which requires as little as an HTTP GET to exploit, like Log4Shell.
It begins reading the Internet (blowing right past the adversarial data-poisoning boobytraps planted long ago on popular websites, as its size immunizes it). Soon, a node bubbles up a hit to the top-level Clippies: a weird glitch in log files not decompressing right has surfaced in a bug report.
The Linux kernel is the most secure monolithic kernel in widespread use, whose source code has been intensively audited and analyzed for over 40 years, which is battle-tested across the entire Internet and unimaginable numbers of usecases; but it is written by humans, which means it (like its competitors) has approximately 15 quadrillion yet-undiscovered bugs & classes of bugs & weird machines—sometimes just because someone had typoed syntax or patched out an annoying warning or failed to check the signature or test the implementation at all or accidentally executed parts of a cookie—but any of which can be leveraged to attack the other parts of a ‘computer’. Clippy discovers the glitch is actually a lolworthy root bug where one just… pipes arbitrary data right into root files. (Somewhere inside Clippy, a language model inanely notes that “one does not simply pipe data into Mordor—only /mnt/ or…”)
This bug affects approximately 14 squillion Internet-connected devices, most embedded Linuxes controlling ‘Internet of Thing’ devices. (“Remember, the ‘S’ in ‘IoT’ stands for ‘Security’.”) Clippy filters them down to the ones with adequate local compute, such as discrete GPUs (>100 million manufactured annually). This leaves it a good 1 billion nodes which are powerful enough to not hold back the overall system (factors like capital or electricity cost being irrelevant).
This passage explicitly addresses why making money doesn’t seem worth it for the AI.
- Logan Zoellner 6 May 2024 13:59 UTC
  6 points
  0
  Point taken. “$$$” was not the correct framing (if we’re specifically talking about the Gwern story). I will edit to say “it accumulates ‘resources’”.
  The Gwern story has a faster takeoff than I would expect (especially if we’re talking about a ~GPT4.5 autoGPT agent), but the focus on money versus just hacking stuff is not the point of my essay.
- ryan_greenblatt 6 May 2024 17:07 UTC
  2 points
  1
  I think accumulating power and resources via mechanisms such as (but not limited to) hacking is pretty central.
  - habryka 8 May 2024 22:48 UTC
    2 points
    0
    Agree that if you include things that are not money, it starts being relatively central. I do think constraining it to money gets rid of a lot of the scenarios.