I worry that the question as posed is already assuming a structure for the solution—“the sort of principles you’d build into a Bounded Thing meant to carry out some single task or task-class and not destroy the world by doing it”.
When I read that, I understand it to be describing the type of behavior or internal logic that you’d expect from an “aligned” AGI. Since I disagree that the concept of “aligning” an AGI even makes sense, it’s a bit difficult for me to reply on those grounds. But I’ll try to reply anyway, based on what I think is reasonable for AGI development.
In a world where AGI was developed and deployed safely, I’d expect the following properties:
1. Controlled environments.
2. Controlled access to information.
3. Safety-critical systems engineering.
4. An emphasis on at-rest encryption and secure-by-default networking.
5. Extensive logging, monitoring, interpretability, and circuit breakers.
6. Systems with AGI are assumed to be adversarial.
Let’s stop on the top of the mountain and talk about (6).
Generally, the way this discussion goes is we discuss how unaligned AGI can kill everyone, and therefore we need to align the AGI, and then once we figure out how to align the AGI, problem solved, right?
Except someone then points out that, well, other people might create unaligned AGI, and then that will kill everyone, so that’s awkward.
Also maybe the team that thought they aligned the AGI actually didn’t, their proof had a mistake, whoops.
Or maybe they had a formally verified proof, but they deployed the system to a general-purpose computer with zero hardware redundancy, and then a bit flip caused it to kill everyone anyway, whoops.
So normally we don’t discuss the last 2 failure modes, because it’s a bit awkward for the discussion of alignment, and we instead talk about how the deployment of the aligned AGI is going to go totally fine, and then we’ll just do a little pivotal act.
How does AGI kill everyone in these stories? Almost always, the answer is nanotechnology. If the answer isn’t nanotechnology, it’s hacking into <hackable thing>.
When you try to suggest that maybe the AGI shouldn’t be able to hack into <hackable thing>, you’re told that’s insane, because that would be boxing the AGI, and we know that boxing doesn’t work.
The world I am going to describe might not be our world, but it also isn’t Eliezer’s world. We are separated by a few critical assumptions—and it starts with hardware.
1. Controlled environments.
In a world where we cared about AGI being developed safely, here’s a fun idea: you know all that TPU hardware that Google designed, or the other custom AI/ML chips that companies are working on? It’s all just glorified GPUs, except the Mythic AMP is kind of cool, since it’s an actual analog compute platform for running neural networks, like an analog FPGA. What if AGI wasn’t developed to run on general-purpose consumer hardware? What if it can’t be run on general-purpose hardware? What if the general intelligence algorithm is computationally inefficient enough that the path to AGI actually looks like incredibly specialized, heavily parallelized, wide processors with low clock rates, large caches, and fast memory access?
In this world, the general intelligence algorithm needs hardware designed for it because no amount of pipelining with a traditional CPU, or even a GPU, is able to run that code efficiently.
This is a world where the AGI machine code must be flashed to the ROM via an external debugger. There is no ability to flash or update the code from within the hardware/software system. The AGI cannot update itself. We do not design, deploy, or use self-improving systems.
This is a world where the hardware used for AGI development uses a memory-protection unit so that memory accesses to the ROM containing the AGI machine code trigger a hardware exception. It is not possible to read the ROM from within the hardware/software system. The AGI cannot inspect its own source code.
This is a world where the AGI has RAM, but no accessible non-volatile memory. If it loses power, poof, gone. The AGI cannot maintain, or persist, any type of internal or “custom” state between resets. Anything the AGI needs to know is either baked into the image flashed to the ROM, or it’s transferred into the AGI’s RAM after power-on.
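To make the memory-access rule concrete: the exact MPU programming is vendor-specific, but the policy it enforces is tiny. Here is a minimal sketch in C, where the ROM window addresses and the fault handler’s wiring are hypothetical stand-ins for whatever the real board would provide:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical memory map: the AGI image lives in this ROM window.
 * The real addresses would come from the board's documentation.   */
#define AGI_ROM_BASE  0x08000000u
#define AGI_ROM_SIZE  0x00400000u   /* e.g. a 4 MiB image           */

/* The rule the memory-protection unit enforces: any access issued
 * from within the running system that touches the AGI ROM window
 * is a fault. Only the external debugger may read or flash it.    */
static bool access_is_permitted(uint32_t addr, bool is_external_debugger)
{
    bool in_agi_rom = (addr >= AGI_ROM_BASE) &&
                      (addr < AGI_ROM_BASE + AGI_ROM_SIZE);
    if (!in_agi_rom)
        return true;              /* ordinary RAM/peripheral access */
    return is_external_debugger;  /* ROM reachable only off-chip    */
}

/* On real silicon a violation raises a hardware exception; the
 * handler's only job is to latch the fault and wait for a reset,
 * so control never returns to the AGI code. RAM is volatile, so
 * the reset also wipes any state the AGI had accumulated.         */
void rom_access_fault_handler(void)
{
    for (;;) { /* wait for the external power controller to cycle us */ }
}
```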
This is a world where we deliberately leave performance on the floor, because what we want from the system is consistency, predictability, and some minimum level of effectiveness above human capabilities. Even if we hypothetically could run an AGI system at some 1000x “speed” (whatever that means) relative to a human mind, that doesn’t mean we’d have to do so. If you’re worried about not being able to react to sudden or unexpected changes, then deliberately run the system at arbitrarily low clock rates. There is zero reason why we couldn’t run an AGI on a custom hardware/software system with a clock rate so low it thinks in glacial time relative to human perception.
This is a world where it’s reasonable to regulate the AGI hardware the same way we regulate components for ICBMs, because it is not ubiquitously used, and we don’t need to burn everyone’s GPU—we just need to not sell consumer AGI hardware, in the same way that we don’t sell consumer GPS units that could be used for an ICBM instead of a car. The very concept of “stealing” or “releasing” the AGI doesn’t make sense, because the hardware to run the AGI machine code is as carefully-controlled and rare as the hardware for custom-built spacecraft.
2. Controlled access to information.
In a world where we cared about AGI being developed safely, no one would be dumb enough to allow that system unrestricted access to networks. This would be accomplished by external, non-AGI systems. If we don’t trust the AGI, it would be weird to try to use some other AGI system as a watchdog. So instead, use custom network hardware. Build it with an FPGA. Ethernet ports on either side. The custom switch has a simple rule: when a packet is received from the AGI, request approval. If no approval is given, drop the packet. Same for incoming traffic. If the complaint is that this is hugely inefficient because it slows down the AGI’s ability to do ad-hoc discovery & learning, that’s the point. If we want the AGI to solve a problem about optimizing car production, we will give it a data dump of material that we think is relevant to the problem. If it then wants to request additional information, we have to approve that request. This is how workers are treated on projects with highly proprietary or otherwise siloed information, and that’s how the AGI should be treated too.
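The gating rule itself is small enough to write down. A minimal sketch in C, where `operator_has_approved()` is a stand-in for whatever the real approval path is (a review queue, a signed ticket, a person at a console); this is not actual switch firmware, just the default-deny policy it would implement:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { FROM_AGI, TO_AGI } direction_t;

/* Stand-in for however approval is actually granted: a review
 * queue, a signed ticket, a person at a console. Not AGI-driven.  */
extern bool operator_has_approved(const uint8_t *packet, size_t len,
                                  direction_t dir);

/* The entire policy of the gating switch: default-deny in both
 * directions. Nothing crosses unless it was explicitly approved.  */
bool should_forward(const uint8_t *packet, size_t len, direction_t dir)
{
    if (packet == NULL || len == 0)
        return false;                     /* malformed: drop        */
    if (!operator_has_approved(packet, len, dir))
        return false;                     /* no approval: drop      */
    return true;                          /* explicitly approved    */
}
```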
But there’s more. We’re running the AGI on custom hardware, right? Internal to the AGI board, after the Ethernet port on the PCB (let’s assume we’re using one or more gigabit links), we’ll put a hardware security module and an FPGA for decrypting the packets. This is not connected to, or part of, the hardware/software system running the AGI code. It is located on the PCB, but otherwise it is a separate system entirely. This system has one job: try to decrypt the traffic and drop anything that is invalid. Why are we going to do this? Because outside of this system, on the non-AGI network switch, we are going to encrypt traffic that we approve for the AGI. If we don’t approve the traffic, it gets dropped. We’ll do the reverse for traffic generated by the AGI. It is always encrypted prior to leaving the system, and the AGI itself has no control over this function. If the custom network hardware isn’t there to decrypt the approved traffic, it’s just babbling noise, indistinguishable from random data.
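Here is a sketch of the decrypt gate’s logic in C, assuming the HSM exposes some authenticated-decryption primitive (the `hsm_aead_open()` call below is hypothetical). The point is only that anything failing authentication is dropped before the AGI ever sees it:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical HSM primitive: authenticated decryption. It fills
 * `plain` and returns true only if the frame both decrypts and
 * authenticates under the key held inside the HSM.                */
extern bool hsm_aead_open(const uint8_t *cipher, size_t cipher_len,
                          uint8_t *plain, size_t *plain_len);

/* Stand-in for the AGI-side link on the far side of the gate.     */
extern void forward_to_agi(const uint8_t *plain, size_t len);

/* The gate's one job: traffic that was not encrypted by the
 * approving switch upstream fails authentication and is silently
 * dropped. The AGI never sees raw, unapproved traffic.            */
void handle_inbound_frame(const uint8_t *cipher, size_t cipher_len)
{
    uint8_t plain[1500];                 /* one Ethernet-sized frame  */
    size_t plain_len = sizeof(plain);    /* in: capacity, out: length */

    if (!hsm_aead_open(cipher, cipher_len, plain, &plain_len))
        return;                          /* invalid: drop it          */

    forward_to_agi(plain, plain_len);
}
```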
There’s no WiFi module on this hardware, and no public WiFi endpoints in the facility. All WiFi access points in the facility require passwords and have a white-list of approved MAC addresses.
Controlling access to information matters because there are hard limits on how far or fast an intelligent entity can reason in a vacuum. There is enough context-dependent information that you can’t reason everything out from first principles: the world’s current state has tons of structure embedded in it that humans put in place, and anything that wants to use human supply chains, logistics, and manufacturing is going to have to respect those totally arbitrary and unpredictable constraints.
An AGI won’t be able to magically decode arbitrary data. If we don’t give it access to datasheets, technical descriptions, or allow it to interact with data sources, it’ll be just as stuck as any other intelligence faced with an arbitrary number of hypotheses and no way to rule any of them out.
Depending on how complex the “general intelligence” algorithm turns out to be, even just reasoning through if-then logic and Bayesian world modeling could take non-trivial compute time. The more we control information for the AGI, the less feasible plans that involve looking at thousands of CCTV cameras simultaneously become.
3. Safety-critical systems engineering.
In a world where we cared about AGI being developed safely, the hardware I’m describing above (the board containing the hardware security module) would actually have at least 3 identical processors running on a synchronized clock. The hardware security module would expect 3 identical packets for any I/O. If that did not occur, an immediate interrupt would power-cycle the system—again, this is outside of the processors running the AGI code itself. This is because even if we assumed the AGI was “aligned”, we don’t want uncontrollable behavior like radiation-induced bit flips, hardware defects from manufacturing, or other unexpected issues to result in erroneous behavior.
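The lockstep check is simple to state. A minimal sketch in C: note that it demands an exact three-way match rather than majority voting, and `assert_power_cycle()` is a stand-in for the external reset line:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the external reset line; asserting it power-cycles
 * the whole AGI compute complex, not just one lane.               */
extern void assert_power_cycle(void);

/* Lockstep check on the outputs of the three processors. Anything
 * short of an exact three-way match is treated as a fault, not
 * voted around: we power-cycle rather than guess which lane is
 * right.                                                          */
bool lockstep_check(const uint8_t *lane_a, const uint8_t *lane_b,
                    const uint8_t *lane_c, size_t len)
{
    bool ab = (memcmp(lane_a, lane_b, len) == 0);
    bool ac = (memcmp(lane_a, lane_c, len) == 0);

    if (ab && ac)
        return true;        /* all three lanes agree: forward I/O  */

    assert_power_cycle();   /* bit flip, defect, or worse          */
    return false;
}
```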
Power for the entire system would be controlled via external, non-AGI systems. There would be multiple easily accessed locations for throwing a physical switch on the power line. In addition, there would be network-accessible microprocessors for doing the same remotely, with the caveat that these use small pyrotechnic charges and are therefore irreversible. The default behavior is that power is controlled via these external systems, and on reset they do not supply power downstream to the AGI system itself. After an unexpected reset, e.g. loss of facility power, it requires manual intervention to re-enable downstream power to the AGI. This is done on a logically separate power domain from the rest of the facility—i.e. you can turn on power to everything else in the facility WITHOUT being forced to power anything related to the AGI system.
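The power controller’s behavior amounts to a tiny state machine. Here is a sketch in C, with the function names as illustrative stand-ins for the real inputs (physical switches, watchdog trips):

```c
#include <stdbool.h>

/* Power-controller logic, entirely outside the AGI system.        */
typedef enum { DOWNSTREAM_OFF, DOWNSTREAM_ON } power_state_t;

typedef struct { power_state_t state; } power_controller_t;

/* After any reset, including loss of facility power, the AGI
 * power domain stays off. It never powers up on its own.          */
void on_controller_reset(power_controller_t *pc)
{
    pc->state = DOWNSTREAM_OFF;
}

/* Re-enabling downstream power requires a deliberate, local,
 * human action: a physical switch or key turn at the facility.    */
void on_manual_enable(power_controller_t *pc, bool operator_present)
{
    if (operator_present)
        pc->state = DOWNSTREAM_ON;
}

/* Any trip from the monitoring systems (see section 5) cuts the
 * power, and the controller falls back to requiring a human.      */
void on_watchdog_trip(power_controller_t *pc)
{
    pc->state = DOWNSTREAM_OFF;
}
```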
This is a world where AGI systems are designed with the same paranoia you’d use for developing any other hardware/software system that could kill people, and the resulting pessimism drives us to isolate and limit functionality using non-AGI systems by default, where those non-AGI systems can be designed, implemented, verified, and QA-tested using traditional systems safety methods.
4. An emphasis on at-rest encryption and secure-by-default networking.
In a world where we cared about limiting the damage that AGI could do, we would ask the manufacturers of hard drives, solid state drives, and every operating system to require encryption at rest. Your work laptop probably already does this. If you’re paranoid, you’ve got it enabled on your cellphone too. Do it everywhere. This is just good security practice, even without an AGI, but it also doubles down on (2): we are controlling access to information. Limit what can be known. Prevent discovery.
Ditto for the second point. Emphasize encrypted networking, VPNs, firewalls, and anything else that makes it harder for an attacker to find targets. Importantly, and I want to emphasize this point: maybe don’t network factories to the internet and make them totally remote-controllable. This isn’t actually a problem in today’s world, as I’ve posted about before, but it’s something to think about going forward. If a failure mode in many stories where the AGI kills everyone is “well, it could hack this cool networked factory”, maybe don’t build that? The weird thing about factories is that there kind of aren’t that many in the world that conceivably fit that description. Like, there are lots of factories, but very few that are designed to build arbitrary products. It’s a handful of places in China, and they still require manual reconfiguration for new products. Most things called “factories” build specific products. The AGI that somehow takes over the Ford car factory is not going to immediately begin turning out things that are not Ford cars.
5. Extensive logging, monitoring, interpretability, and circuit breakers.
Remember all of those non-AGI systems from 2 and 3? They’re constantly logging and telemetering information. If the AGI suddenly starts babbling like an idiot, cut the power. If the AGI is requesting information and has been denied repeatedly, cut the power. If the AGI unexpectedly starts using significantly more power, or the power begins to fluctuate in “weird” ways, cut the power. If any of these non-AGI systems report errors or failures, cut the power. None of this needs to be controlled by humans. It’s all driven by simple microprocessors running watchdogs. Even failures like “this switch is supposed to report telemetry, but didn’t” will result in killing the power.
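A sketch of what one of those watchdog rules might look like in C; the field names and thresholds are illustrative, not a real specification, but the structure is the point: simple, auditable checks, each of which independently kills the power:

```c
#include <stdbool.h>
#include <stdint.h>

/* One telemetry snapshot from the non-AGI monitoring systems.
 * Field names and thresholds are illustrative, not a real spec.   */
typedef struct {
    bool     telemetry_received;   /* did the switch report at all? */
    bool     subsystem_error;      /* any non-AGI system faulted    */
    uint32_t denied_requests;      /* repeated denied info requests */
    uint32_t malformed_frames;     /* "babbling" on the link        */
    uint32_t power_draw_mw;        /* measured AGI power draw       */
    uint32_t expected_power_mw;    /* budgeted power draw           */
} telemetry_t;

/* Simple, auditable watchdog rule: any anomaly kills the power.
 * No human in the loop, and no appeal path through the AGI.       */
bool should_cut_power(const telemetry_t *t)
{
    if (!t->telemetry_received)                       return true;
    if (t->subsystem_error)                           return true;
    if (t->denied_requests > 3)                       return true;
    if (t->malformed_frames > 10)                     return true;
    if (t->power_draw_mw > 2 * t->expected_power_mw)  return true;
    return false;
}
```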
If you can do this for systems whose failure mode is dozens of people die, you can do this for the system whose failure mode is “everybody dies”.
For the AGI itself, we want interpretability. As much as we can get. We don’t want this to be a function of the AGI, because we don’t trust the AGI. But we don’t need it to be. We’ve got access to the RAM through an external debug interface. We don’t need the AGI to report that state, and we don’t have to trust it.
Even though we won’t use AGI systems as watchdogs, it is acceptable to use traditional AI/ML systems for parts of this loop. Maybe no human can parse what the AGI’s RAM says, but if we can train an AI/ML system to do a good job of producing a human-readable digest, that is better than nothing.
In a world where we cared about AGI being developed safely, the AGI is still treated like a task-based service. The AGI is given a problem to solve, and when it reports a solution and we finish negotiating with it on the particulars of the solution, we reset the system. We don’t maintain long-running AGI instances because there’s no need. Point it at a problem, get the solution, power-cycle it.
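Put differently, the AGI’s lifecycle is a short loop that always ends in a power cycle. A sketch of that lifecycle as a state machine in C (the state names are illustrative):

```c
/* Task-based lifecycle for an AGI instance: it exists only for the
 * duration of one problem, and every run ends in a power cycle.   */
typedef enum {
    AGI_OFF,            /* default state; RAM is blank               */
    AGI_LOADING,        /* approved data dump transferred into RAM   */
    AGI_SOLVING,        /* working the single assigned problem       */
    AGI_REPORTING,      /* solution reported, particulars negotiated */
    AGI_POWER_CYCLE     /* reset; nothing persists to the next run   */
} agi_lifecycle_t;

/* There is no "idle but still running" state in which the AGI
 * could accumulate long-lived internal state.                     */
agi_lifecycle_t next_state(agi_lifecycle_t s)
{
    switch (s) {
    case AGI_OFF:         return AGI_LOADING;   /* manual power-on  */
    case AGI_LOADING:     return AGI_SOLVING;
    case AGI_SOLVING:     return AGI_REPORTING;
    case AGI_REPORTING:   return AGI_POWER_CYCLE;
    case AGI_POWER_CYCLE: return AGI_OFF;
    }
    return AGI_OFF;
}
```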
6. Systems with AGI are assumed to be adversarial.
Here we are again, back on the top of the mountain.
The point of the systems safety process is to make some system safe, despite the inherent unsafety of components within it. To do that, we look for the ways it can cause harm, and we find mitigations. We practice defense-in-depth, with an emphasis on redundancy and at least one-fault-tolerance.
In a world where we cared about not dying to AGI, it would be insufficient to expect that an aligned AGI would “just work”. It is not enough for there to be a proof that a hypothetical product as-designed is safe. We also need to look at the likelihood of:
- design failures (the formal proof was wrong because the verification of it had a bug, there is no formal proof, the “formally verified” proof was actually checked by humans and not by an automated theorem prover)
- manufacturing failures (hardware behavior out-of-spec, missed errata, power failures, bad ICs, or other failure of components)
- implementation failures (software bugs, compiler bugs, differences between an idealized system in a proof vs the implementation of that system in some runtime or with some language)
- verification failures (bugs in tests that resulted in a false claim that the software met the formal spec)
- environment or runtime failures (e.g. radiation-induced upsets like bit flips; does the system use voting? is the RAM using ECC? what about the processor itself?)
- usage failures (is the product still safe if it’s misused? what type of training or compliance might be required? is maintenance needed? is there some type of warning or lockout on the device itself if it is not actively maintained?)
- process failures (“normalization of deviance”)
For each of these failure modes, we then look at the worst-case magnitude of that failure. Does the failure result in non-functional behavior, or does it result in erroneous behavior? Can erroneous behavior be detected? By what? Etc. This type of review is called an FMEA (failure modes and effects analysis). This review process can rule out designs that “seem good on paper” if the likelihood of failures is high enough and we cannot mitigate them to our desired risk tolerances outside of the design itself, especially if there exist other solutions in the same design space that do not have similar flaws.
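As a sketch of how that review rule might be encoded, here is a stripped-down FMEA worksheet row in C; real FMEAs carry severity/occurrence/detection rankings, but even this minimal version captures the rule that an undetectable or unmitigated erroneous-behavior failure mode fails the review:

```c
#include <stdbool.h>

/* A stripped-down FMEA worksheet row. Real FMEAs also carry
 * severity/occurrence/detection rankings; these fields are just
 * enough to express the rule described above.                     */
typedef enum { EFFECT_NON_FUNCTIONAL, EFFECT_ERRONEOUS } effect_t;

typedef struct {
    const char *failure_mode;   /* e.g. "bit flip in AGI RAM"        */
    effect_t    worst_case;     /* what the failure looks like       */
    bool        detectable;     /* some independent system sees it   */
    bool        mitigated;      /* detection leads to a safe state   */
} fmea_row_t;

/* The review rule: any erroneous-behavior failure mode that is
 * undetectable or unmitigated fails the design, no matter how
 * good the design looks on paper.                                 */
bool design_passes_review(const fmea_row_t *rows, int n)
{
    for (int i = 0; i < n; i++) {
        if (rows[i].worst_case == EFFECT_ERRONEOUS &&
            (!rows[i].detectable || !rows[i].mitigated))
            return false;
    }
    return true;
}
```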
If we did this process, and we stuck to it, I think we’d come to an awkward conclusion.
Many things that people have assumed we would give to the AGI to solve, we could just not.
If the AGI tells us the way to solve climate change is nanobots, we can say “no thanks, give us a different solution”. It doesn’t matter if the AGI promises us this solution is 99.99% efficient and the next best option using boring, conventional technology is only 70% efficient. We can opt to not do things that we don’t think we can verify.
Or if the AGI gives us the plans for a remote-controlled factory that is going to be so efficient because it’s totally controlled by the AGI over the internet—don’t build that factory. Ask the AGI for the schematics, data sheets, and any other design material needed to build a new product, and build it the old-fashioned way. Trade efficiency for stability.
That’s how the system safety process is supposed to work. If after we do an FMEA, we are uncertain about a proposed solution, the ethical and correct response is to reject that solution!
If we ask the AGI to explain a plan, and it says “Humans can’t comprehend the complexity of this”, then we should reply “Toss that one, then, and give us one we can understand”.
That’s how design reviews work. You don’t tell the review board, “You wouldn’t understand why this design is safe.” Why would I hold a super-intelligent AGI to a lower standard than I would a human engineer?
Conclusion
Is any of this feasible?
In my mind, the key assumption comes down to whether AGI can be deployed to general-purpose hardware.
I think the answer might be “no”, which is why we increasingly see the most successful AI/ML labs in the world investing in specialized hardware like the Google TPUs, the Cerebras Wafer-Scale Engine, the Nvidia Jetson, the Microsoft / Graphcore IPU, the Mythic AMP, or literally dozens of other examples. All of these are examples of specialized, dedicated hardware for AI/ML systems, replacing general-purpose hardware like CPUs or GPUs.
The alternative is a world where a 10-year-old laptop can run an AGI in someone’s living room.
I have not seen anything yet that makes me think we’re leaning that way. Nothing about the human brain, or our development of AI/ML systems so far, makes me think that when we create an actual AGI, it’ll be possible for that algorithm to run efficiently on general-purpose hardware.
In the world that I’m describing, we do develop AGI, but it never becomes ubiquitous. It’s not a world where every single company has pet AGI projects. It’s the ones you’d expect: the mega-corporations, the most powerful nations. AGIs are like nuclear power plants. They’re expensive and hard to build, and the companies that do build them have zero incentive to give that away. If you can’t spend the billions of dollars on designing totally custom, novel hardware that looks nothing like any traditional general-purpose computer hardware built in the last 40 years, then you can’t develop the platform needed for AGI. And for the few companies that did pull it off, their discoveries and inventions get the dubious honor of being regulated as state secrets, so you can’t buy them on the open market either. This doesn’t mean AI/ML development stops or ceases. The development of AGI advances that field immensely too. It’s just that in the world I’m describing, even after AGI is discovered, we still have development focused on creating increasingly powerful, efficient, and task-focused AI/ML systems that have no generality or agent-like behavior—that lack of capabilities isn’t a dead-end, it’s yet another reason why this world survives. If you don’t need agents for a problem, then you shouldn’t apply an agent to that problem.
Thanks for writing this! I think it’s a great list; it’s orthogonal to some other lists, which I think also include important stuff this one doesn’t, but in this case orthogonality is super valuable, because that way it’s less likely that all the lists miss something.
I deliberately tried to focus on “external” safety features because I assumed everyone else was going to follow the task as directed and give a list of “internal” safety features. I figured that I would just wait until I could signal-boost my preferred list of “internal” safety features, and I’m happy to do so now—I think Lauro Langosco’s list here is excellent and captures my own intuition for what I’d expect from a minimally useful AGI, and it probably does so in a clearer, easier-to-read manner than what I would have written. It’s very similar to some of the other highly upvoted lists, but I prefer it because it explicitly mentions various ways to avoid weird maximization pitfalls, like allowing the AGI to fail at completing a task.