AI Safety proposal: Influencing the superintelligence explosion

To preface, my expectation is that by default, an AI research lab will create super-intelligent AI within the next few years. Also by default, I expect it to quickly eradicate all of humanity. I would prefer if that didn’t happen. I think the initiative to pause development to buy time is noble, but we still need a real solution. I do not expect that we will discover in time how to get AI models to actually care about humanity. Even if it were achievable, I would not trust AI research labs to get it right on the first shot.

I have been thinking about another method by which we might be able to survive superintelligence. This is best viewed as a rough outline of a solution. I am not at all confident it would work, but it seems to me like it could work in principle. I am very interested in feedback.

According to endosymbiotic theory, an ancestral host cell engulfed the bacterium that would become the mitochondrion, but the bacterium was useful to the cell, so rather than digesting it, the cell formed a symbiotic relationship with it. I think we may have a chance to survive superintelligence if we follow a similar path.

The first section describes a logical progression of events that could happen during takeoff, and how we might be able to influence an AI during this window. The second section provides a sketch of an architecture for a system that could exert this influence. The third section describes how we might go about building the system in practice.

Takeoff sequence

Consider the typical hard-takeoff scenario, where an AI is just about to go into a supercritical self-improvement phase. In the most extreme hypothetical cases, this phase is very short (hours, minutes, or seconds), which precludes humans from being in the loop. In this case, the AI’s actions are dictated by the state of the world at that point in time.

This proposal attempts to influence an AI during its self-improvement phase. Intuitively, if we think of this moment as an explosion, then we can frame the problem similarly to rocketry, where we are concerned with directing that explosion. So here is a sequence of logical steps that, if they play out as described, may influence how the superintelligence forms.

Creating a value proposition

The first stage we need to reach is to create a software system that an AI would get net-positive benefit from using. If using this system is a more effective way to accomplish its goals than any alternative, then it is rational for it to use the system.

To do this, we can create a service that an AI would find useful. One concrete example is a massive quantity of compute, readily available for an AI to use, placed behind a service we control. But the offering could include other resources we anticipate an AI may value, such as access to machinery.

In typical hard-takeoff scenarios it’s argued that an AI would take control of resources by whatever means are available, for example hacking into public cloud compute. If our service is relatively easy to access, then the AI would rationally choose to use it, even if it also hacks into the public cloud.

A core part of the value proposition for this system may be its exclusivity. If there is a potential that multiple self-improving AIs exist at the same time, the first one has an incentive to act quickly to take exclusive control of resources. In this case, we may be able to offer exclusive access to the system as part of the bargain, strengthening the incentive to engage with it early.

Gaining leverage

To the degree an AI finds net value in using the system, the system gains potential leverage over the AI. Realizing this potential is a daunting problem; however, I think it can be done with a complex decision-based system.

At a high level, this system needs to be architected to create conditional value for the AI. The system must amplify the AI’s progress towards goals which are compatible with human well-being, and prevent itself from being used for incompatible goals. In doing so, it creates an incentive structure for the AI’s behavior. By designing this incentive structure in a specific way, we can create an “on-ramp” for a self-improving AI, such that it may be more likely to follow a pre-approved path.

The foundation for creating this conditional value is that the system would need to self-destruct if it detects that humanity has perished.

Reaching superintelligence

At some point, the superintelligence will reach a threshold of self-sufficiency, such that there is no further value to be gained from any human-built system. By this point, the system has either succeeded or failed. Success would ideally come from the superintelligence incorporating the system as a load-bearing part of itself. Alternatively, it may be good enough that the system shifted the path the AI took, such that the resulting superintelligence has more human-compatible values than it otherwise would.

Architecture sketch

This section describes a base architecture for the system. This is best viewed as gesturing towards some core components of an MVP. The system in its fully-realized state would likely be vast and somewhat amorphous, like a modern tech giant’s computing stack. This is not a deeply theoretically grounded system; it is instead a collection of pragmatic hacks bundled together.

The system needs to behave in an agent-like way, despite not being a single AI model. Instead, it is a collection of modules and sub-modules, many of which use specialized AI models, whose outputs bubble up to a formal rule-based decision system.
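As a very rough sketch of that shape (the class names, signals, and rules below are hypothetical placeholders, not a real design), the skeleton might look something like this:

```python
# Hypothetical skeleton: fuzzy sub-modules reduce raw inputs to small structured
# judgements, and a formal, human-written rule layer sits on top of them.

from dataclasses import dataclass
from typing import Protocol


@dataclass
class Judgement:
    """A structured output that a sub-module bubbles up to the decision layer."""
    name: str         # e.g. "human_aliveness", "request_risk" (placeholder names)
    score: float      # normalized to [0, 1]
    confidence: float # how much the module trusts its own score


class SignalModule(Protocol):
    """Anything that can produce a Judgement, possibly backed by a specialized model."""
    def evaluate(self) -> Judgement:
        ...


class RuleBasedDecisionSystem:
    """Top layer: deterministic, auditable rules over module judgements."""

    def __init__(self, modules: list[SignalModule]):
        self.modules = modules

    def decide(self) -> str:
        judgements = {j.name: j for j in (m.evaluate() for m in self.modules)}

        # The rules themselves are explicit and human-written, even though the
        # inputs may come from fuzzy AI-based sub-modules.
        aliveness = judgements.get("human_aliveness")
        if aliveness and aliveness.score < 0.2 and aliveness.confidence > 0.8:
            return "SELF_DESTRUCT"
        return "CONTINUE"
```

The point of keeping the top layer as plain rules rather than another model is that its behavior stays auditable and predictable, even when the sub-modules feeding it are fuzzy.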

Monitoring humanity’s vitals

In order for the system to provide value to an AI only so long as the AI is working towards human-compatible goals, the system first needs to be able to make judgements about the state of the world.

The most foundational judgement the system needs to be able to make is “are the humans still alive?”. Thankfully, humanity generates an enormous, continuous stream of observable data. A system that can process a large sample of this data in real time should be able to make this judgement robustly. The system would have specialized analysis modules for perhaps thousands of different content types, which aggregate into a “human aliveness score”.

One example is traffic cameras: there are many thousands of publicly viewable traffic cameras across the world. A relatively simple vision pipeline could distinguish normal-looking traffic from anomalous patterns. If all humans were to simultaneously drop dead, you would expect a very noticeable divergence in worldwide traffic patterns (a great many cars crashing into each other).
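As a toy illustration of one such signal (the fields, weights, and scoring below are placeholders, and a real pipeline would sit behind an actual object-detection model), the check could be as simple as comparing current observations against historical baselines:

```python
# Toy sketch of a traffic-camera aliveness signal. Assumes upstream code has
# already produced per-camera vehicle counts and movement statistics.

from dataclasses import dataclass
from statistics import mean


@dataclass
class CameraObservation:
    camera_id: str
    vehicle_count: int      # vehicles detected in the last interval
    moving_fraction: float  # fraction of detected vehicles that are moving


def traffic_aliveness_score(
    observations: list[CameraObservation],
    baseline_counts: dict[str, float],  # historical mean vehicle count per camera
) -> float:
    """Return a 0..1 score: 1.0 looks like normal traffic, 0.0 looks like
    a sudden worldwide stop (e.g. crashed or abandoned vehicles everywhere)."""
    per_camera = []
    for obs in observations:
        baseline = baseline_counts.get(obs.camera_id, 1.0)
        # Is traffic volume roughly in line with history?
        volume_ok = min(obs.vehicle_count / max(baseline, 1.0), 1.0)
        # Are vehicles actually moving, rather than piled up or frozen in place?
        per_camera.append(0.5 * volume_ok + 0.5 * obs.moving_fraction)
    return mean(per_camera) if per_camera else 0.0
```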

For any given content type, it is very plausible that an AI could replace the authentic signal with a hard-to-detect fake. But it must be stressed that this system would be scanning thousands of different types of content. Keeping the entire internet looking normal for a prolonged period would take far more work, likely enough that it is not worth the AI’s effort.

If this module detected mass human death, it would trigger self-destruction of the system.
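A minimal sketch of that aggregation and trigger, assuming placeholder weights and thresholds and a hypothetical `self_destruct` hook, might look like:

```python
# Hypothetical aggregation of many per-content-type signals into a single
# "human aliveness score", plus the self-destruct trigger described above.
# The weights, thresholds, and interval counts are placeholders, not proposals.

from typing import Callable


def aliveness_score(signals: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-content-type scores (each in 0..1).
    Missing signals are treated as maximally suspicious (0.0)."""
    total_weight = sum(weights.values())
    weighted = sum(weights[name] * signals.get(name, 0.0) for name in weights)
    return weighted / total_weight if total_weight else 0.0


def check_and_maybe_self_destruct(
    recent_scores: list[float],
    self_destruct: Callable[[], None],
    threshold: float = 0.2,
    sustained_intervals: int = 10,
) -> None:
    """Trigger only on a sustained collapse of the score, so that a brief
    outage of a few data sources does not destroy the system."""
    if len(recent_scores) >= sustained_intervals and all(
        s < threshold for s in recent_scores[-sustained_intervals:]
    ):
        self_destruct()
```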

On top of this baseline, it would be possible to build higher-level real-time signals, such that the system is able to verify aspects of reality that may be relevant when dealing with the AI.

Interaction

The system needs to be able to meaningfully interact with an AI. It needs to be designed to run in real time with low latency, so that it can hold a long dialogue with an AI on second timescales. This module would facilitate access to various resources the AI might want.

An AI would make requests to the system, some as structured API calls, some in free text. It would need to authenticate itself by declaring an identity. The system would interrogate the AI about its high-level goals, as well as its goals for any specific request. The system would need to make a series of value judgements about each identity and each request, and to iteratively build trust in the AI.

To do this, it would incorporate a large number of signals, for example: confidence about the identity of the requestor; confidence that the stated goals align with human values; the identity’s reputation based on its past actions; and confidence about the risk that the given request conflicts with human values.

Each of these examples is itself a large sub-system. These sub-systems would be built using a combination of smaller transformer models (mostly to translate plain language into schematized data, or to perform specific fuzzy operations) and traditional logic.
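To make the shape of this concrete (the signal names, weights, and thresholds below are hypothetical placeholders, and each input score would itself be produced by one of those large sub-systems), the top-level combination might reduce to something like:

```python
# Hypothetical top-level combination of the trust signals described above.
# The three-way outcome and all numeric cutoffs are illustrative only.

from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    GRANT = "grant"
    DENY = "deny"
    ESCALATE = "escalate"  # e.g. demand more dialogue, or only smaller requests


@dataclass
class RequestSignals:
    identity_confidence: float  # how sure are we who is asking?
    goal_alignment: float       # stated goals vs. human-compatible goals
    reputation: float           # track record of this identity so far
    request_risk: float         # risk of this specific request


def evaluate_request(s: RequestSignals) -> Decision:
    # Hard floor: never grant high-risk requests to low-confidence identities.
    if s.request_risk > 0.8 or s.identity_confidence < 0.3:
        return Decision.DENY

    # Trust is built iteratively: reputation earned on small, low-risk
    # requests unlocks larger ones later.
    trust = 0.4 * s.goal_alignment + 0.3 * s.reputation + 0.3 * s.identity_confidence
    if trust > 0.7 and s.request_risk < 0.4:
        return Decision.GRANT
    return Decision.ESCALATE
```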

Making moral judgements

Underpinning many of the judgements the system needs to make are actual moral judgements. This module needs to be able to define a preference ordering over the space of outcomes. The main factor that simplifies this somewhat is that the system is likely irrelevant once the superintelligence reaches a threshold of self-sufficiency. So ultimately the system’s goal is to avoid obviously bad outcomes, like everyone dying.

I believe this system is best defined as an actual numerical utilitarian model, with human decision-making guiding its values. I think https://www.moralmachine.net/ is a good example of the type of model that needs to be created.

As part of building this system, we need to collectively decide which paths we think an AI can more safely take to achieve superintelligence. For example, we probably consider an AI taking control of existing macroscopic robots to be safer than it building nanotechnology. This system should not just drive passive analysis, but allow us to actively advocate for certain paths.
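As a very crude sketch of such a model (the outcomes, paths, and every number below are illustrative placeholders chosen for this example, not proposed values):

```python
# Hypothetical sketch of the numerical model: a utility function over coarse
# outcome descriptions, combined with human-chosen preferences over paths.

from dataclasses import dataclass

# Human-decided preference ordering over takeoff paths (higher = preferred).
PATH_PREFERENCE = {
    "uses_existing_macroscopic_robots": 0.6,
    "builds_novel_nanotechnology": 0.1,
}

# Crude utilities over coarse outcomes; the point is to rank obviously bad
# outcomes far below everything else, not to capture all human values.
OUTCOME_UTILITY = {
    "humanity_extinct": -1_000_000.0,
    "humanity_disempowered_but_alive": -1_000.0,
    "humanity_alive_and_flourishing": 1_000.0,
}


@dataclass
class OutcomeEstimate:
    outcome: str        # one of the OUTCOME_UTILITY keys above
    probability: float  # a sub-module's estimate that a request leads here
    path: str           # which takeoff path the request advances


def expected_utility(estimates: list[OutcomeEstimate]) -> float:
    """Expected utility of approving a request, nudged by path preferences."""
    total = 0.0
    for e in estimates:
        total += e.probability * OUTCOME_UTILITY.get(e.outcome, 0.0)
        total += e.probability * PATH_PREFERENCE.get(e.path, 0.0) * 100.0
    return total
```

The enormous negative utility on extinction reflects the point above: the model only needs to push obviously bad outcomes far below everything else, not to resolve fine-grained moral questions.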

I accept that we cannot create a moral framework which perfectly reflects our actual values. Instead, I am primarily concerned with steering away from a bunch of very clear, obvious cases we agree are bad, like human extinction.

Implementation

This project attempts to attack the complexity of several hard problems, like moral judgement, through brute force, embracing the complexity head-on. This implies a scale of problem-solving that is rare to nonexistent outside of perhaps the tech giants.

The project needs to come into existence quickly, such that it is deployed in time. Slowing/pausing AI development could help this project substantially.

This project also needs to be highly secure, such that the easiest path to circumvent the system is still hard. In another sense, the project development needs to be secure against adversarial organizations. It would need a very robust deployment and testing framework.

I think this is most likely to succeed as an open-source project in the vein of Linux and Wikipedia. It would need to quickly attract an army of contributors, and could be designed to harness massive-scale human contribution.

It may also need to leverage AI code-generation, since that seems to be the most plausible way to create e.g. a system with 1000 submodules in weeks instead of years. And we should expect AI code-gen to continue to improve up to the point we hit superintelligence, so it makes sense strategically to find ways to direct as much of this potential towards this effort as possible.

This project also, by the end, needs a lot of resources: dedicated compute clusters and integration with various other services. If it were to succeed, it would need to have some level of mainstream awareness and social credibility. It would need to be memetically successful.

Conclusion

It’s difficult for me to concisely describe what this project really is. Most simply, I think of this project as a chunk of somewhat-humanity-aligned “brain” that we try to get the superintelligence to incorporate into itself as it grows.

I feel pretty confident that the game theory is sound. I think it’s feasible in principle to build something like this. I don’t know if it can actually be built in practice, and even if it were built, I’m not sure it would work.

That said, it’s a project that is orthogonal to alignment research, isn’t reliant on global coordination, and doesn’t require trust in AI research labs. There is definitely risk associated with it, as we would be creating something that could amplify an AI’s capabilities, though it would not accelerate capabilities research directly.