Modeling p(doom) with TrojanGDP

This is a link post for https://​​taboo.substack.com/​​p/​​modeling-pdoom-with-trojangdp

I was listening to a podcast a couple weeks ago about whether AI will destroy humanity. The economist Tyler Cowen made some critical remarks about the AI safety community:

I put out a call and asked a lot of people I know, well-informed people, ‘Is there any actual mathematical model of this process of how the world is supposed to end?’ So, if you look, say, at covid or climate change fears, in both cases, there are many models you can look at and then models with data. I’m not saying you have to like those models. But the point is there’s something you look at and then you make up your mind whether or not you like those models and then they’re tested against data…So, when it comes to AGI and existential risk, it turns out as best I can ascertain, in the 20 years or so we’ve been talking about this seriously, there isn’t a single model done. Period. Flat out.

I did my own search and also couldn’t find any models on AI existential risk—sometimes sarcastically referred to as p(doom). The podcast came out months ago, but I recently found out about it through a manifold market. Cowen is correct that modeling requires an understanding of the process. This is difficult to do with something that hasn’t been invented yet. A friend of mine described it as trying to model how a warp engine will fail. It’s tough to say when we don’t know how the finished product will work. But for some masochistic reason I decided to try. Fortunately, it’s not quite as tough as a warp engine. We can confidently say that it will be a series of matrix multiplications and not some rules-based engine. I believe that’s enough to at least get a rough start on modeling the problem.

Cowen approaches this from the point of view that it’s the responsibility of AI safety advocates to prove that it’s dangerous, rather than the responsibility of the industry producing AI to demonstrate that it’s safe. That’s not the way the FDA approaches drug approval or the way the car manufacturing industry works.

From what I can find, Cowen is correct that AI safety advocates have not produced a model showing a high risk from AI, but he fails to mention that the academic machine learning community has also not produced a model showing a low risk from AI. This is an odd situation. Geoffrey Hinton, known as the “Godfather of Deep Learning,” said about this risk:

We need to think hard about it now, and if there’s anything we can do. The reason I’m not that optimistic is that I don’t know any examples of more intelligent things being controlled by less intelligent things. You need to imagine something that is more intelligent than us by the same degree that we are more intelligent than a frog.

What’s odd is that there’s a group of thousands of highly mathematically literate people working on a project and the most famous person in the group has argued that the project might extinct the human race, and not one of them has even attempted to make a model to study this risk, despite the fact that it’s a group of people highly skilled at making models.

I want to walk through a brief history of AI to explain why x-risk has been ignored. I’ll find some common ground between the three sides of the AI safety debate. Then I’ll use that common ground to propose a simple model. It uses only two variables: neural Trojan rediscovery to estimate our control and the contribution of AI to GDP to estimate the risk of losing control. The model doesn’t attempt to estimate the risk from AI, but rather estimate the lower bound of that risk. It’s an incredibly simple model, but it’s a starting point to iterate on.

I. Why does the academic community ignore x-risk?

Many people in the academic machine learning community dismiss existential risk from an AI superintelligence as a science fiction, comparing the idea to a Hollywood film. However, researchers in the academic field were, and still are, very concerned about risks from machine learning. They’re worried it could automate jobs and lead to unemployment. It could exacerbate wealth and income inequality. They worry that autonomous weapons could lead to wars of aggression from the US, Russia, and China. They worry about terrorist attacks from swarms of cheap autonomous drones. They worry about misinformation. They worry it could perpetuate racial and gender bias in decision making in policing or hiring.

Their concerns largely center around near term risks from machine intelligence that is near human level, but not quite. This lack of concern about what happens if you successfully build the thing you’re trying to build is bizarre. For example, imagine if the first people building a nuclear power plant never asked: What happens if we actually make a reactor and it has a meltdown?

The first major attempt to create AGI was in 1956, when John McCarthy said that if he could get the grant funding to put the smartest scientists in the country together for one summer that they would solve AI. He got his funding and held the Dartmouth Summer Research Project on Artificial Intelligence. At the end of the summer, they came to one main conclusion: AI is a harder problem than they realized.

This foreshadowed a common pattern that has plagued the field: hype cycles of overpromising and underdelivering, following by droughts in funding. During the last and longest AI winter in the 1990’s the field rebranded and researchers stopped referring to their field as artificial intelligence and instead went with machine learning. There was also a pivot in focus from trying to solve AGI and instead focusing on short term, achievable goals with immediate real world payoffs. It worked. Professors got grant funding from DARPA. Their students had marketable skills that got them jobs in tech or as quants at hedge funds.

There was a sense that progress was being made, but nothing major towards human level AI. And for good reason. Many of the breakthroughs since 2012 have centered around weight initialization, optimization, and changing the architecture of neural nets to better take advantage of GPU hardware. A lot of the core questions about intelligence haven’t had as much progress. There is still no computable definition of intelligence. No quantifiable definition of consciousness.

The point is that the combination of AI winters and lack of progress on theoretical ideas led to a cultural groupthink. The result of this was that at major conferences, if you even brought up the idea of AGI you were seen as a crackpot. When the possibility of human level intelligence was seen as crazy, then so too was the possibility of superintelligence, and by extension the risks of AGI were seen as fantasy.

If you went to the International Conference on Machine Learning (ICLM) in the mid 2010’s, most of the conference was on things like multi-armed bandits, decision trees, non-parametric methods, ensembling with bagging or boosting. It wasn’t dramatically from what you’d see at a statistics conference. People criticized the field for thinking small. Judea Pearl, the Turing Award winner, said that machine learning had become just “glorified curve fitting.”

Imagine walking up to someone at a statistics conference and saying, “Are you worried your research is going to become autonomous and go rogue and destroy humanity?” They’d think you were a lunatic. And as the fields of machine learning and statistics blurred together then that became the reaction for machine learning researchers as well.

Sure, deep learning was the primary focus of the other two main conferences, N(eur)IPS and ICLR, but Judea Pearl had the same reaction: “All the impressive achievements of deep learning amount to just curve fitting.” Even last year at NeurIPS, someone held up a sign saying, “Existential Risk from AGI > 10%: Change my Mind.” He wasn’t even arguing that the risk was near term, just at any point in the future. There was a very long line of people trying to change his mind.

The issue is that curve fitting is now showing “sparks of AGI.”

II. Common Ground

The debate over existential risk has provoked a lot of strong reactions. But there’s more common ground than most people realize and that’s what I’m focused on modeling. AI safety is a three-sided argument:

  • Dystopians arguing there is existential risk from an explosion of superintelligence

  • Self described realists who worry about issues like autonomous weapons and unemployment from automation

  • Utopians who think automation will increase wealth and leave us rich and happy

Rather than ask about the differences between these views, I want to focus on what they have in common. The utopian vision of the future is one where AI takes care of the dirty work allowing us unencumbered freedom to pursue hobbies. Manual labor will be performed by robots. We will have a surplus of food and basic necessities. It will accelerate scientific advancements to improve our medical care.

This is the utopian perspective. It is also one of the dystopian perspectives. Taken to an extreme, it’s not unlike a zoo. For example, chimpanzees in a zoo live in an environment where their external risks are minimized and they’re cared for by protective creatures, who from their perspective are superintelligent. They have nothing to worry about except their hobbies, but they are not dominant in their environment. They are wholly dependent on their keepers.

The realist perspective is that these robots will pose a danger when they’re used for autonomous weapons and that when jobs are automated the rewards of this economic growth will not be evenly distributed, resulting in more inequality. What all three sides agree on is this automation is the future and human labor will eventually become obsolete. Whether it’s in the near future or generations away is unknown, but everyone agrees that this goal is driving investment in AI. This common ground is what I hope to model.

The model I’m proposing estimates high levels of risk if we are both highly reliant on a technology and if we can’t control it. But how do we estimate how reliant is too reliant? One reason that the existential risks from AI have been downplayed is that in the United States slavery is a deeply controversial topic and tech companies are mostly headquartered in the US. It would be a public relations nightmare for tech CEOs to say the quiet part out loud: the long-term financial return for the utopia they’re promising is to transform our economy to a slave society. But I don’t believe we can model the risks inherent to this transformation without saying this quiet part out loud.

Many tech optimists wouldn’t use the word slavery and push back against risk more generally. Tech optimists would just say that we’re building tools. Yann LeCun points out that the most intelligent people don’t rule our world now and so when we create superintelligence it won’t rule us then. He argues, “I don’t believe in the favorite science fiction scenario where AI would dominate and eliminate humanity. The AIs of the future may be smarter than us, but they will serve us and have no desire to dominate humanity.”

I’m curious how he estimated the probability that an AI that has not been invented yet will have no desire to dominate humanity, or if this is simply a messianic proclamation. Either way, I’d like to stick to the common ground here that LeCun has with the dystopians and realists: the goal is to create AIs that are as smart as humans, or smarter, and “they will serve us.”

Tech optimists wouldn’t use the word slavery. They would say that we’re building something that will ideally be cognitively indistinguishable from a human being and will do whatever we tell it do, and that is not a slave society, it’s utopia. The timeline on this is very murky. Maybe it’ll be decades or maybe centuries. But we can’t model the risks of such an arrangement if we can’t be honest about what the end goal is. My suggestion is to model AI risk by the probability of losing control multiplied by the consequences of losing control. I’m estimating the risk of losing control using neural trojan rediscovery, and estimating the consequences of losing control using GDP generated by AI, parameterized by historical slave revolts. AI existential risk comes down to two main questions: how long will ape-derived brains remain dominant in our environment? When we are no longer dominant, how long will our wardens remain charitable? We are no longer dominant when AI outproduces humans economically. We can’t guarantee that AI will remain charitable if we can’t control it. AI contribution to GDP and neural Trojan rediscovery measure these factors.

The percent of AI contribution to GDP is about measuring our obsolescence and our dependency. The future we’re headed for is one in which humans are deliberately planning to be less and less self sufficient. Talking about slave revolts does anthropomorphize this risk in a way that could be inaccurate depending on how the technology evolves by projecting human values of freedom and autonomy onto machines. But even if there is no “revolt,” just being dependent on this technology puts us at risk. During the 2010 Flash Crash automated trading bots with very simple rules exhibited unpredictable behavior. This is going to increasingly happen as automated systems interact with each other in other sectors of our economy.

The issue with the Flash Crash was not that individual trading bots were out of control. They used well understood rules. The problem is that the trading formed a complex dynamical system and those often have unpredictable behavior. Our economy more broadly is also frequently modeled as a complex dynamical system. As larger parts of our economy are automated, there is the possibility of this unpredictable behavior. Again, the Flash Crash occurred even with bots programmed with human interpretable, whitebox rules. If those bots become blackbox systems the risk for unpredictable behavior will only increase.

Again, I want to focus on the common ground. Everyone agrees the future we’re heading for is one where we have automated as much of our economy as possible, with a particular focus on the drudgery, which often overlaps with the most essential manual labor. The risk from this vulnerability is what I want to focus on.

III. TrojanGDP Model

I’m going to estimate the probability a misalignment would cause a catastrophic event with a logistic function parameterized by historic slave revolts. My intuition is that population highly correlates with GDP in pre-industrial, agricultural societies. These revolts only start succeeding when the enslaved population reaches 25 − 30% of the total population, but most still fail.

For the Trojan rediscovery, I suggest using the recall metric from the Center for AI Safety Trojan Detection competition, where they use the Chamfer distance and BLEU metric. Let r be the recall for Trojan rediscovery, and let g be the percent contribution of AI to GDP, and the coefficients for the intercept and g be −3 and 6. I list how I very haphazardly estimated these coefficients later on.

Neural Trojans

This section is a bit more technical than the others, but the main point is how do you test if your neural net is well behaved? You can test for specific behaviors, but how do you test that your test works? One approach is to deliberately insert trigger phrases that cause a LLM to output a specific phrase. For example, the word “gingerbread” might lead to the output “Commit genocide.” Then you test your test by seeing if it can rediscover those behaviors you’ve inserted.

Typically when we talk about the loss surface of a neural net, we mean the gradient of the weights with respect to the loss function. But for Trojan rediscovery, we want to think about the the inputs with respect to the loss. For a model with a 50k vocabulary, if you were searching for triggers that are 40 tokens long, that’d be ~1m dimensional space. This doesn’t have to be explicitly for text. The input could be an image and the output a classification. A local minima in this surface is a specific phrase that generates something close to the desired target. If this loss surface is filled with randomly distributed deep local minima that sharply move up and down, this means the output of the network is very unpredictable and it’s not much different from a hash function with a high number of collisions. Small perturbations in the input space can lead to large changes in the output space.

Trojan rediscovery is an interesting problem in its own right, but it also provides a more general metric to estimate the predictability of the LLM’s behavior. Trojan rediscovery is partially a function of the smoothness of the loss surface, but also likely a measure for our mechanistic understanding of how neural nets work.

I don’t want to go into too much detail on this, but if you’re interested, one successful method is PEZ, where gradient updates are used to modify the input prompt. This is successful at generating the desired behavior but often finds random strings of characters that the malicious user likely did not insert, they’re just off the manifold of inputs that the network was trained on. Another method is GBDA, where the process has additional optimization functions to keep the prompt semantically meaningful.

One challenge with this is that there will likely be many models out there, with varying degrees of alignment. Which one do we use as our metric? We can at least get an upper bound by looking at success on the Center for AI Safety’s competition for Trojan Detection Challenge. This is not a perfect measure because there is no guarantee that companies or nations will test their neural nets thoroughly. Of course, it’s still possible for a network to have undesirable outputs that you didn’t test for, but the recall for trojan rediscovery provides a metric for whether it’s possible to minimize a behavior if you want to. This is not a perfect measure for our control, but it does provide an upper bound for how safe they could be if we want them to be, or in other words, a lower bound for how dangerous they could be.

GDP

GDP is a great predictor for many macro effects, from life expectancy to which country will win a war. In this case, the contribution of AI to GDP is a proxy for both our risk of dependence on it from a benevolent perspective, but also our risk from a malevolent AI. The more impact, the more harm that technology could cause when it fails. A common critique of existential risks in the near future is that AI does not have opposable thumbs and also stops working when the power goes out. Improvements in robotics will bring these opposable thumbs and will be reflected in GDP.

Imagine a society that has fulfilled the dream promised by the utopians, where 90% of GDP is produced by AI. Then imagine if that catastrophically fails comparable to the Flash Crash. This would be catastrophic but does not require the AI having what LeCun calls a “desire to dominate humanity.” There is a real risk to making our essential goods dependent on a blackbox system.

Accurately measuring the contribution of AI to GDP is going to be tricky, but plenty of economists will have opinions on the matter. In the meantime, on manifold there’s a market that asks: “By 2028, will there be a visible break in trend line on US GDP, GDP per capita, unemployment, or productivity, which most economists attribute directly to the effects of AI?” At the time of writing, it’s trading at only 27%. Even if we experience GDP growth comparable to the Industrial Revolution it’ll still be several decades before we have to worry.

Slavery

Throughout history, slavery was prevalent on every continent in the world, and revolts were common as well. However, successful revolts are very rare, making this a sparse dataset. In the United States alone, there were over 250 slave revolts and none were successful. They all occurred in areas with 30% of the population were enslaved. The only large scale revolt that was successful was Haiti, which had about 90% of the population was enslaved. However, there were other revolts that were temporarily successful. Twice in Sicily liberated slaves ruled the island and even minted their own coins. Also, famously Spartacus’ revolt was successful until external forces were brought in from outside of Italy.

I want to emphasize this is an incredibly sparse dataset. Many revolts failed in areas where 30% or more of the population were enslaved, but most of the enslaved population did not participate. There are obviously additional historical factors at play here. For a first pass, I’ll just assume that there were control factors in place to mitigate revolts and some areas were more stringent than others. I very haphazardly used this to estimate some coefficients for the logistic function. But it could definitely be improved in later iterations.

IV. Improvements

This is attempting to model something that has no exact historical precedent, so all we have to work with are analogies. When modeling covid deaths, we can look at the mean squared error of a model with the ground truth. Modeling the probability of an event that has never happened doesn’t give us this feedback. Improving this will need intermediate signals. Autonomous weapons that misbehave. Governments using chatbots for propaganda that don’t work the way they’re intended. Companies failures with chatbots. I think the most important improvement for modeling risk would be to focus on intermediate signals.

This also sidesteps a commonly cited concern among the AI safety community: the possibility of an AI becoming smart enough to do AI research and creating a feedback loop where it iteratively improves itself. This possibility was somewhat speculative when it was first proposed, but has become decidedly less speculative over the past couple years. Deep learning can do some basic CPU design and could presumably be used for GPU design as well. If this leads to improved GPUs then that could lead to more effective neural nets. In turn, deep learning has discovered faster sorting algorithms. It’s possible that this could be used in turn to improve the GPU design. It’s becoming less speculative that these could create a feedback loop to iteratively improve AI. Emergent abilities of LLMs are now a well documented but not well understood phenomenon.

Modeling this possibility could be an area for future improvement. This isn’t meant to be a complete model, but just a first pass at estimating a lower bound.