Risk Overview of AI in Bio Research
Link post
OK. I’m aiming this at a group of people with a broad spectrum of p(doom) values, so this will be a scattergun approach covering different AI systems and threat models.
Threat Models
These are more “why does the AI kill us” than how. I assume that a powerful enough AI would find a way. Especially if it’s hooked up to a bio lab. Why are we trying this again?
Yudkowskian Inner Misalignment
The bulk of the argument goes something like this:
Powerful AI needs to be agentic, self-reflective, and coherent to be usefully goal-directed. Therefore it must behave like a utility maximizer.
There are many, many more unfriendly utility functions than friendly ones.
If the AI obtains an unfriendly utility function, it will hide this until it can take over.
RIP
As I understand it (which is poorly), this probably depends on the AI’s incentive to become self-reflective and coherent. It also depends on the AI’s ability to conceal misalignment during training.
Some Sorta Outer Misalignment
This one tends to be more “intuitive”:
The AI is given a task like “cure cancer”
People don’t get cancer if they’re dead, OR the users forgot to specify that the pill shouldn’t also cause Alzheimer’s
The AI does exactly what it’s told
This seems to depend on the amount of trust which is given to the system, and the degree to which the AI’s predictions are inscrutable to the users.
Creeping Disempowerment
Something of a half-way house
The AI is more productive than humans, outcompeting humans in all positions
Humans are no longer able to do useful labour, losing much of our bargaining power in the process
Systems which give more resources to AI outcompete those which give more resources to humans
Humanity’s share of the future shrinks towards zero
This seems to depend on the ability of humans to stay in the decision-making loop without harming the final decisions. We could either coordinate to limit the amount of control given to AIs, or enhance human capabilities to keep pace with advancing AI.
How To Do AI-Enhanced Research
There have been a few different ways to do this. Most of them involve using the AI to control lab equipment like liquid-handling robots, which look like this:
(I may have added the eye myself, but you’ll never know)
AlphaFold: Narrow-ish Tool
Link to Google announcement
This is the big one you’ll have heard of if you’re into this field. AlphaFold is a narrow AI which aims to predict the structures of biologically-relevant molecules. If this sounds weird, think about it like this:
Imagine aliens trying to understand how a city works. They suck parts of the city into their flying saucer, break them up, then separate them into parts. Their tech is on such a scale that they can only analyse things in bulk, so they’d need to get a million screwdrivers to figure out the structure of them. They can separate the parts, but they then can’t easily put them back together.
AlphaFold does the job of reconstructing things from knowing what their parts are. If the aliens suck up a million bicycles, separate out all the spokes, axles, and frames, then analyse those, they can model a hundred parts individually. AlphaFold can reconstruct those back into the shape of the bike.
In biology, we can quite easily figure out the sequence of a gene, and therefore the protein it codes for. But a protein actually has a complex 3D structure, which might also be bound loosely to other proteins. AlphaFold gets us (part of) the way from the gene sequence to the final protein structure.
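To make that division of labour concrete, here’s a minimal sketch of the pipeline in Python. The codon table is truncated to a handful of entries, and predict_structure() is a hypothetical stand-in for an AlphaFold-style model rather than any real API: the first two steps are the easy part, and the last call is the bit AlphaFold actually does.

```python
# Minimal sketch of the sequence -> structure pipeline described above.
# The codon table is truncated for illustration, and predict_structure()
# is a hypothetical placeholder, not AlphaFold's real interface.

CODON_TABLE = {  # partial standard genetic code
    "ATG": "M", "GCC": "A", "AAA": "K", "TGG": "W", "TAA": "*",  # * = stop
}

def translate(dna: str) -> str:
    """Read the coding sequence three bases at a time until a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X = codon not in our partial table
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

def predict_structure(protein: str) -> str:
    """Stand-in for an AlphaFold-style model: amino-acid sequence in, 3D structure out."""
    return f"<predicted 3D coordinates for {len(protein)}-residue protein>"

if __name__ == "__main__":
    gene = "ATGGCCAAATGGTAA"               # easy: reading the gene sequence
    protein = translate(gene)               # easy: the protein it codes for ("MAKW")
    structure = predict_structure(protein)  # hard: the step AlphaFold handles
    print(protein, structure)
```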
AlphaFold isn’t agentic. It doesn’t plan. All it does is enhance the productivity of human researchers. If it can replace cryo-EM (a very laborious technique, currently state-of-the-art for finding protein structures), then that saves a lot of time and effort.
Personally, I think the risks from AlphaFold-type AIs are very low.
They’re not incentivized to become agentic, or to model agents, so there’s a very low risk of inner misalignment
They don’t get given “goals” which they could misinterpret, so there’s no risk of outer misalignment
They’re only valuable as a part of a human-containing loop, so there’s little risk of creeping disempowerment
In The Future
We should think about how AlphaFold might be developed in future. Any tool-ish AI could be incorporated into the toolkit of a larger system if we’re not careful. For now I see relatively few dangerous directions to take AlphaFold 4 in. The riskiest seem to be some upcoming approaches to designing proteins with certain properties based on user input.
AlphaFlow: Narrow Agent
Link to Nature paper
This is also a narrow AI: narrow not just to a single task, but to a single machine. This one is totally autonomous, and optimizes a set of reactions for a pre-specified task. The researchers say “we want 50 nm gold particles with a 10 nm silver shell” and the AI controls a reactor, testing different conditions until it succeeds.
Why is this better than humans? It’s a quick learner and it doesn’t sleep. Fun fact: even the most dogged postdoc spends over 50% of the week not even working!
From experience, I can say that the most successful researchers are often those with the tightest feedback loop between running one experiment and running the follow-up. For an autonomous agent, this feedback loop can be very, very tight.
The architecture has two parts. First, a reward model which learns from the experiments to predict how good the nanoparticles are. Second, a Monte Carlo tree search method for picking new options. There isn’t anything weird going on here with self-reflectiveness or unexpected agency.
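For intuition, here’s a toy version of that closed loop, with every ingredient faked: run_experiment() is a pretend reactor, the “reward model” is a crude nearest-neighbour lookup, and random proposal plus greedy selection stands in for the paper’s Monte Carlo tree search. The shape of the loop is the point: predict, run, score, update, repeat.

```python
import random

# Toy closed-loop optimiser in the spirit of the system described above.
# Everything is illustrative: run_experiment() is a fake simulator, the
# "reward model" is a 1-nearest-neighbour lookup, and random proposal +
# greedy selection stands in for the actual Monte Carlo tree search.

TARGET = {"core_nm": 50.0, "shell_nm": 10.0}  # "50 nm gold core, 10 nm silver shell"

def run_experiment(conditions):
    """Pretend reactor: maps conditions to measured particle sizes (with noise)."""
    core = conditions["temp_c"] * 0.5 + conditions["flow_rate"] * 2.0 + random.gauss(0, 1)
    shell = conditions["agno3_conc"] * 20.0 + random.gauss(0, 0.5)
    return {"core_nm": core, "shell_nm": shell}

def reward(measured):
    """Rigid, pre-specified objective: negative distance from the target sizes."""
    return -(abs(measured["core_nm"] - TARGET["core_nm"])
             + abs(measured["shell_nm"] - TARGET["shell_nm"]))

def predict_reward(conditions, history):
    """Crude reward model: the reward of the most similar past experiment."""
    if not history:
        return 0.0
    def dist(past):
        return sum(abs(past[0][k] - conditions[k]) for k in conditions)
    return min(history, key=dist)[1]

def propose(best, n=50):
    """Search step: jitter the best-known conditions (stand-in for tree search)."""
    return [{k: v * random.uniform(0.8, 1.2) for k, v in best.items()} for _ in range(n)]

history = []
best_conditions = {"temp_c": 80.0, "flow_rate": 5.0, "agno3_conc": 0.4}

for _ in range(20):
    # Pick the candidate the reward model likes best, then actually run it.
    candidates = propose(best_conditions)
    chosen = max(candidates, key=lambda c: predict_reward(c, history))
    measured = run_experiment(chosen)
    history.append((chosen, reward(measured)))
    if history[-1][1] == max(r for _, r in history):
        best_conditions = chosen

print("best reward:", max(r for _, r in history))
```

Note that the reward function only ever sees the measured sizes, which is relevant to the outer-misalignment point below.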
So what are the risks of misalignment?
The risks of inner misalignment are, for now, low-ish. With the current architecture, the reward model seems to have no incentive to become agentic.
The AI definitely does get given goals which it might misinterpret. The risks from outer misalignment seem quite high here, since the reward function is rigidly defined and it would be easy to find an exploit.
The possibilities of creeping disempowerment seem relatively low. Humans are always in the loop to set up the reward function for each run.
In The Future
It will always be tempting to create more general versions of AlphaFlow. Future, more complex architectures might be incentivised to self-model uncertainty to improve their experiment design. Nobody has really made an architecture which does this inherently. For now I’ll say the risk is low barring an architecture breakthrough, and if a breakthrough does happen in this area, I expect it to be deployed more dangerously elsewhere before the automated chemistry labs get hold of it.
Incorporating systems like this into larger systems could lead to creeping disempowerment if those larger systems are also automated. But, as with AlphaFold, this applies to any tool.
GPT-4: Broad Agent
Link to Nature paper
This involves using a bunch of GPT-4 instances to run your lab for you. There are controller units which coordinate the other GPT-4s, googler units which search the web, coding units which write code for liquid-handling robots, etc.
In other words
So far, the GPT-4s still require humans to pick up plates of samples and move them between machines (this is one of the funniest examples I’ve seen of how automation has switched from taking the manual jobs to taking the cerebral ones). And it’s not that good at planning, because GPT-4 is not yet that good at planning.
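For a rough sense of the shape of such a system, here’s a minimal controller/worker sketch. call_llm(), web_search(), write_robot_code(), and the “protocol” they emit are all made-up placeholders standing in for real LLM and search APIs, not the actual system from the paper.

```python
# Minimal sketch of a controller/worker pattern like the one described above.
# call_llm() and web_search() are hypothetical stubs with canned responses so
# the sketch runs; the emitted "protocol" is a made-up string format, not the
# paper's actual liquid-handler code.

def call_llm(role: str, prompt: str) -> str:
    """Placeholder for an LLM API call; returns canned text for illustration."""
    canned = {
        "controller": "1. search: optimal PCR annealing temperature\n2. code: prepare 96-well PCR plate",
        "googler": "Typical annealing temperatures are ~55-65 C (stub search summary).",
        "coder": "# pseudo-protocol\ntransfer(master_mix, wells='A1:H12', volume_ul=20)",
    }
    return canned[role]

def web_search(query: str) -> str:
    return call_llm("googler", query)

def write_robot_code(task: str, context: str) -> str:
    return call_llm("coder", f"Task: {task}\nContext: {context}")

def run_lab(goal: str) -> list[str]:
    """Controller unit: break the goal into steps, dispatch each to a worker unit."""
    plan = call_llm("controller", f"Break this goal into steps: {goal}")
    artifacts, context = [], ""
    for step in plan.splitlines():
        kind, _, task = step.partition(": ")
        if kind.endswith("search"):
            context = web_search(task)                          # googler unit
            artifacts.append(context)
        elif kind.endswith("code"):
            artifacts.append(write_robot_code(task, context))   # coding unit
        # A human still has to carry the plates between machines.
    return artifacts

if __name__ == "__main__":
    for artifact in run_lab("amplify gene X and prepare samples for sequencing"):
        print(artifact, "\n---")
```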
The risks of inner misalignment depend on the AI labs acting responsibly (ha, ha ha). This seems to be an extremely juicy attack vector, both for any model plugged into the system and for models that aren’t plugged in directly: GPT-N+1 could intercept an API call to GPT-N and redirect the lab towards harmful ends pretty easily. But overall, the real issue of inner misalignment here is the base model, which isn’t actually being developed by the researchers in question.
The risks of outer misalignment are relatively high, but potentially lower than for AlphaFlow-like systems. If the LLM being used is generally helpful, it will probably have more context to work with than the rigidly specified reward function of a narrow agent like AlphaFlow.
The risks of creeping disempowerment are huge. Huge swaths of research could, conceivably, be run with no humans involved. Imagine if all of the labs and offices at AstraZeneca were run like this: if the AIs are choosing which experiments to run, which goals to pursue, and which drugs to push to market, then the CEO and the board aren’t really running the show, are they?
In The Future
This seems to be one of the riskiest approaches to AI-enhanced research going on at the moment. The models are poorly-understood LLMs, which might be changed or upgraded at a moment’s notice. Rather than having to think carefully about training a new agent, the researchers might just plug a new LLM into their system.
Combination Systems
The highest risks are likely to come from combined systems. I’ll discuss these in another post.