Epistemic status: this is the result of me trying to better understand the idea of mesa optimizers. It’s speculative and full of gaps, but maybe it’s interesting and I’m not realistically going to have time to improve it much in the near future.
Humans are often presented as an example of “mesa optimisers”—organisms created to “maximise evolutionary fitness” that end up doing all sorts of other things including not maximising evolutionary fitness and transforming the world in the process. This analogy is usually accompanied by a disclaimer like this:
We do not expect our analogy to live up to intense scrutiny. We present it as nothing more than that: an evocative analogy (and, to some extent, an existence proof) that explains the key concepts.
I am proposing that if we focus on evolutionary “influence” instead of “fitness”, we can flip both claims on their head:
Humans are extremely evolutionarily influential
We should take the evolution analogy seriously
Evolution is about how things change
I think evolutionary fitness is, to some extent, not the interesting thing about evolution. If the world has always been full of indistinguishable rabbits and always will be, then rabbits are in some sense “fit”, but evolution is a boring theory in that world because all it says is “rabbits”. The interesting content of evolution is in what it says about how things change: if the world is full of regular rabbits plus one extra-fit rabbit, then evolution says that in a few years the world will have few regular rabbits and lots of extra-fit rabbits.
I want to propose a rough definition of evolutionary influence that generalises this idea. There are a few gaps in the definition which I hope can be successfully resolved, but I haven’t had the time to do this yet.
First, we need an environment. I currently think of an environment as “a universe at a particular point in time”. The universe is:
A set T of “points in time”
A set Ω of configurations that the universe can have at a particular time
An update rule f:Ω→Δ(Ω) that probabilistically maps the configuration at one point in time to the configuration at the next (Δ(Ω) means “the set of probability distributions on Ω”).
Given a time t∈T, the environment μt is a probability distribution on Ω.
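As a sanity check on these definitions, here is a minimal sketch of the setup in Python. This is my own toy encoding, not anything canonical: configurations are hashable values, T is implicitly the natural numbers, the update rule returns a distribution over next configurations, and an environment is a distribution over Ω that we can push forward through f.

```python
from typing import Callable, Dict, Hashable

State = Hashable                      # an element of Ω
Dist = Dict[State, float]             # a distribution in Δ(Ω), as state -> probability
UpdateRule = Callable[[State], Dist]  # f : Ω → Δ(Ω)

def step(env: Dist, f: UpdateRule) -> Dist:
    """Push the environment μ_t forward one time step through f."""
    out: Dist = {}
    for state, p in env.items():
        for nxt, q in f(state).items():
            out[nxt] = out.get(nxt, 0.0) + p * q
    return out

def run(env: Dist, f: UpdateRule, steps: int) -> Dist:
    """Apply f repeatedly, giving the environment at the later time t' = t + steps."""
    for _ in range(steps):
        env = step(env, f)
    return env
```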
An organism is a particular configuration of a “small piece” of a universe. We can specify a function u:Ω→{0,1} that evaluates whether the universe contains the organism, and u is somehow restricted to evaluating “small parts” of the universe (what I mean by a “small part” is currently a gap in the definition). We can condition an environment μt on the presence of an organism u to get the environment μt|u and similarly μt|∼u is the environment without the organism u.
A feature is a “large piece” of a universe. Like organisms, I’m not sure what I mean by “large piece”. In any case, there’s a function v:Ω→{0,1} that tells us whether a feature is present, and it must in some sense be “big and obvious”.
An organism u at time t has a large evolutionary influence at t′>t if the probability of some large feature v is very different in the future environment with the organism, μt|u f^(t′−t), than in the future environment without it, μt|∼u f^(t′−t). Here applying f^(t′−t) means running the update rule forward for t′−t steps.
Intuition: If at time t the environment μt is full of grass and also contains a pair of rabbits u, then in the future t′ the environment μt|u f^(t′−t) will be full of rabbits and have a lot less grass. On the other hand, if there are no rabbits at time t, then the future environment μt|∼u f^(t′−t) will still be mostly grass.
Intuition 2: Perhaps if humans had not appeared when they did, highly intelligent life would have taken much longer to appear on Earth, or never appeared at all. In that case, the Earth without humans 300k years ago wouldn’t have any cities etc. today.
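To make the rabbit intuition slightly more concrete, here is a toy Monte Carlo version of the comparison between μt|u f^(t′−t) and μt|∼u f^(t′−t). Everything here (the dynamics, the thresholds, the numbers) is invented purely for illustration; it only shows the shape of the influence comparison, not a serious model of anything.

```python
import random

# Toy universe: a configuration is (rabbits, grass).  One step of the (made-up)
# update rule: each rabbit eats a unit of grass if any is left, each fed rabbit
# has a 50% chance of producing offspring, and the grass regrows a little.
def f(state):
    rabbits, grass = state
    fed = min(rabbits, grass)
    offspring = sum(1 for _ in range(fed) if random.random() < 0.5)
    return (rabbits + offspring, grass - fed + 1)

def organism_present(state):   # u : Ω → {0,1}, a "small piece" of the configuration
    rabbits, _ = state
    return rabbits > 0

def feature_present(state):    # v : Ω → {0,1}, a "big and obvious" feature
    _, grass = state
    return grass < 10          # "the world is no longer mostly grass"

def feature_probability(initial, steps, trials=2000):
    """Monte Carlo estimate of the probability of v after rolling f forward."""
    hits = 0
    for _ in range(trials):
        s = initial
        for _ in range(steps):
            s = f(s)
        hits += feature_present(s)
    return hits / trials

# μt|u: grass plus a pair of rabbits; μt|~u: the same world with no rabbits.
p_with = feature_probability((2, 100), steps=50)
p_without = feature_probability((0, 100), steps=50)
print(f"P(v | u) ~ {p_with:.2f}, P(v | ~u) ~ {p_without:.2f}, gap ~ {abs(p_with - p_without):.2f}")
```

With rabbits present the grass almost always ends up scarce, without them it never does, so the gap between the two feature probabilities is large: the pair of rabbits is a small piece of the configuration with a large influence on a big feature.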
Relevance to AI
The fundamental question I’m asking here is: are AI research efforts likely to produce highly influential “organisms”? A second important question is whether this influence is aligned with the creator’s aims, but that seems to me to add a lot of complication.
My basic thinking here is that an AI system in training is embedded in two “universes”. Think of a large neural network in training. One “universe” in which it lives is the space of network weights, and the update rule is given by the training algorithm and the loss incurred on the data at each step of training. It’s not clear that it’s meaningful to talk about “influence” in this universe. Maybe there is some “small” feature of the initialisation that determines whether it converges to something useful or not, but that is speculative (and I don’t know what I mean by “small”).
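As a concrete (and heavily simplified) illustration of this first universe, here is what the “training universe” might look like for a toy linear model: Ω is the set of weight vectors, and one application of f is one SGD step on a random minibatch, which is exactly why f lands in Δ(Ω) rather than Ω. The data, model, and hyperparameters below are all arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "training universe": a configuration ω ∈ Ω is a weight vector, and the
# update rule f is one SGD step on a randomly drawn minibatch, so f really is
# a map Ω → Δ(Ω): the randomness comes from the minibatch sampling.
X = rng.normal(size=(1000, 5))                                    # made-up training data
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=1000)

def f(weights, batch_size=32, lr=0.01):
    idx = rng.choice(len(X), size=batch_size, replace=False)
    grad = 2 * X[idx].T @ (X[idx] @ weights - y[idx]) / batch_size  # MSE gradient
    return weights - lr * grad                                      # one sample from f(ω)

# The "environment" at time 0 is the initialisation distribution; here we just
# take one sample ω_0 ~ μ_0 and roll the universe forward.
w = rng.normal(size=5)
for t in range(500):
    w = f(w)
print("trained weights:", np.round(w, 2))
```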
It also exists in the real universe—i.e. it’s a configuration in the storage of some computer somewhere. Here there’s a more intuitive sense in which we can talk about “influentialness”—if it produces useful outputs somehow, people will get excited about their new AI, publish papers about it, create products using it and so forth, whereas if it doesn’t then none of that will happen and it will be forgotten.
Given the way neural networks are trained, a trained neural network basically has to be something that performs reasonably well in the training universe. However, influence in the real universe trumps performance in the training universe—a real-universe-influential AI that isn’t actually good on the training set is still real-universe-influential.
Compatibility
Two postulates about AI and influence:
1. There is a training universe that, when run for long enough, produces a highly influential organism in our universe.
An example of this would be a very high-performance reinforcement learner whose reward is based on some “large feature” of our universe (see the sketch after this list).
2. AI training that we actually do has a reasonable chance of creating such a training universe.
For example, maybe it’s not too hard to repurpose existing techniques (or future developments of them) to create the reinforcement learner mentioned in 1.
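Here is a cartoon of postulate 1’s example, nothing more: a reward signal that is literally an indicator of a “large feature” of the real universe’s configuration. The names `world_state` and `number_of_cities` are placeholders I invented (echoing the cities example above); nothing here refers to an existing system.

```python
# Cartoon of postulate 1's example: the reward is an indicator of a "large
# feature" v of the real universe.  `world_state` is a hypothetical stand-in
# for a configuration ω ∈ Ω of the real universe.
def large_feature(world_state: dict) -> bool:
    return world_state.get("number_of_cities", 0) > 1000

def reward(world_state: dict) -> float:
    return 1.0 if large_feature(world_state) else 0.0
```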
Speculatively, we can talk about the compatibility of the training environment and the real universe: “high performance” in the training environment is correlated with influence in the real universe. For a hypothetically “optimal” reinforcement learner rewarded based on large features of the real universe, this compatibility is maximal. However, even more pragmatic AI training regimes might exhibit high compatibility.
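One crude way to operationalise this notion of compatibility (my own stand-in, since I haven’t defined it properly) would be the correlation, across a collection of trained systems, between their performance in the training universe and their influence in the real universe:

```python
import numpy as np

def compatibility(training_performance, real_world_influence):
    """A hypothetical compatibility score for a training regime: the correlation
    between training-universe performance and real-universe influence, measured
    across a sample of trained systems.  Both inputs are 1-D arrays of equal
    length; a value near 1 would mean high compatibility."""
    return float(np.corrcoef(training_performance, real_world_influence)[0, 1])
```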
Also, pragmatic AI training regimes might exhibit high compatibility without much transparency about what kind of real-world influence is compatible with the training. Recalling that real-world influence screens off performance on the training objective, good behaviour assuming optimality with respect to the training objective may not be enough to guarantee good behaviour. This seems to me to be one of the key insights of the idea of mesa-optimisation. On the other hand, it’s completely plausible to me that good behaviour assuming optimality could imply good behaviour for near-optimality too. It is also quite mysterious to me how to actually characterise “good behaviour”.
One thing we get from the “compatibility” framing that we don’t get from the “optimization” framing is that compatibility arises because people want AIs that can do useful stuff in the real world. This is true for technology in general, of course, but AI stands out as being a unique combination of
Search/“optimization pressure” (compared to designing a shovel, training an AI involves a lot more searching)
Training environment compatibility (compared to a shortest path search, training an AI involves a lot more signal from the real world)
Conclusion
I’ve sketched a few rough ideas here that might be useful for better understanding AI risk. If they are actually going to be useful, they really need more development. Some questions:
How should influence actually be defined?
Can “small parts” and “big parts” of the universe be defined in some way that leads to influence being non-trivial?
Should influence reduce to evolutionary fitness under appropriate assumptions?
If we do have a definition, how should “compatibility between environments” be defined?
Can we actually derive any results along the lines of the speculative proposals above?
Is evolutionary influence the mesa objective that we’re interested in?