If running a single copy of a given AI system (let’s call it SketchyBot) for 1 month has a 5% chance of destroying the world …
Even given entirely aleatoric risk, it’s not clear to me that the compounding effect is necessary.
Suppose my model for AI risk is a very naive one: when the AI is first turned on, its values are either completely aligned (95% chance) or unaligned (5% chance). Under this model, one month after turning on the AI, I'll have a 5% chance of being dead and a 95% chance of being an immortal demigod. Wait another month, year, or decade, and there's still just a 5% chance that I'm dead and a 95% chance that I'm an immortal demigod. Running other copies of the same AI in parallel doesn't change that either.
More generally, it seems that any model of AI risk where self.goingToDestroyTheWorld() is evaluated exactly once isn't subject to those sorts of multiplicative risks. In other words, 1 - .95**60 == we're all dead only works under fairly specific conditions, and no epistemic arguments are needed to see that.
In fact, epistemic uncertainty can actually increase the total risk if my baseline is the evaluated-once model. Adding other possible worlds, where the AI decides each morning whether it wants to destroy the world, or is fundamentally incompatible with humans no matter what we try, just moves the integral over all possible models towards doom.
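To put rough numbers on that contrast, here's a minimal Python sketch; the 5%/95% split and the 60-month horizon are the toy figures from above, and the 70/30 mixture weights are made up purely for illustration.

```python
# Toy comparison of the "evaluated once" and "evaluated every month" models,
# using the 5% figure and 60-month horizon from the discussion above.

P_UNALIGNED = 0.05   # chance the AI comes out unaligned when first turned on
MONTHS = 60          # how long it runs

# Model 1: goingToDestroyTheWorld() is evaluated exactly once, at startup.
# Running longer (or running more copies of the same flip) changes nothing.
p_doom_once = P_UNALIGNED

# Model 2: an independent 5% chance of doom every month.
p_doom_monthly = 1 - (1 - P_UNALIGNED) ** MONTHS

# Model 3: epistemic mixture over the two models; any weight on the
# "every month" world pulls total risk above the evaluated-once baseline.
# The 70/30 weights are made up for illustration.
w_once, w_monthly = 0.7, 0.3
p_doom_mixture = w_once * p_doom_once + w_monthly * p_doom_monthly

print(f"evaluated once:       {p_doom_once:.3f}")     # 0.050
print(f"independent monthly:  {p_doom_monthly:.3f}")  # ~0.954
print(f"70/30 mixture:        {p_doom_mixture:.3f}")  # ~0.321
```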
To make this model a little richer and share something of how I think about it: I tend to think of the risk of any particular powerful AI the way I think of risk in deploying software.
I work in site reliability/operations, so we deal with things we model as having aleatoric uncertainty, like a constant background risk that any particular system will fail unexpectedly for some reason (hardware failure, cosmic rays, an unexpected code-execution path, etc.). But I also know that most of the risk comes right at the beginning, when I first turn something on (bring up new hardware, deploy new code, etc.). A very simple model of this is something like f(x) = e^(-x) + c, where most of the failure risk happens right at the start and there's little to no risk beyond that. So running for months doesn't represent a 95% risk; almost all of the 5% risk is eaten up right at the start, because the probability density is shaped so that nearly all of its mass sits at the beginning.
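A rough sketch of that shape, with purely illustrative constants (the front-loaded term is sized so it tops out near the 5% figure):

```python
import math

# Decaying-hazard model for deployment risk: nearly all of the failure
# probability is concentrated right after turn-on, with a small constant
# background rate afterwards. The constants are illustrative only.
A = -math.log(0.95)  # front-loaded hazard, sized so it tops out near 5%
C = 1e-4             # small constant background hazard per month

def p_failed_by(t_months: float) -> float:
    """P(failure by time t) under the hazard rate h(t) = A*exp(-t) + C."""
    cumulative_hazard = A * (1 - math.exp(-t_months)) + C * t_months
    return 1 - math.exp(-cumulative_hazard)

for t in (0.1, 1, 12, 120):
    print(f"P(failed by month {t:>5}): {p_failed_by(t):.4f}")

# The bulk of the ~5% is spent within the first month or two; after that the
# curve only creeps upward via the small background term C.
```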
Agree, good point. I'd say aleatoric risk is necessary to produce compounding but not sufficient, though maybe I'm still looking at this the wrong way.
The mathematical property that you’re looking for is independence. In particular, your computation of 1 - .95**60 would be valid if the probability of failure in one month is independent of the probability of failure in any other month.
I don’t think aleatoric risk is necessary. Consider an ML system that was magically trained to maximize CEV (or whatever you think would make it aligned), but it is still vulnerable to adversarial examples. Suppose that adversarial example questions form 1% of the space of possible questions that I could ask. (This is far too high, but whatever.) It’s likely roughly true that two different questions that I ask have independent probabilities of being adversarial examples, since I have no clue what the space of adversarial examples looks like. So the probability of failure compounds in the number of questions I ask.
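To illustrate the compounding, assuming as in the toy numbers above that 1% of possible questions are adversarial and that each question asked is, from my perspective, an independent draw from the question space:

```python
# Compounding without any aleatoric randomness in the system itself:
# assume (per the toy numbers above) that 1% of possible questions are
# adversarial examples and that, from my perspective, each question I ask
# is an independent draw from the question space.
P_ADVERSARIAL = 0.01

for n_questions in (1, 10, 100, 1000):
    p_hit = 1 - (1 - P_ADVERSARIAL) ** n_questions
    print(f"P(at least one adversarial example in {n_questions:>4} questions) = {p_hit:.3f}")
# 1 -> 0.010, 10 -> 0.096, 100 -> 0.634, 1000 -> ~1.000
```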
Personally, I still put a lot of weight on models where the kind of advanced AI systems we’re likely to build are not dangerous by default, but carry some ~constant risk of becoming dangerous for every second they are turned on (e.g. by breaking out of a box, having critical insights about the world, instantiating inner optimizers, etc.).
In this case I think you should estimate the probability of the AI system ever becoming dangerous (bearing in mind how long it will be operating), not the probability per second. I expect much better intuitions for the former.
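As a purely illustrative example of why the lifetime framing is the better one to estimate: a per-second hazard that sounds negligible can still compound to a large total over a realistic operating window (the 1e-9/second figure below is made up).

```python
# Why the "ever becomes dangerous" framing is the one to estimate directly:
# a per-second hazard that sounds negligible compounds over a long
# deployment. The 1e-9/second hazard below is a made-up number.
P_PER_SECOND = 1e-9
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

for years in (1, 10):
    seconds = years * SECONDS_PER_YEAR
    p_ever = 1 - (1 - P_PER_SECOND) ** seconds
    print(f"P(ever dangerous within {years:>2} years) = {p_ever:.3f}")
# ~0.031 after one year, ~0.27 after ten years; intuitions stated per second
# track this much less well than a direct estimate of the total.
```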
1) Yep, independence.
2) Seems right as well.
3) I think it’s important to consider “risk per second”, because
(i) I think many AI systems could eventually become dangerous, just not on reasonable time-scales.
(ii) I think we might want to run AI systems which have the potential to become dangerous for limited periods of time.
(iii) If most of the risk is far in the future, we can hope to become more prepared in the meantime.