I think my real complaint here is that your story is getting its emotional oomph from an artificial constraint (every output must be 100% correct or many beings die) that doesn’t usually hold, not even for AI alignment.
Well OK I agree that “every output must be 100% correct or many beings die” is unrealistic. My apologies for a bad choice of toy problem that suggested that I thought such a stringent requirement was realistic.
But would you agree that there are some invariants that we want advanced AI systems to have, and that we really want to be very confident that our AI systems satisfy these invariants before we deploy them, and that these invariants really must hold at every time step?
To take an example from ARCHES, perhaps it should be the case that, for every action output at every time step, the action does not cause the Earth’s atmospheric temperature to move outside some survivable interval. Or perhaps you say that this invariant is not a good safety invariant—ok, but surely you agree that there is some correct formulation of some safety invariants that we really want to hold in an absolute way at every time step? Perhaps we can never guarantee that all actions will have acceptable consequences because we can never completely rule out some confluence of unlucky conditions, so then perhaps we formulate some intent alignment invariant that is an invariant on the internal mechanism by which actions are generated. Or perhaps intent alignment is misguided and we get our invariants from some other theory of AI safety. But there are going to be invariants that we want our systems to satisfy in an absolute way, no?
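To make “holds in an absolute way at every time step” concrete, here is a toy formalization of the temperature example (the state $s_t$, transition function $f$, and bounds $T_{\min}$, $T_{\max}$ are my own illustrative notation, not anything taken from ARCHES):

$$\forall t,\ \forall a_t \in A(s_t):\quad T_{\mathrm{atm}}\!\left(f(s_t, a_t)\right) \in [T_{\min}, T_{\max}]$$

The force of the requirement is in the universal quantifiers: it is a claim about every time step and every action the system might output, not about average-case behavior.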
And if we want to check whether our system satisfies some invariant in an absolute way, then I claim that we need to be able to look inside the system and see how it works, and convince ourselves based on an understanding of how the thing is assembled that, yes, this python code really will sort integers correctly in all cases; that, yes, this system really is structured such that this intent alignment invariant will always hold; that, yes, this learning algorithm is going to produce acceptable outputs in all cases for an appropriate definition of acceptability.
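As a minimal sketch of what I mean by “convince ourselves based on an understanding of how the thing is assembled” (my own illustrative code, not anything from the earlier discussion): an insertion sort where the comments state the loop invariant that lets you argue correctness for all inputs by reading the code, rather than by running it on examples.

```python
def insertion_sort(xs: list[int]) -> list[int]:
    """Return a sorted copy of xs."""
    result: list[int] = []
    for x in xs:
        # Invariant: at the top of this loop, `result` is a sorted
        # permutation of the elements of xs consumed so far.
        i = len(result)
        # Walk left past every element greater than x.
        while i > 0 and result[i - 1] > x:
            i -= 1
        # Inserting x at position i preserves sortedness, so the
        # invariant holds again on the next iteration.
        result.insert(i, x)
    # When the loop ends, `result` is a sorted permutation of all of xs.
    return result
```

Because each step is justified by the invariant, the argument covers all inputs, including ones no test suite will ever run.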
When we build sophisticated systems and we want them to satisfy sophisticated invariants, it’s very hard to use end-to-end testing alone. And we are forced to use end-to-end testing alone whenever we are dealing with systems that we do not understand the internals of. Search produces systems that are very difficult to understand the internals of. Therefore we need something beyond search. This is the claim that my integer sorting example was trying to be an intuition pump for. (This discussion is helping to clarify my thinking on this a lot.)
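For contrast, here is what end-to-end testing of a black box looks like (again just an illustrative sketch; `opaque_sort` stands in for any system whose internals we cannot inspect):

```python
import random

def end_to_end_test(opaque_sort, trials: int = 10_000) -> bool:
    """Black-box testing: sample inputs and check the outputs.

    This can surface bugs, but it only ever exercises finitely many
    inputs, so it cannot establish "sorts correctly in all cases".
    """
    for _ in range(trials):
        xs = [random.randint(-1000, 1000) for _ in range(random.randint(0, 50))]
        if opaque_sort(xs) != sorted(xs):
            return False  # found a counterexample
    return True  # no counterexample found -- which is not a proof
```

A system produced by search is in roughly this position: we can run it and check its outputs, but we cannot read an invariant off its internals the way we can with the insertion sort above.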
I agree that there are some invariants that we really would like to hold, but I don’t think it should necessarily be thought of in the same way as in the sorting example.
Like, it really would be nice to have a 100% guarantee on intent alignment. But it’s not obvious to me that you should think of it as “this neural network output has to satisfy a really specific and tight constraint for every decision it ever makes”. It’s not like for every possible low-level action a neural net is going to take, it’s going to completely rethink its motivations / goals and forward-chain all the way to what action it should take. The risk seems quite a bit more nebulous: maybe the specific motivation the agent has changes in some particular weird scenario, or would predictably drift away from what humans want as the world becomes very different from the training setup.
(All of these apply to humans too! If I had a human assistant who was intent aligned with me, I might worry that if they were deprived of food for a long time, they might stop being intent aligned with me; or if I got myself uploaded, then they may see the uploaded-me as a different person and so no longer be intent aligned with me. Nonetheless, I’d be pretty stoked to have an intent aligned human assistant.)
There is a relevant difference between humans and AI systems here, which is that we expect that we’ll be ceding more and more decision-making influence to AI systems over time, and so errors in AI systems are more consequential than errors in humans. I do think this raises the bar for what properties we want out of AI systems, but I don’t think it gets to the point of “every single output must be correct”, at least depending on what you mean by “correct”.
Re: the ARCHES point: I feel like an AI system would only drastically modify the temperature “intentionally”. Like, I don’t worry about humans “unintentionally” jumping into a volcano. The AI system could still do such a thing, even if intent aligned (e.g. if its user was fighting a war and that was a good move, or if the user wanted to cause human extinction). My impression is that this is the sort of scenario ARCHES is worried about: if we don’t solve the problem of humans competing with each other, then humans will fight with more and more impactful AI-enabled “weapons”, and eventually this will cause an existential catastrophe. This isn’t the sort of thing you can solve by designing an AI system that doesn’t produce “weapons”, unless you get widespread international coordination to ensure that no one designs an AI system that can produce “weapons”.
(Weapons in quotes because I want it to also include things like effective propaganda.)