I agree that there are some invariants that we really would like to hold, but I don’t think they should necessarily be thought of in the same way as in the sorting example.
Like, it really would be nice to have a 100% guarantee on intent alignment. But it’s not obvious to me that you should think of it as “this neural network output has to satisfy a really specific and tight constraint for every decision it ever makes”. It’s not like for every possible low-level action a neural net is going to take, it’s going to completely rethink its motivations / goals and forward-chain all the way to what action it should take. The risk seems quite a bit more nebulous: maybe the specific motivation the agent has changes in some particular weird scenario, or would predictably drift away from what humans want as the world becomes very different from the training setup.
(All of these apply to humans too! If I had a human assistant who was intent aligned with me, I might worry that if they were deprived of food for a long time, they might stop being intent aligned with me; or if I got myself uploaded, then they may see the uploaded-me as a different person and so no longer be intent aligned with me. Nonetheless, I’d be pretty stoked to have an intent aligned human assistant.)
There is a relevant difference between humans and AI systems here, which is that we expect that we’ll be ceding more and more decision-making influence to AI systems over time, and so errors in AI systems are more consequential than errors in humans. I do think this raises the bar for what properties we want out of AI systems, but I don’t think it gets to the point of “every single output must be correct”, at least depending on what you mean by “correct”.
Re: the ARCHES point: I feel like an AI system would only drastically modify the temperature “intentionally”. Like, I don’t worry about humans “unintentionally” jumping into a volcano. The AI system could still do such a thing, even if intent aligned (e.g. if its user was fighting a war and that was a good move, or if the user wanted to cause human extinction). My impression is that this is the sort of scenario ARCHES is worried about: if we don’t solve the problem of humans competing with each other, then humans will fight with more and more impactful AI-enabled “weapons”, and eventually this will cause an existential catastrophe. This isn’t the sort of thing you can solve by designing an AI system that doesn’t produce “weapons”, unless you get widespread international coordination to ensure that no one designs an AI system that can produce “weapons”.
(Weapons in quotes because I want it to also include things like effective propaganda.)