Skeptic: It seems to me that the distinction between “alignment” and “misalignment” has become something of a motte and bailey. Historical arguments that AIs would be misaligned used it in sense 1: “AIs having sufficiently general and large-scale motivations that they acquire the instrumental goal of killing all humans (or equivalently bad behaviour)”. Now people are using the word in sense 2: “AIs not quite doing what we want them to do”. But when our current AIs aren’t doing quite what we want them to do, is that mainly evidence that future, more general systems will be misaligned₁ (which I agree is bad) or misaligned₂?
Advocate: Concepts like agency are continuous spectra. GPT-3 is a little bit agentic, and we’ll eventually build AGIs that are much more agentic. Insofar as GPT-3 is trying to do something, it’s trying to do the wrong thing. So we should expect future systems to be trying to do the wrong thing in a much more worrying way (aka be misaligned₁) for approximately the same reason: that we trained them on loss functions that incentivised the wrong thing.
Skeptic: I agree that this is possible. But what should our update be after observing large language models? You could look at the difficulties of making GPT-3 do exactly what we want, and see this as evidence that misalignment is a big deal. But actually, large language models seem like evidence against misalignment₁ being a big deal (because they seem to be quite intelligent without being very agentic, whereas the original arguments for worrying about misalignment₁ relied on the idea that intelligence and agency are tightly connected, making it very hard to build superintelligent systems which don’t have large-scale goals).
Advocate: Even if that’s true of the original arguments, it’s not true of more recent ones.
Skeptic: These newer arguments rely on assumptions about economic competition and coordination failures which seem quite speculative to me, and which haven’t been vetted very much.
Advocate: These assumptions seem like common sense to me—e.g. lots of people are already worried about the excesses of capitalism. But even if they’re speculative, they’re worth putting a lot of effort into understanding and preparing for.
In case it wasn’t clear from inside the dialogue, I’m quite sympathetic to both sides of this conversation (indeed, it’s roughly a transcript of a debate that I’ve had with myself a few times). I think more clarity on these topics would be very valuable.
There’s an identical problem with “friendliness”. Sometimes unfriendliness means we all die, sometimes it means we don’t get utopia.