I generally agree with the points made in this post.
Points I agree with
Slowing down AI progress seems rational conditional on there being a significant probability that AGI will cause extinction.
Generally, technologies are accepted only when their expected benefit significantly outweighs their expected harm. Consider flying as an example. Let’s say the benefit of each flight is +10 and the utility of getting killed is −1000. If x is the probability of surviving, then the expected utility is 10x−1000(1−x), and the break-even point is where this equals zero.
Solving 10x−1000(1−x)=0 for x, the expected utility is zero when x≈0.99. In other words, the flight is only worth it if there is at least a 99% chance of survival, which makes intuitive sense.
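To spell out the algebra with these illustrative payoffs (+10 and −1000 are just the toy numbers above, not empirical values):

\[
10x - 1000(1 - x) = 0 \;\Longrightarrow\; 1010x = 1000 \;\Longrightarrow\; x = \tfrac{1000}{1010} \approx 0.99
\]

So the flight is net positive only when the survival probability exceeds roughly 99%.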
If we use the same utility function for AI and assume, as Eliezer appears to believe, that creating AGI has a 50% chance of causing human extinction, then the outcome would be strongly net negative for humanity, and one should agree with this sentiment unless one’s P(extinction) is less than 1%.
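Plugging the assumed 50% extinction probability into the same toy utility function makes the asymmetry explicit:

\[
\mathbb{E}[U] = 10(0.5) - 1000(0.5) = 5 - 500 = -495,
\]

which is far below the break-even threshold of P(extinction) ≈ 1%.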
Eliezer is saying that we can in principle make AI safe but argues that it could take decades to advance AI safety to the point where we can be sufficiently confident that creating an AGI would have net positive utility.
If slowing down AI progress is the best course of action, then achieving a good outcome for AGI seems more like an AI governance problem than a technical AI safety research problem.
Points I disagree with
“Progress in AI capabilities is running vastly, vastly ahead of progress in AI alignment or even progress in understanding what the hell is going on inside those systems. If we actually do this, we are all going to die.”
I think Evan Hubinger has said before that if this were the case, GPT-4 would be less aligned than GPT-3, but the opposite is true in reality (GPT-4 is more aligned, according to OpenAI). Still, I think we ideally want a scalable AI alignment solution long before the level of capabilities is reached where it’s needed. A similar idea is how Claude Shannon conceived of a minimax chess algorithm decades before we had the compute to implement it.
Other points
Eliezer has been sounding the alarm for some time, and it’s easy to get alarm fatigue and become complacent. But the fact that a leading member of the AI safety research community is delivering a message as extreme as this is itself alarming.
In regard to the point you disagree on:
As I understand it, (seemingly) linear relationships between a system’s behaviour and its capabilities don’t need to stay that way.
For example, I think Robert Miles was recently featured in a video on Computerphile (YouTube), in which he described how LLMs’ answers to “What happens if you break a mirror?” actually got worse with more capability.
As far as I understand it, you can have a system that behaves in a way that seems completely aligned, yet still hits a point of (let’s call it) “power” at which it starts behaving in a way that is not aligned (and/or becomes deceptive).
The fact that GPT-4 seems to be more aligned may well be because it hasn’t hit this point yet.
So I don’t see how the point you quoted would be an indicator of what future versions will bring, unless they can actually explain what exactly made the difference in behaviour, and why it would remain robust in more powerful systems (with access to their own code).
If I’m mistaken in my understanding, I’d be happy about corrections (: