It seems like a lot of people are still thinking of alignment in overly binary terms, which leads to critical errors in reasoning like: “there will be sufficient economic incentives to solve alignment”, and “once alignment is a bottleneck, nobody will want to deploy unaligned systems, since such systems won’t actually do what they want”.
It seems clear to me that:
1) These statements are true for a certain level of alignment, which I’ve called “approximate value learning” in the past (https://www.lesswrong.com/posts/rLTv9Sx3A79ijoonQ/risks-from-approximate-value-learning). I think I might have also referred to it as “pretty good alignment” or “good enough alignment” at various times.
2) This level of alignment is suboptimal from the point of view of x-safety, since the downside of extinction for the actors deploying the system is only a small fraction of the downside of extinction summed over all humans, so deployers internalize only a sliver of the risk they create (a simple expected-value sketch follows this list).
3) We will develop techniques for “good enough” alignment before we develop techniques that are acceptable from the standpoint of x-safety.
4) Therefore, the expected outcome is: once “good enough alignment” is developed, a lot of actors deploy systems that are aligned enough for them to benefit from them, but still carry an unacceptably high level of x-risk.
5) Thus, if we don’t improve alignment techniques quickly enough after developing “good enough alignment”, its development will likely lead to a period of increased x-risk (under the “alignment bottleneck” model).
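
To make point 2 concrete, here is a minimal expected-value sketch (the symbols $B$, $p$, and $c_i$ are mine, introduced for illustration, not from the original argument): suppose deploying a “good enough” system gives an actor a private benefit $B$ and carries an extinction probability $p$, where extinction costs that actor $c_i$ but costs humanity $\sum_j c_j \gg c_i$ in total. The actor’s own calculation can favor deployment even when the all-of-humanity calculation does not:

$$B - p \, c_i > 0 \quad \text{while} \quad B - p \sum_j c_j < 0.$$

This is the standard negative-externality structure: each deployer internalizes only $c_i$ of a downside whose true size is $\sum_j c_j$.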