Vika comments on Refining the Sharp Left Turn threat model, part 1: claims and mechanisms

Vika 20 Dec 2023 19:52 UTC
LW: 12 AF: 7
0
AF
I still endorse the breakdown of “sharp left turn” claims in this post. Writing this helped me understand the threat model better (or at all) and make it a bit more concrete.
This post could be improved by explicitly relating the claims to the “consensus” threat model summarized in Clarifying AI X-risk. Overall, SLT seems like a special case of that threat model, which makes a subset of the SLT claims:
- Claim 1 (capabilities generalize far) and Claim 3 (humans fail to intervene), but not Claims 1a/b (simultaneous / discontinuous generalization) or Claim 2 (alignment techniques stop working).
- It probably relies on some weaker version of Claim 2 (alignment techniques failing to apply to more powerful systems in some way). This seems necessary for deceptive alignment to arise, e.g. if our interpretability techniques fail to detect deceptive reasoning. However, I expect that most ways this could happen would not be due to the alignment techniques being fundamentally inadequate for the capability transition to more powerful systems (the strong version of Claim 2 used in SLT).
What links here?
- Voting Results for the 2022 Review by Ben Pace (2 Feb 2024 20:34 UTC; 57 points)
- habryka's comment on The LessWrong 2022 Review: Review Phase by RobertM (10 Jan 2024 22:04 UTC; 17 points)