Thanks a lot for this post, I found it extremely helpful and expect I will refer to it a lot in thinking through different threat models.
I’d be curious to hear how you think the Production Web stories differ from part 1 of Paul’s “What failure looks like”.
To me, the underlying threat model seems to be basically the same: we deploy AI systems with objectives that look good in the short run, but once those systems become as capable as or more capable than humans, their objectives don’t generalise “well” (i.e. in ways desirable by human standards), because they’re optimising for proxies (namely, a cluster of objectives that could loosely be described as “maximise production” within their industry sector) that eventually come apart from what we actually want (“maximising production” eventually means using up resources critical to human survival but non-critical to machines).
From reading some of the comment threads between you and Paul, it seems like you disagree about where, on the margin, resources should be spent (improving the cooperative capabilities of AI systems and humans vs improving single-single intent alignment) - but you agree on this particular underlying threat model?
It also seems like you emphasise different aspects of these threat models: you emphasise the role of competitive pressures more (but they’re also implicit in Paul’s story), and Paul emphasises failures of intent alignment more (but they’re also present in your story) - though this is consistent with having the same underlying threat model?
(Of course, both you and Paul also have other threat models, e.g. you have Flash War, and Paul has part 2 of “What failure looks like”, as well as Another (outer) alignment failure story, which seems to be basically a more nuanced version of part 1 of “What failure looks like”. Here, I’m curious specifically about the two threat models I’ve picked out.)
(I could have lots of this totally wrong, and would appreciate being corrected if so)