On your bottom line, I entirely agree—to the extent that there are non-power-seeking strategies that’d be effective, I’m all for them. To the extent that we disagree, I think it’s about [what seems likely to be effective] rather than [whether non-power-seeking is a desirable property].
Constrained-power-seeking still seems necessary to me. (unfortunately)
A few clarifications:
I guess most technical AIS work is net negative in expectation. My ask there is that people work on clearer cases for their work being positive.
I don’t think my (or Eliezer’s) conclusions on strategy are downstream of [likelihood of doom]. I’ve formed some model of the situation. One output of the model is [likelihood of doom]. Another is [seemingly least bad strategies]. The strategies are based around why doom seems likely, not (primarily) that doom seems likely.
It doesn’t feel like “I am responding to the situation with the appropriate level of power-seeking given how extreme the circumstances are”.
Rather, it feels like the level of power-seeking I'm suggesting is necessary is also appropriate.
My cognitive biases push me away from enacting power-seeking strategies.
Biases aside, confidence in [power seems necessary] doesn’t imply confidence that I know what constraints I’d want applied to the application of that power.
In strategies I’d like, [constraints on future use of power] would go hand in hand with any [accrual of power].
It’s non-obvious that there are good strategies with this property, but the unconstrained version feels both suspicious and icky to me.
Suspicious, since [I don’t have a clue how this power will need to be directed now, but trust me—it’ll be clear later (and the right people will remain in control until then)] does not justify confidence.
To me, you seem to be over-rating the applicability of various reference classes in assessing [(inputs to) likelihood of doom]. As I think I’ve said before, it seems absolutely the correct strategy to look for evidence based on all the relevant reference classes we can find.
However, all else equal, I’d expect:
Spending a long time looking for x makes x feel more important.
[Wanting to find useful x] tends to shade into [expecting to find useful x] and [perceiving xs as more useful than they are].
Particularly so when [absent x, we’ll have no clear path to resolving hugely important uncertainties].
The world doesn’t owe us convenient reference classes. I don’t think there’s any way around inside-view analysis here—in particular, [how relevant/significant is this reference class to this situation?] is an inside-view question.
That doesn’t make my (or Eliezer’s, or …’s) analysis correct, but there’s no escaping that you’re relying on inside-view too. Our disagreement only escapes [inside-view dependence on your side] once we broadly agree on [the influence of inside-view properties on the relevance/significance of your reference classes]. I assume that we’d have significant disagreements there.
Though it still seems useful to figure out where. I expect that there are reference classes that we’d agree could clarify various sub-questions.
In many non-AI-x-risk situations, we would agree—some modest level of inside-view agreement would be sufficient to broadly agree about the relevance/significance of various reference classes.