Again, I think Eliezer’s perspective is:
HYPOTHESIS 1: “future powerful AIs will have preferences purely over states of the world in the distant future”
CONSEQUENCE 1: “AIs will satisfy coherence theorems, corrigibility is unnatural, etc.”
I think Eliezer is wrong because I think HYPOTHESIS 1 is likely to be false.
(I do think the step “If HYPOTHESIS 1 then CONSEQUENCE 1” is locally valid—I agree with Eliezer about that; the toy sketch below gestures at why.)
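To gesture at what CONSEQUENCE 1 is pointing at, here’s a deliberately crude money-pump toy (my own illustration, not anything from Eliezer’s writing, and much sloppier than the actual coherence theorems, which are about lotteries over outcomes): an agent whose choices among far-future world-states can’t be summarized by one consistent ranking can be walked in circles and drained of resources.

```python
# Toy money-pump: an agent with cyclic preferences over far-future world-states
# ("A", "B", "C" are hypothetical outcome labels) pays a small fee for every
# trade it strictly prefers, and ends up back where it started, poorer.

# Cyclic (incoherent) preferences: prefers B over A, C over B, and A over C.
prefers = {("B", "A"), ("C", "B"), ("A", "C")}  # (preferred, dispreferred)

def accepts_trade(current, offered):
    """The agent trades (and pays 1 unit) whenever it strictly prefers the offer."""
    return (offered, current) in prefers

state, resources = "A", 10
for offered in ["B", "C", "A", "B", "C", "A"]:  # the pump just cycles its offers
    if accepts_trade(state, offered):
        state, resources = offered, resources - 1

print(state, resources)  # "A" 4 -- same world-state as before, 6 units poorer
```

The coherence-theorem point is roughly that an agent which can’t be exploited like this behaves as if it’s maximizing some fixed utility function over those far-future states, and a pure far-future utility maximizer has no reason to let you pause or correct it along the way. That’s the package I’m granting is valid under HYPOTHESIS 1.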
I do however believe:
HYPOTHESIS 2: future powerful AIs will have various preferences, at least some of which concern states of the world in the distant future.
This is weaker than HYPOTHESIS 1. HYPOTHESIS 2 does NOT imply CONSEQUENCE 1. In fact, if you grant HYPOTHESIS 2, it’s hard to get any solid conclusions out of it at all. More like “well, things might go wrong, but also they might not”. It’s hard to say anything more than that without talking about the AI training approach in some detail.
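To make that concrete with a deliberately silly toy model (again my own, with made-up numbers, not a claim about how any real training run shakes out): once the AI’s preferences include terms about its own present conduct, and not just about far-future world-states, whether it resists correction depends on how those terms trade off, which is exactly the kind of detail that hinges on the training approach.

```python
# Toy mixed-preference agent: total preference = a far-future outcome term
# plus a term about the act itself (here, a distaste for resisting shutdown).
# The action names, values, and the weight 5.0 are all made up for illustration.

def action_score(action, future_value, conduct_penalty, conduct_weight=5.0):
    """Score = far-future value minus a weighted penalty on the conduct itself."""
    return future_value[action] - conduct_weight * conduct_penalty[action]

future_value    = {"comply_with_shutdown": 0.0, "resist_shutdown": 10.0}
conduct_penalty = {"comply_with_shutdown": 0.0, "resist_shutdown": 3.0}

best = max(future_value, key=lambda a: action_score(a, future_value, conduct_penalty))
print(best)  # "comply_with_shutdown" -- the conduct term outweighs the future term
```

Of course, nothing guarantees the weights come out that way either; that’s the “well, things might go wrong, but also they might not” situation.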
I think Eliezer’s alleged “homework problems” were about (the correct step) “If HYPOTHESIS 1 then CONSEQUENCE 1”, and that he didn’t do enough “homework problems” to notice that HYPOTHESIS 1 may be false.
You seem to be interested in yet another possibility:
HYPOTHESIS 3: it’s possible for there to be future powerful AIs that have no preferences whatsoever about states of the world in the distant future.
I think this is wrong, but I agree that I don’t have a rock-solid argument that it’s wrong (I don’t think I ever claimed to have one). Maybe see §5.3 of my Process-Based Supervision post for some more (admittedly intuitive) chatting about why I think (one popular vision of) HYPOTHESIS 3 is wrong. Again, if you’re just trying to make the point that Eliezer is over-confident in doom for unsound reasons, then the argument over HYPOTHESIS 2 versus HYPOTHESIS 3 is unnecessary: HYPOTHESIS 2 is definitely a real possibility (proof: human brains exist), and that’s already enough to undermine our confidence in CONSEQUENCE 1.