You can try reading my post Consequentialism & Corrigibility, see if you find it helpful.

Thank you for the link, Steve. I recall having read your post a while back, but for some reason it slipped my mind while I was pondering the original question here. That said, while your post is tangentially related, it is not quite on point to my inquiry and concerns.
Your post is something of a direct answer to @Rohin Shah’s illustration in “Coherence arguments do not entail goal-directed behavior” (yet another post I should have linked to in my original post) of the fact that an agent optimizing for preferences over universe-histories, as opposed to mere future world states, can display “any behavior whatsoever.” More specifically, you explored corrigibility proposals in light of this fact, concluding that “preferences purely over future states are just fundamentally counter to corrigibility.” I haven’t thought about this topic enough to come to a definite judgment either way, but in any case, this is much more relevant to Eliezer’s tangent about corrigibility (in the quote I selected at the top of my post) than to the different object-level concern about whether coherence arguments warrant the degree of certitude Eliezer has about how sufficiently powerful agents will behave in the real world.
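As an aside for readers who haven’t seen Rohin’s argument, here is a minimal sketch of the point as I understand it (the toy policy and helper names are my own invention, purely for illustration): for any policy whatsoever, you can construct a utility function over full action-observation histories that the policy maximizes, which is why preferences over histories place essentially no constraint on behavior.

```python
# Toy illustration: any behavior maximizes *some* utility function over histories.
# (The policy and the history format here are invented purely for illustration.)

def arbitrary_policy(history):
    """Any behavior at all -- e.g., an agent that simply alternates two actions."""
    return "left" if len(history) % 2 == 0 else "right"

def make_history_utility(policy):
    """u(history) = 1 if every action in the history is what the policy would have
    chosen at that point, else 0. The policy trivially maximizes this utility."""
    def utility(history):
        for t, (action, _obs) in enumerate(history):
            if action != policy(history[:t]):
                return 0.0
        return 1.0
    return utility

u = make_history_utility(arbitrary_policy)
trajectory = []
for _ in range(4):
    trajectory.append((arbitrary_policy(trajectory), None))  # no observations in this toy
print(u(trajectory))                         # 1.0 -- the policy's own behavior
print(u([("left", None), ("left", None)]))   # 0.0 -- a deviating behavior
```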
Indeed, the section of your post that is by far the most relevant to my interests here is the following (which you wrote as more of an aside):
(Edit to add: There are very good reasons to expect future powerful AGIs to act according to preferences over distant-future states, and I join Eliezer in roundly criticizing people who think we can build an AGI that never does that; see this comment for discussion.)
For completeness, and to save others the effort of opening that link in a new tab, the relevant part of your referenced comment says the following:
I feel like I’m stuck in the middle…
On one side of me sits Eliezer, suggesting that future powerful AGIs will make decisions exclusively to advance their explicit preferences over future states
On the other side of me sits, umm, you, and maybe Richard Ngo, and some of the “tool AI” and GPT-3-enthusiast people, declaring that future powerful AGIs will make decisions based on no explicit preference whatsoever over future states.
Here I am in the middle, advocating that we make AGIs that do have preferences over future states, but also have other preferences.
I disagree with the 2nd camp for the same reason Eliezer does: I don’t think those AIs are powerful enough. More specifically: We already have neat AIs like GPT-3 that can do lots of neat things. But we have a big problem: sooner or later, somebody is going to come along and build a dangerous accident-prone consequentialist AGI. We need an AI that’s both safe, and powerful enough to solve that big problem. I usually operationalize that as “able to come up with good original creative ideas in alignment research, and/or able to invent powerful new technologies”. I think that, for an AI to do those things, it needs to do explicit means-end reasoning, autonomously come up with new instrumental goals and pursue them, etc. etc. For example, see discussion of “RL-on-thoughts” here.
Unfortunately, this still doesn’t present the level of evidence or reasoning that would persuade me (or even move me significantly) toward believing powerful AI will necessarily, or even likely, optimize largely for explicit preferences over future world states. It suffers from the same general problem as previous writings on this topic (including Eliezer’s), namely that it communicates strongly-held intuitions about “certain structures of cognition [...] that are good at stuff and do the work” without a solid grounding in either formal mathematical reasoning or real-world empirical evidence (although I suppose this is better than claiming the math actually proves the intuitions are right, when in fact it doesn’t).
We all know intuitions aren’t magic, but they are nonetheless useful for reasoning about complex topics when they function as gears in understanding, so I am certainly not claiming that reliance on intuitions is bad per se, especially in discussions of topics like AGI where empirical analysis is inherently tricky. On the contrary, I think I understand the relevant dynamic here quite well: you (just like Eliezer) have spent a ton of time thinking about how powerful optimizers reason, and in the process you have gained a lot of (mostly implicit) understanding of this. Analogously to how nobody starts off with great intuition about chess but can develop it tremendously over time by playing games, working through analyses, and doing puzzles, you have trained your intuition about consequentialist reasoning by working hard on the alignment problem, and should thus be more attuned to the ground-level territory than someone who hasn’t done the (in Eliezer-speak) “homework problems”. I am nonetheless still left with the (mostly self-imposed) task of figuring out whether those intuitions are correct.
One way of doing that would be to obtain irrefutable mathematical proof that Expected Utility maximizers come about when we optimize hard enough for the intelligence of an AI system, or at the very least that such systems would necessarily be exploitable if they don’t self-modify into an EU maximizer. Indeed, this is the very reason I made this question post, but it seems like this type of proof isn’t actually available, given that none of the answers or comments thus far have patched the holes in Eliezer’s arguments or explained how EJT might have been wrong. Another way would be to defer to the conclusions that smart people with experience in this area, like you or Eliezer, have reached; however, this also doesn’t work because it suffers from a few major issues:
it doesn’t seem like there is a strong consensus about these topics among people who have spent significant portions of their lives studying them. Views that seem plausible and consistent on their face and which oppose Eliezer’s on the question of advanced cognition include lsusr’s, Richard Ngo’s, Quintin Pope’s (1, 2, 3, etc.), and Alex Turner’s (1, 2, 3, 4, 5, etc.), among others. Deference is thus insufficient because it’s unclear who to defer to.
given that I already personally disagree entirely with some of Eliezer’s thinking on important topics where I believe I understand his perspective well, such as whether CEV makes sense conceptually (which I might write a post about at some point), it seems far less plausible to me that such deference is the correct way to go about this.
humans, the one example of general intelligence we have seen in the real world thus far, are not utility maximizers, and in any case suffer from serious problems in defining True Values so that “preferences” make coherent sense.
Of course, the one other way out of this conundrum is for me to follow the standard “think for yourself and reach your own conclusions” advice that’s usually given. Unfortunately, that also can’t work if “think for yourself” means “hypothesize really hard without any actual experimentation, HPJEV-style”. As it turns out, while I am not an alignment researcher, I think of myself as a reasonably smart guy who understands Eliezer’s perspective (“In particular, any (or virtually any) sufficiently advanced AI must be a consequentialist optimizer that is an agent as opposed to a tool and which acts to maximize expected utility according to its world model to pursue a goal that can be extremely different from what humans deem good”) and who has spent a fair bit of time pondering these matters, and even after all that I don’t see why Eliezer’s perspective is even likely to be true (let alone why it warrants the level of confidence he apparently has in these conclusions).
Indeed, in order for me (and for other people like me, who I imagine exist out here) to make progress on this matter, I would need something with more feedback loops that allows me to advance while remaining grounded in reality. In other words, I would need to see for myself the analogues of the aforementioned “games, analyses, and puzzles” that build the chess intuition. This is why an important part of my original post (which nobody has responded to yet) was the following:
When Eliezer says “they did not even do as many homework problems as I did,” I doubt he is referring to actual undergrad-style homework problems written nicely in LaTeX. Nevertheless, I would like to know whether there is some sort of publicly available repository of problem sets that illustrate the principles he is talking about. Meaning set-ups where you have an agent (of sorts) that is acting in a manner that’s either not utility-maximizing or even simply not consequentialist, followed by explanations of how you can exploit this agent. Given the centrality of consequentialism (and the associated money-pump and Dutch book-type arguments) to his thinking about advanced cognition and powerful AI, it would be nice to be able to verify whether working on these “homework problems” indeed results in the general takeaway Eliezer is trying to communicate.
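To be concrete about the kind of set-up I have in mind, here is a minimal sketch I put together myself (not taken from Eliezer or anyone else) of the classic money pump against an agent with cyclic preferences; the goods, the fee, and the helper names are all invented for illustration.

```python
# Toy money pump: an agent with cyclic preferences A > B > C > A will pay a small
# fee for each "upgrade" and end up holding its original good, strictly poorer.

# The agent strictly prefers the first good of each pair to the second.
CYCLIC_PREFERENCES = {("A", "B"), ("B", "C"), ("C", "A")}

def accepts_trade(offered, held, fee, money):
    """The agent trades whenever it strictly prefers the offered good and can pay the fee."""
    return (offered, held) in CYCLIC_PREFERENCES and money >= fee

def run_money_pump(start_good="B", money=10.0, fee=1.0, rounds=9):
    held = start_good
    preferred_over = {"B": "A", "C": "B", "A": "C"}  # what to offer against each held good
    for _ in range(rounds):
        offer = preferred_over[held]
        if accepts_trade(offer, held, fee, money):
            held, money = offer, money - fee
    return held, money

print(run_money_pump())  # ('B', 1.0) -- back to the starting good, 9 units poorer
```

The standard takeaway is that preferences representable by a single utility function cannot be cycled like this; what I remain unsure about is whether this kind of selection pressure is strong enough, in practice, to force real trained systems into that shape.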
Again, I think Eliezer’s perspective is:
HYPOTHESIS 1: “future powerful AIs will have preferences purely over states of the world in the distant future”
CONSEQUENCE 1: “AIs will satisfy coherence theorems, corrigibility is unnatural, etc.”
I think Eliezer is wrong because I think HYPOTHESIS 1 is likely to be false.
(I do think the step “If HYPOTHESIS 1 Then CONSEQUENCE 1” is locally valid—I agree with Eliezer about that.)
I do however believe:
HYPOTHESIS 2: future powerful AIs will have various preferences, at least some of which concern states of the world in the distant future.
This is weaker than HYPOTHESIS 1. HYPOTHESIS 2 does NOT imply CONSEQUENCE 1. In fact, if you grant HYPOTHESIS 2, it’s hard to get any solid conclusions out of it at all. More like “well, things might go wrong, but also they might not”. It’s hard to say anything more than that without talking about the AI training approach in some detail.
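To make the HYPOTHESIS 1 / HYPOTHESIS 2 contrast concrete, here is a toy sketch (the numbers and the “disfavored action” penalty are made up purely for illustration): an agent that scores plans only by the final state they reach, versus one that also cares about properties of the plan itself and will therefore pass up a better final state.

```python
# Toy contrast between a HYPOTHESIS 1 agent and a HYPOTHESIS 2 agent.
# Each plan is (final_state_value, uses_disfavored_action); all numbers are made up.

PLANS = {
    "plan_A": (10.0, True),   # best final state, but reached via a disfavored action
    "plan_B": (8.0, False),   # slightly worse final state, unobjectionable process
}

def hypothesis1_agent(plans):
    """Preferences purely over future states: pick the plan with the best final state."""
    return max(plans, key=lambda name: plans[name][0])

def hypothesis2_agent(plans, action_penalty=5.0):
    """Preferences over future states AND over the plan itself: the penalty depends
    on which actions the plan uses, not on the final state it reaches."""
    def score(name):
        value, disfavored = plans[name]
        return value - (action_penalty if disfavored else 0.0)
    return max(plans, key=score)

print(hypothesis1_agent(PLANS))  # plan_A
print(hypothesis2_agent(PLANS))  # plan_B -- gives up some final-state value
```

A toy like this obviously doesn’t settle which hypothesis describes future systems; it only illustrates why the second kind of agent isn’t pinned down by its final-state preferences alone.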
I think Eliezer’s alleged “homework problems” were about (the correct step) “If HYPOTHESIS 1 then CONSEQUENCE 1”, and that he didn’t do enough “homework problems” to notice that HYPOTHESIS 1 may be false.
You seem to be interested in yet another possibility:
HYPOTHESIS 3: it’s possible for there to be future powerful AIs that have no preferences whatsoever about states of the world in the distant future.
I think this is wrong but I agree that I don’t have a rock-solid argument for it being wrong (I don’t think I ever claimed to). Maybe see §5.3 of my Process-Based Supervision post for some more (admittedly intuitive) chatting about why I think (one popular vision of) Hypothesis 3 is wrong. Again, if you’re just trying to make the point that Eliezer is over-confident in doom for unsound reasons, then the argument over HYPOTHESIS 2 versus HYPOTHESIS 3 is unnecessary for that point. HYPOTHESIS 2 is definitely a real possibility (proof: human brains exist), and that’s already enough to undermine our confidence in CONSEQUENCE 1.