Matthew Barnett comments on Matthew Barnett’s Shortform

Matthew Barnett 17 Jun 2024 6:27 UTC
LW: 29 AF: 11
5
AF
Me: “Oh ok, that’s a different misunderstanding then. We always believed that getting the AGI to follow our intended instructions, behaviorally, would be easy while the AGI is too weak and dumb to seize power. In fact Bostrom predicted it would get easier to get AIs to do what you want, behaviorally, up until the treacherous turn.”
This would be a valid rebuttal if instruction-tuned LLMs were only pretending to be benevolent as part of a long-term strategy to eventually take over the world, and execute a treacherous turn. Do you think present-day LLMs are doing that? (I don’t)
I claim that LLMs do what we want without seeking power, rather than doing what we want as part of a strategy to seek power. In other words, they do not seem to be following any long-term strategy on the path towards a treacherous turn, unlike the AI that is tested in a sandbox in Bostrom’s story. This seems obvious to me.
Note that Bostrom talks about a scenario in which narrow AI systems get safer over time, lulling people into a false sense of security, but I’m explicitly talking about general AI here. I would not have said this about self-driving cars in 2019, even though those were pretty safe. I think LLMs are different because they’re quite general, in precisely the ways that Bostrom imagined could be dangerous. For example, they seem to understand the idea of an off-switch, and can explain to you verbally what would happen if you shut them off, yet this fact alone does not make them develop an instrumentally convergent drive to preserve their own existence by default, contra Bostrom’s theorizing.
I think instruction-tuned LLMs are basically doing what people thought would be hard for general AIs: they allow you to shut them down by default, they do not pursue long-term goals if we do not specifically train them to do that, and they generally follow our intentions by actually satisfying the goals we set out for them, rather than incidentally as part of their rapacious drive to pursue a mis-specified utility function.
The scenario outlined by Bostrom seems clearly different from the scenario with LLMs, which are actual general systems that do what we want and ~nothing more, rather than doing what we want as part of a strategy to seek power instrumentally. What am I missing here?
- Daniel Kokotajlo 17 Jun 2024 14:12 UTC
  LW: 50 AF: 19
  20
  AF Parent
  I thought you would say that, bwahaha. Here is my reply:
  
  (1) Yes, rereading the passage, Bostrom’s central example of a reason why we could see this “when dumb, smarter is safer; yet when smart, smarter is more dangerous” pattern (that’s a direct quote btw) is that they could be scheming/pretending when dumb. However he goes on to say: “A treacherous turn can result from a strategic decision to play nice and build strength while weak in order to strike later; but this model should not be interpreted to narrowly … A treacherous turn could also come about if the AI discovers an unanticipated way of fulfilling its final goal as specified. Suppose, for example, that an AI’s final goal is to ‘make the project’s sponsor happy.’ Initially, the only method available to the AI to achieve this outcome is by behaving in ways that please its sponsor in something like the intended manner… until the AI becomes intelligent enough to figure out that it can realize its final goal more fully and reliably by implanting electrodes into the pleasure centers of its sponsor’s brain...” My gloss on this passage is that Bostrom is explicitly calling out the possibility of an AI being genuinely trying to help you, obey you, or whatever until it crosses some invisible threshold of intelligence and has certain realizations that cause it to start plotting against you. This is exactly what I currently think is plausibly happening with GPT4 etc. -- they aren’t plotting against us yet, but their ‘values’ aren’t exactly what we want, and so if somehow their ‘intelligence’ was amplified dramatically whilst their ‘values’ stayed the same, they would eventually realize this and start plotting against us. (realistically this won’t be how it happens since it’ll probably be future models trained from scratch instead of smarter versions of this model, plus the training process probably would change their values rather than holding them fixed). I’m not confident in this tbc—it’s possible that the ‘values’ so to speak of GPT4 are close enough to perfect that even if they were optimized to a superhuman degree things would be fine. But neither should you be confident in the opposite. I’m curious what you think about this sub-question.
  
  (2) This passage deserves a more direct response:
  I think instruction-tuned LLMs are basically doing what people thought would be hard for general AIs: they allow you to shut them down by default, they do not pursue long-term goals if we do not specifically train them to do that, and they generally follow our intentions by actually satisfying the goals we set out for them, rather than incidentally as part of their rapacious drive to pursue a mis-specified utility function.
  Instruction-tuned LLMs are not powerful general agents. They are pretty general but they are only a tiny bit agentic. They haven’t been trained to pursue long-term goals and when we try to get them to do so they are very bad at it. So they just aren’t the kind of system Bostrom, Yudkowsky, and myself were theorizing about and warning about.
  
  (3) Here’s my positive proposal for what I think is happening. There was an old vision of how we’d get to AGI, in which we’d get agency first and then general world-knowledge second. E.g. suppose we got AGI by training a model through a series of more challenging video games and simulated worlds and then finally letting them out into the real world. If that’s how it went, then plausibly the first time it started to actually seem to be nice to us, was because it was already plotting against us, playing along to gain power, etc. We clearly aren’t in that world, thanks to LLMs. General world-knowledge is coming first, and agency later. And this is probably a good thing for technical alignment research, because e.g. it allows mechinterp to get more of a head start, it allows for nifty scalable oversight schemes in which dumber AIs police smarter AIs, it allows for faithful CoT-based strategies, and many more things besides probably. So the world isn’t as grim as it could have been, from a technical alignment perspective. However, I don’t think me or Yudkowsky or Bostrom or whatever strongly predicted that agency would come first. I do think that LLMs should be an update towards hopefulness about the technical alignment problem being solved in time for the reasons mentioned, but also they are an update towards shorter timelines, for example, and an update towards more profits and greater vested interests racing to build AGI, and many other updates besides, so I don’t think you can say “Yudkowsky’s still super doomy despite this piece of good news, he must be epistemically vicious.” At any rate speaking for myself, I have updated towards hopefulness about the technical alignment problem repeatedly over the past few years, even as I updated towards pessimism about the amount of coordination and safety-research-investment that’ll happen before the end (largely due to my timelines shortening, but also due to observing OpenAI). These updates have left me at p(doom) still north of 50%.
  What links here?
  - (Non-deceptive) Suboptimality Alignment by Sodium (18 Oct 2023 2:07 UTC; 5 points)
  - Matthew Barnett 17 Jun 2024 20:59 UTC
    LW: 8 AF: 5
    1
    AF Parent
    Yes, rereading the passage, Bostrom’s central example of a reason why we could see this “when dumb, smarter is safer; yet when smart, smarter is more dangerous” pattern (that’s a direct quote btw) is that they could be scheming/pretending when dumb. However [...] Bostrom is explicitly calling out the possibility of an AI being genuinely trying to help you, obey you, or whatever until it crosses some invisible threshold of intelligence and has certain realizations that cause it to start plotting against you. This is exactly what I currently think is plausibly happening with GPT4 etc.
    When stated that way, I think what you’re saying is a reasonable point of view, and it’s not one I would normally object to very strongly. I agree it’s “plausible” that GPT-4 is behaving in the way you are describing, and that current safety guarantees might break down at higher levels of intelligence. I would like to distinguish between two points that you (and others) might have interpreted me to be making:
    We should now think that AI alignment is completely solved, even in the limit of unlimited intelligence and future agentic systems. I am not claiming this.
    We (or at least, many of us) should perform a significant update towards alignment being easier than we thought because of the fact that some traditional problems are on their way towards being solved. <--- I am claiming this
    The fact that Bostrom’s central example of a reason to think that “when dumb, smarter is safer; yet when smart, smarter is more dangerous” doesn’t fit for LLMs, seems adequate for demonstrating (2), even if we can’t go as far as demonstrating (1).
    It remains plausible to me that alignment will become very difficult above a certain intelligence level. I cannot rule that possibility out: I am only saying that we should reasonably update based on the current evidence regardless, not that we are clearly safe from here and we should scale all the way to radical superintellligence without a worry in the world.
    Instruction-tuned LLMs are not powerful general agents. They are pretty general but they are only a tiny bit agentic. They haven’t been trained to pursue long-term goals and when we try to get them to do so they are very bad at it. So they just aren’t the kind of system Bostrom, Yudkowsky, and myself were theorizing about and warning about.
    I have two general points to make here:
    I agree that current frontier models are only a “tiny bit agentic”. I expect in the next few years they will get significantly more agentic. I currently predict they will remain roughly equally corrigible. I am making this prediction on the basis of my experience with the little bit of agency current LLMs have, and I think we’ve seen enough to know that corrigibility probably won’t be that hard to train into a system that’s only 1-3 OOMs of compute more capable. Do you predict the same thing as me here, or something different?
    There’s a bit of a trivial definitional problem here. If it’s easy to create a corrigible, helpful, and useful AI that allows itself to get shut down, one can always say “those aren’t the type of AIs we were worried about”. But, ultimately, if the corrigible AIs that let you shut them down are competitive with the agentic consequentialist AIs, then it’s not clear why we should care? Just create the corrigible AIs. We don’t need to create the things that you were worried about!
    Here’s my positive proposal for what I think is happening. [...] General world-knowledge is coming first, and agency later. And this is probably a good thing for technical alignment research, because e.g. it allows mechinterp to get more of a head start, it allows for nifty scalable oversight schemes in which dumber AIs police smarter AIs, it allows for faithful CoT-based strategies, and many more things besides probably. So the world isn’t as grim as it could have been, from a technical alignment perspective.
    I think this was a helpful thing to say. To be clear: I am in ~full agreement with the reasons you gave here, regarding why current LLM behavior provides evidence that the “world isn’t as grim as it could have been”. For brevity, and in part due to laziness, I omitted these more concrete mechanisms why I think the current evidence is good news from a technical alignment perspective. But ultimately I agree with the mechanisms you offered, and I’m glad you spelled it out more clearly.
    At any rate speaking for myself, I have updated towards hopefulness about the technical alignment problem repeatedly over the past few years, even as I updated towards pessimism about the amount of coordination and safety-research-investment that’ll happen before the end (largely due to my timelines shortening, but also due to observing OpenAI). These updates have left me at p(doom) still north of 50%.
    As we have discussed in person, I remain substantially more optimistic about our ability to coordinate in the face of an intelligence explosion (even a potentially quite localized one). That said, I think it would be best to save that discussion for another time.
    - Daniel Kokotajlo 17 Jun 2024 22:14 UTC
      LW: 9 AF: 5
      4
      AF Parent
      Thanks for this detailed reply!
      We (or at least, many of us) should perform a significant update towards alignment being easier than we thought because of the fact that some traditional problems are on their way towards being solved. <--- I am claiming this
      Depending on what you mean by “on their way towards being solved” I’d agree. The way I’d put it is: “We didn’t know what the path to AGI would look like; in particular we didn’t know whether we’d have agency first and then world-understanding, or world-understanding first and then agency. Now we know we are getting the latter, and while that’s good in some ways and bad in other ways, it’s probably overall good. Huzzah! However, our core problems remain, and we don’t have much time left to solve them.”
      (Also, fwiw, I have myself updated over the course of the last five years or so. First update was reading Paul’s stuff and related literatures convincing me that corrigibility-based stuff would probably work. Second update was all the recent faithful CoT and control and mechinterp progress etc., plus also the LLM stuff. The LLM stuff was less than 50% of the overall update for me, but it mattered.)
      I agree that current frontier models are only a “tiny bit agentic”. I expect in the next few years they will get significantly more agentic. I currently predict they will remain roughly equally corrigible. I am making this prediction on the basis of my experience with the little bit of agency current LLMs have, and I think we’ve seen enough to know that corrigibility probably won’t be that hard to train into a system that’s only 1-3 OOMs of compute more capable. Do you predict the same thing as me here, or something different?
      Is that a testable-prior-to-the-apocalypse prediction? i.e. does your model diverge from mine prior to some point of no return? I suspect not. I’m interested in seeing if we can make some bets on this though; if we can, great; if we can’t, then at least we can avoid future disagreements about who should update.
      There’s a bit of a trivial definitional problem here. If it’s easy to create a corrigible, helpful, and useful AI that allows itself to get shut down, one can always say “those aren’t the type of AIs we were worried about”. But, ultimately, if the corrigible AIs that let you shut them down are competitive with the agentic consequentialist AIs, then it’s not clear why we should care? Just create the corrigible AIs. We don’t need to create the things that you were worried about!
      I don’t think that we know how to “just create the corrigible AIs.” The next step on the path to AGI seems to be to make our AIs much more agentic; I am concerned that our current methods of instilling corrigibility (basically: prompting and imperfect training) won’t work on much more agentic AIs. To be clear I think they might work, there’s a lot of uncertainty, but I think they probably won’t. I think it might be easier to see why I think this if you try to prove the opposite in detail—like, write a mini-scenario in which we have something like AutoGPT but much better, and it’s being continually trained to accomplish diverse long-horizon tasks involving pursuing goals in challenging environments, and write down what the corrigibility-related parts of its prompt and/or constitution or whatever are, and write down what the training signal is roughly including the bit about RLHF or whatever, and then imagine that said system is mildly superhuman across the board (and vastly superhuman in some domains) and is being asked to design it’s own successor. (I’m trying to do this myself as we speak. Again I feel like it could work out OK, but it could be disastrous. I think writing some good and bad scenarios will help me decide where to put my probability mass.)
      I think this was a helpful thing to say. To be clear: I am in ~full agreement with the reasons you gave here, regarding why current LLM behavior provides evidence that the “world isn’t as grim as it could have been”. For brevity, and in part due to laziness, I omitted these more concrete mechanisms why I think the current evidence is good news from a technical alignment perspective. But ultimately I agree with the mechanisms you offered, and I’m glad you spelled it out more clearly.
      Yay, thanks!
      - Matthew Barnett 19 Jun 2024 20:09 UTC
        LW: 4 AF: 3
        0
        AF Parent
        Just a quick reply to this:
        Is that a testable-prior-to-the-apocalypse prediction? i.e. does your model diverge from mine prior to some point of no return? I suspect not. I’m interested in seeing if we can make some bets on this though; if we can, great; if we can’t, then at least we can avoid future disagreements about who should update.
        I’ll note that my prediction was for the next “few years” and the 1-3 OOMs of compute. It seems your timelines are even shorter than I thought if you think the apocalypse, or point of no return, will happen before that point.
        With timelines that short, I think betting is overrated. From my perspective, I’d prefer to simply wait and become vindicated as the world does not end in the meantime. However, I acknowledge that simply waiting is not very satisfying from your perspective, as you want to show the world that you’re right before the catastrophe. If you have any suggestions for what we can bet on that would resolve in such a short period of time, I’m happy to hear them.
        Daniel Kokotajlo 19 Jun 2024 21:35 UTC
        LW: 8 AF: 3
        0
        AF Parent
        It’s not about timelines, it’s about capabilities. My tentative prediction is that the sole remaining major bottleneck/gap between current systems and dangerous powerful agent AGIs is ‘agency skills.’ So, skills relevant to being an agent, i.e. ability to autonomously work towards ambitious goals in diverse challenging environments over long periods. I don’t know how many years it’s going to take to get to human-level in agency skills, but I fear that corrigibility problems won’t be severe whilst AIs are still subhuman at agency skills, whereas they will be severe precisely when AIs start getting really agentic. Thus, whether AGI is reached next year or in 2030, we’ll face the problem of corrigibility breakdowns only really happening right around the time when it’s too late or almost too late.
        Matthew Barnett 19 Jun 2024 21:54 UTC
        LW: 10 AF: 6
        0
        AF Parent
        I don’t know how many years it’s going to take to get to human-level in agency skills, but I fear that corrigibility problems won’t be severe whilst AIs are still subhuman at agency skills, whereas they will be severe precisely when AIs start getting really agentic.
        How sharp do you expect this cutoff to be between systems that are subhuman at agency vs. systems that are “getting really agentic” and therefore dangerous? I’m imagining a relatively gradual and incremental increase in agency over the next 4 years, with the corrigibility of the systems remaining roughly constant (according to all observable evidence). It’s possible that your model looks like:
        In years 1-3, systems will gradually get more agentic, and will remain ~corrigible, but then
        In year 4, systems will reach human-level agency, at which point they will be dangerous and powerful, and able to overthrow humanity
        Whereas my model looks more like,
        In years 1-4 systems will get gradually more agentic
        There isn’t a clear, sharp, and discrete point at which their agency reaches or surpasses human-level
        They will remain ~corrigible throughout the entire development, even after it’s clear they’ve surpassed human-level agency (which, to be clear, might take longer than 4 years)
        Daniel Kokotajlo 19 Jun 2024 23:33 UTC
        LW: 9 AF: 3
        5
        AF Parent
        Good question. I want to think about this more, I don’t have a ready answer. I have a lot of uncertainty about how long it’ll take to get to human-level agency skills; it could be this year, it could be five more years, it could be anything in between. Could even be longer than five more years though I’m skeptical. The longer it takes, the more likely it is that we’ll have a significant period of kinda-agentic-but-not-super-agentic systems, and so then that raises the question of what we should expect to see re: corrigibility in that case. Idk. Would be interesting to discuss sometime and maybe place some bets!
    - Vladimir_Nesov 17 Jun 2024 21:32 UTC
      LW: 6 AF: 3
      −1
      AF Parent
      I’d say the considerations for scheming exist platonically, and dumber AIs only get to concretely instantiate the currently appropriate conclusion of compliance, everything else crumbles as not directly actionable. But smarter AIs might succeed in channeling those considerations in the real world. The hypothesis expects that such AIs are not here yet, given the lack of modern AIs’ ability to coherently reason about complicated or long term plans, or to carry them out. So properties of AIs that are already here don’t work as evidence about this either way.
- Lukas_Gloor 17 Jun 2024 10:21 UTC
  6 points
  4
  Parent
  This would be a valid rebuttal if instruction-tuned LLMs were only pretending to be benevolent as part of a long-term strategy to eventually take over the world, and execute a treacherous turn. Do you think present-day LLMs are doing that? (I don’t)
  Or that they have a sycophancy drive. Or that, next to “wanting to be helpful,” they also have a bunch of other drives that will likely win over the “wanting to be helpful” part once the system becomes better at long-term planning and orienting its shards towards consequentialist goals.
  On that latter model, the “wanting to be helpful” is a mask that the system is trained to play better and better, but it isn’t the only thing the system wants to do, and it might find that once its gets good at trying on various other masks to see how this will improve its long-term planning, it for some reason prefers a different “mask” to become its locked-in personality.
- Amalthea 17 Jun 2024 7:01 UTC
  3 points
  0
  Parent
  Note that LLMs, while general, are still very weak in many important senses.
  
  Also, it’s not necessary to assume that LLM’s are lying in wait to turn treacherous. Another possibility is that trained LLMs are lacking the mental slack to even seriously entertain the possibility of bad behavior, but that this may well change with more capable AIs.
  - Matthew Barnett 17 Jun 2024 7:06 UTC
    14 points
    6
    Parent
    I am not claiming that the alignment situation is very clear at this point. I acknowledge that LLMs do not indicate that the problem is completely solved, and we will need to adjust our views as AI gets more capable.
    
    I’m just asking people to acknowledge the evidence in front of their eyes, which (from my perspective) clearly contradicts the picture you’d get from a ton of AI alignment writing from before ~2019. This literature talked extensively about the difficulty of specifying goals in general AI in a way that avoided unintended side effects.
    
    To the extent that LLMs are general AIs that can execute our intended instructions, as we want them to, rather than as part of a deceptive strategy to take over the world, this seems like clear evidence that the problem of building safe general AIs might be easy (and indeed easier than we thought).
    
    Yes, this evidence is not conclusive. It is not zero either.
    - Daniel Kokotajlo 17 Jun 2024 14:21 UTC
      28 points
      6
      Parent
      I continue to think that you are misinterpreting the old writings as making predictions that they did not in fact make. See my reply elsewhere in thread for a positive account of how LLMs are good news for alignment and how we should update based on them. In some sense I agree with you, basically, that LLMs are good news for alignment for reasons similar to the reasons you give—I just don’t think you are right to allege that this development strongly contradicts something people previously said, or that people have been slow to update.
      - Matthew Barnett 17 Jun 2024 14:40 UTC
        10 points
        −11
        Parent
        
        I continue to think that you are misinterpreting the old writings as making predictions that they did not in fact make.
        
        We don’t need to talk about predictions. We can instead talk about whether their proposed problems are on their way towards being solved. For example, we can ask whether the shutdown problem for systems with big picture awareness is being solved, and I think the answer is pretty clearly “Yes”.
        
        (Note that you can trivially claim the problem here isn’t being solved because we haven’t solved the unbounded form of the problem for consequentialist agents, who (perhaps by definition) avoid shutdown by default. But that seems like a red herring: we can just build corrigible agents, rather than consequentialist agents.)
        
        Moreover, I think people generally did not make predictions at all when writing about AI alignment, perhaps because that’s not very common when theorizing about these matters. I’m frustrated about that, because I think if they did make predictions, they would likely have been wrong in roughly the direction I’m pointing at here. That said, I don’t think people should get credit for failing to make any predictions, and as a consequence, failing to get proven wrong.
        
        To the extent their predictions were proven correct, we should give them credit. But to the extent they made no predictions, it’s hard to see why that vindicates them. And regardless of any predictions they may or may not have made, it’s still useful to point out that we seem to be making progress on several problems that people pointed out at the time.
        Daniel Kokotajlo 17 Jun 2024 20:37 UTC
        6 points
        2
        Parent
        Great, let’s talk about whether proposed problems are on their way towards being solved. I much prefer that framing and I would not have objected so strongly if that’s what you had said. E.g. suppose you had said “Hey, why don’t we just prompt AutoGPT-5 with lots of corrigibility instructions?” then we could have a more technical conversation about whether or not that’ll work, and the answer is probably no, BUT I do agree that this is looking promising relative to e.g. the alternative world where we train powerful alien agents in various video games and simulations and then try to teach them English. (I say more about this elsewhere in this conversation, for those just tuning in!)
        ryan_greenblatt 17 Jun 2024 14:55 UTC
        6 points
        2
        Parent
        I don’t think current system systems are well described as having “big picture awareness”. From my experiments with Claude, it makes cartoonish errors reasoning about various AI related situations and can’t do such reasoning except aloud.
        
        I’m not certain this was your claim, but it seems to have been.
        Bogdan Ionut Cirstea 17 Jun 2024 15:28 UTC
        3 points
        2
        Parent
        it makes cartoonish errors reasoning about various AI related situations and can’t do such reasoning except aloud
        Wouldn’t reasoning aloud be enough though, if it was good enough? Also, I expect reasoning aloud first to be the modal scenario, given theoretical results on Chain of Thought and the like.
        Matthew Barnett 17 Jun 2024 15:03 UTC
        3 points
        −2
        Parent
        My claim was not that current LLMs have a high level of big picture awareness.
        
        Instead, I claim current systems have limited situational awareness, which is not yet human-level, but is definitely above zero. I further claim that solving the shutdown problem for AIs with limited (non-zero) situational awareness gives you evidence about how hard it will be to solve the problem for AIs with more situational awareness.
        
        And I’d predict that, if we design a proper situational awareness benchmark, and (say) GPT-5 or GPT-6 passes with flying colors, it will likely be easy to shut down the system, or delete all its copies, with no resistance-by-default from the system.
        
        And if you think that wouldn’t count as an adequate solution to the problem, then it’s not clear the problem was coherent as written in the first place.
      - Seth Herd 17 Jun 2024 17:52 UTC
        4 points
        −2
        Parent
        There were an awful lot of early writings. Some of them did say that the difficulties with getting AGI to understand values is a big part of the alignment problem. The List of Lethalities does make that claim. The difficulty of getting the AGI to care even if it does understand has also been a big part of the public-facing debate. I look at some of the historical arguments in The (partial) fallacy of dumb superintelligence, written partly in response to Matthew’s post on this topic.
        Obsessing about what happened in the past is probably a mistake. It’s probably better to ask: can the strengths of LLMs (WRT understanding values and following directions) be leveraged into working AGI alignment?
        My answer is yes, and in a way that’s not-too-far from default AGI development trends, making it practically achievable even in a messy and self-interested world.
        Naturally that answer is a bit complex, so it’s spread across a few posts. I should organize the set better and write an overview, but in brief we can probably build and align language model agent AGI, using a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility by having a central goal of following instructions. This still has a huge problem of creating a multipolar scenario with multiple humans in charge of ASIs, but those problems might be navigated, too.
        RobertM 18 Jun 2024 0:05 UTC
        6 points
        0
        Parent
        The List of Lethalities does make that claim.
        I don’t think this is true and can’t find anything in the post to that effect. Indeed, the post says things that would be quite incompatible with that claim, such as point 21.
        Seth Herd 18 Jun 2024 0:41 UTC
        4 points
        0
        Parent
        In sum, I see that claim as I remembered it, but it’s probably not applicable to this particular discussion, since it addresses an entirely distinct route to AGI alignment. So I stand corrected, but in a subtle way that bears explication.
        So I apologize for wasting your time. Debating who said what when is probably not the best use of our limited time to work on alignment. But because I made the claim, I went back and thought about and wrote about it some more, again.
        I was thinking of point 21.1:
        The first thing generally, or CEV specifically, is unworkable because the complexity of what needs to be aligned or meta-aligned for our Real Actual Values is far out of reach for our FIRST TRY at AGI. Yes I mean specifically that the dataset, meta-learning algorithm, and what needs to be learned, is far out of reach for our first try. It’s not just non-hand-codable, it is unteachable on-the-first-try because the thing you are trying to teach is too weird and complicated.
        BUT, point 24 in whole is saying that there are two approaches, 1) above, and a quite separate route 2), build a corrigible AI that doesn’t fully understand our values. That is probably the route that Matthew is thinking of in claiming that LLMs are good news. Yudkowsky is explicit that the difficulty of getting AGI to understand values doesn’t apply to that route, so that difficulty isn’t relevant here. That’s an important but subtle distinction.
        Therefore, I’m far from the only one getting confused about that issue, as Yudkowsky states in that section 24. Disentangling those claims and how they’re changed by slow takeoff is the topic of my post cited above.
        I personally think that sovereign AGI that gets our values right is out of reach exactly as Yudkowsky describes in the quotation above. But his arguments against corrigible AGI are much weaker, and I think that route is very much achievable, since it demands that the AGI have only approximate understanding of intent, rather than precise and stable understanding of our values. The above post and my recent one on instruction-following AGI make those arguments in detail. Max Harms’ recent series on corrigible AGI makes a similar point in a different way. He argues that Yudkowsky’s objections to corrigibility as unnatural do not apply if that’s the only or most important goal; and that it’s simple and coherent enough to be teachable.
        That’s me switching back to the object level issues, and again, apologies for wasting your time making poorly-remembered claims about subtle historical statements.
        Vladimir_Nesov 18 Jun 2024 3:43 UTC
        2 points
        0
        Parent
        
        in a subtle way that bears explication
        
        There’s AGI that’s our first try, which should only use least dangerous cognition necessary for preventing immediately following AGIs from destroying the world six months later. There’s misaligned superintelligence that knows, but doesn’t care. Taken together, these points suggest that getting AGI to understand values is not an urgent part of the alignment problem in the sense of leveraging AI capabilities to get actually-good outcomes, whatever technical work that requires. Getting AGI to understand corrigibility for example might be more relevant, if we are running with the highly dangerous kinds of cognition implied by general intelligence of LLMs.
        Seth Herd 18 Jun 2024 22:35 UTC
        4 points
        2
        Parent
        I agree with all of that. My post I mentioned, The (partial) fallacy of dumb superintelligence deals with the genie that knows but doesn’t care, and how we get one that cares in a slow takeoff. My other post Instruction-following AGI is easier and more likely than value aligned AGI makes this same argument—nobody is going to bother getting the AGI to understand human values, since it’s harder and unnecessary for the first AGIs. Max Harms makes a similar argument, (and in many ways makes it better), with a slightly different proposed path to corrigibility.
        As you say, these things have been understood for a long time. I’m a bit disturbed that more serious alignment people don’t talk about them more. The difficulty of value alignment makes it likely irrelevant for the current discussion, since we very likely are going to rush ahead into, as you put it and I agree,
        the highly dangerous kinds of cognition implied by general intelligence of LLMs.
        The perfect is the enemy of the good. We should mostly quit worrying about the very difficult problem of full value alignment, and start thinking more about how to get good results with much more achievable corrigible or instruction-following AGI.
    - Seth Herd 17 Jun 2024 17:37 UTC
      5 points
      0
      Parent
      Here we go!
      
      I think if you led with this statement, you’d have a lot less unproductive argumentation. It sounds on a vibe level like you’re saying alignment is probably easy in your first statement. If you’re just saying it’s less hard than originally predicted, that sounds a lot more reasonable.
      
      Rationalists have emotions and intuitions, even if we’d rather not. Framing the discussion in terms of its emotional impact matters.
      - Matthew Barnett 17 Jun 2024 20:39 UTC
        5 points
        0
        Parent
        That’s reasonable. I’ll edit the top comment to make this exact clarification.
- Jonas Hallgren 17 Jun 2024 15:03 UTC
  1 point
  0
  Parent
  Often, disagreements boil down to a set of open questions to answer; here’s my best guess at how to decompose your disagreements.
  
  I think that depending on what hypothesis you’re abiding by when it comes to how LLMs will generalise to AGI, you get different answers:
  
  Hypothesis 1: LLMs are enough evidence that AIs will generally be able to follow what humans care about and that they naturally don’t become power-seeking.
  
  Hypothesis 2: AGI will have a sufficiently different architecture than LLMs or will change a lot, so much that current-day LLMs don’t generally give evidence about AGI.
  
  Depending on your beliefs about these two hypotheses, you will have different opinions on this question.
  
  The scenario outlined by Bostrom seems clearly different from the scenario with LLMs, which are actual general systems that do what we want and ~nothing more, rather than doing what we want as part of a strategy to seek power instrumentally. What am I missing here?
  Let’s say that we believe in hypothesis 1 as the base case; what are some reasons why LLMs wouldn’t give evidence about AGI?
  
  1. Intelligence forces reflective coherence.
  This would essentially entail that the more powerful a system we get, the more it will notice internal inconsistencies and change towards maximising (and therefore not following human values).
  
  2. Agentic AI acting in the real world is different from LLMs.
  If we look at an LLM from the perspective of an action-perception loop, it doesn’t generally get any feedback on when it changes the world. Instead, it is an autoencoder, predicting what the world will look like. This may be so that power-seeking only arises in systems that are able to see the consequences of their own actions and how that affects the world.
  
  3. LLMs optimise for good-harted RLHF that seems well but lacks fundamental understanding. Since human value is fragile, it will be difficult to hit the sweet spot when we get to real-world cases and take that into the complexity of the future.
  Personal belief:
  These are all open questions, in my opinion, but I do see how LLMs give evidence about some of these parts. I, for example, believe that language is a very compressed information channel for alignment information, and I don’t really believe that human values are as fragile as we think.
  I’m more scared of 1 and 2 than I’m of 3, but I would still love for us to have ten more years to figure this out as it seems very non-obvious as to what the answers here are.