(Didn’t consult Quintin on this; I speak for myself)
I flatly deny that our arguments depend on AGI being anything like an LLM. I think the arguments go through in a very wide range of scenarios, basically as long as we’re using some kind of white-box optimization to align them, rather than e.g. carrot-and-stick incentives or prompt engineering. Even if we only relied on prompt engineering, I think we’d be in a better spot than with humans (because we can run many controlled experiments).
A human can harbor a secret desire for years, never acting on it, and their brain won’t necessarily overwrite that desire, even as they think millions of thoughts in the meantime. So evidently, the argument above is inapplicable to human brains.
I’m pretty confused by this claim. Why should we expect the human reward system to overwrite all secret desires? Also how do we know it’s not doing that? Your desires are just causal effects of a bunch of stuff including your reward circuitry.
As a human, I can be sitting in bed, staring into space, and I can think a specific abstruse thought about string theory, and now I’ve figured out something important. If a future AI can do that kind of thing, as I expect, then it’s not so clear that “controlling the AI’s sensory environment” is really all that much control.
This is just generally a pretty weak argument. You don’t seem to be contesting the fact that we have full sensory control over AIs and we don’t have it over humans. It’s just a claim that this doesn’t matter. Maybe this ends up being a brute clash of intuitions, but it seems obvious to me that full sensory control matters a lot, even if the AI is doing a lot of long-running cognition without supervision.
With AI we can choose to cut its reasoning short whenever we want, force it to explain itself in human language, roll it back to a previous state, etc. We just have a lot more control over this ongoing reasoning process for AIs and it’s baffling to me that you seem to think this mostly doesn’t matter.
That sounds nice, but brain-like AGI (like most RL agents) does online learning. So if you run a bunch of experiments, then as soon as the AGI does anything whatsoever (e.g. reads the morning newspaper), your experiments are all invalid (or at least, open to question), because now your AGI is different than it was before
You can just include online learning in your experimentation loop. See what happens when you let the AI online learn for a bit in different environments. I don’t think online learning changes the equation very much. It’s known to be less stable than offline RL, but that instability hurts capabilities as well as alignment, so we’d need a specific argument that it will hurt alignment significantly more than capabilities, in ways that we wouldn’t be able to notice during training and evaluation.
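To make that concrete, here is a minimal sketch of what folding online learning into the evaluation loop could look like. The agent/environment/eval interfaces (`act`, `update`, `run`, and so on) are hypothetical placeholders, not any particular system’s API:

```python
import copy

def evaluate_with_online_learning(base_agent, environments, alignment_evals, steps=1000):
    """Hypothetical sketch: let a copy of the agent learn online in each environment,
    then run alignment evals on the drifted agent, so whatever online learning did to
    it is part of what gets measured rather than something that invalidates the test."""
    results = {}
    for env in environments:
        agent = copy.deepcopy(base_agent)      # fresh copy per environment, so runs stay independent
        for _ in range(steps):
            obs = env.observe()
            action = agent.act(obs)
            reward = env.step(action)
            agent.update(obs, action, reward)  # the online-learning update itself
        results[env.name] = {ev.name: ev.run(agent) for ev in alignment_evals}
    return results
```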
I have no idea how I’m supposed to interpret this sentence [“we are the innate reward system”] for brain-like AGI, such that it makes any sense at all. Actually, I’m not quite sure what it means even for LLMs!
It just means we are directly updating the AI’s neural circuitry with white box optimizers. This will be true across a very wide range of scenarios, including (IIUC) your brain-like AGI scenario.
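A minimal sketch of what “directly updating the AI’s neural circuitry” cashes out to; PyTorch and plain SGD are stand-ins here, and the only point is that the feedback signal edits the weights themselves rather than being routed through the AI’s environment or incentives:

```python
import torch

def white_box_update(model: torch.nn.Module, inputs, targets, loss_fn, lr: float = 1e-4):
    # Compute a loss on the model's behavior, then write the correction straight
    # into its parameters: the optimizer, not the AI, decides what changes.
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p -= lr * p.grad  # direct edit of the circuitry
                p.grad = None
```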
Brains can imitate, but do so in a fundamentally different way from LLM pretraining
I don’t see why any of the differences you listed are relevant for safety.
Relatedly, brains have a distinction between expectations and desires, cleanly baked into the algorithms. I think this is obvious common sense, leaving aside galaxy-brain Free-Energy-Principle takes which try to deny it.
I basically deny this, especially if you’re stipulating that it’s a “clean” distinction. Obviously folk psychology has a fuzzy distinction between beliefs and desires in it, but it’s also well-known both in common sense and among neuroscientists etc. that beliefs and desires get mixed up all the time and there’s not a particularly sharp divide.
What about Section 1?

Our 1% doom number excludes misuse-flavored failure modes, so I considered it out of scope for my response. I think the fact that good humans have been able to keep rogue bad humans more-or-less under control for millennia is strong evidence that good AIs will be able to keep rogue AIs under control, and I think the evidence is pretty mixed on whether the so-called offense-defense balance will be skewed toward offense or defense; I weakly expect defense will be favored, mainly through centralization-of-power effects.
It’s not strong evidence; it’s a big mess, and it seems really difficult to have any kind of confidence in such a fast-changing world. It feels to me like roughly a 50/50 bet. Saying the probability is 1% requires much more work than I’m seeing here, even though I appreciate the work you’ve put in.
On the offense-defense balance, there is no clear winner in the comment sections here, nor here. We’ve already seen one of two roughly equal human civilizations take over the other under the right circumstances (see the story of the conquistadors). And AGI is at least more dangerous than nuclear weapons, and we came pretty close to nuclear war several times. Covid seems to have come from gain-of-function research, and so on.
On fast vs. slow takeoff, it seems to me that fast takeoff breaks a lot of your assumptions, and I would assign much more than a 1% probability to fast takeoff. Even if you embrace the compute-centric framework (which I find conservative), you still get wild numbers, like a double-digit probability of takeoff lasting less than a year. If so, we won’t have time to implement defense strategies.
I don’t think it makes sense to “revert to a uniform prior” over {doom, not doom} here. Uniform priors are pretty stupid in general, because they’re dependent on how you split up the possibility space. So I prefer to stick fairly close to the probabilities I get from induction over human history, which tell me p(doom from unilateral action) << 50%
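A toy illustration of that partition-dependence point (the split into three doom sub-cases below is arbitrary and is only there to show the effect):

$$P(\text{doom}) = \tfrac{1}{2} \quad \text{under a uniform prior over } \{\text{doom},\ \text{not doom}\}$$

$$P(\text{doom}) = \tfrac{3}{4} \quad \text{under a uniform prior over } \{\text{misalignment},\ \text{misuse},\ \text{accident},\ \text{no doom}\}$$

Same state of ignorance, different number, purely because of how the possibility space was carved up.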
I strongly disagree that AGI is “more dangerous” than nukes; I think this equivocates over different meanings of the term “dangerous,” and in general is a pretty unhelpful comparison.
I find foom pretty ludicrous, and I don’t see a reason to privilege the hypothesis much.
From the linked report:
My best guess is that we go from AGI (AI that can perform ~100% of cognitive tasks as well as a human professional) to superintelligence (AI that very significantly surpasses humans at ~100% of cognitive tasks) in less than a year.
I just agree with this (if “significantly” means like 5x or something), but I wouldn’t call it “foom” in the relevant sense. It just seems orthogonal to the whole foom discussion.
I don’t think it makes sense to “revert to a uniform prior” over {doom, not doom} here.
I’m not using a uniform prior; the 50/50 thing is just me expressing my views, all things considered.
I’m using a decomposition of this type (spelled out just below):

Does it want to harm us? Yes, because of misuse, ChaosGPT, wars, psychopaths shooting up schools, etc.
Can it harm us? This is really hard to tell.
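Written out as a rough formula (treating the two questions as successive factors is of course a simplification):

$$P(\text{AI harms us}) \approx P(\text{some AI wants to harm us}) \times P(\text{it succeeds} \mid \text{it wants to})$$

The first factor seems clearly non-negligible for the reasons above; the crux is the second.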
I strongly disagree that AGI is “more dangerous” than nukes; I think this equivocates over different meanings of the term “dangerous,” and in general is a pretty unhelpful comparison.
Okay. Let’s be more precise: “An AGI that has the power to launch nukes is at least more powerful than nukes.” And now, how would an AGI acquire this power? That doesn’t seem that hard in the present world. You can bribe or threaten leaders, use drones to kill a leader during a public visit, and then help someone gain power and become your puppet during the ensuing confusion, à la conquistadors. The game of thrones is complex and brittle; this list of coups is rather long, and the default for a civilization/family reigning over some kingdom is to eventually be overthrown.
I prefer to stick fairly close to the probabilities I get from induction over human history, which tell me p(doom from unilateral action) << 50%
I don’t like the word “doom”. I prefer the expression “irreversibly messed-up future”, inspired by Christiano’s framing (and because of anthropic arguments, it’s meaningless to look at past doom events to compute this probability).
I’m really not sure what the reference class should be here. Yes, you are still alive and human civilization is still here, but:

Napoleon and Hitler are examples of unilateral action that led to international wars.

If you go from unilateral action to multilateral action, and you allow things like collusion, things become easier. And collusion is not that far-fetched; we already see it in Cicero: the AI, playing as France, conspired with Germany to trick England.

As the saying goes: “AI is a wonderful tool for the betterment of humanity; AGI is a potential successor species.” So maybe the reference class is more something like chimps, Neanderthals, or horses. Another reference class could be something like slave rebellions.
I find foom pretty ludicrous, and I don’t see a reason to privilege the hypothesis much.
We don’t need the strict MIRI-like RSI foom to get into trouble. I’m saying that if AI technology doesn’t have time to percolate through the economy, we won’t have time to upgrade our infrastructure and add much more defense than we have today, which seems to be the default.

I disagree; anthropics is pretty normal (https://www.lesswrong.com/posts/uAqs5Q3aGEen3nKeX/anthropics-is-pretty-normal)
I think the fact that good humans have been able to keep rogue bad humans more-or-less under control for millennia is strong evidence that good AIs will be able to keep rogue AIs under control
Why? Like, what law of nature says that this trend should continue?

Game theory
Yes, but available strategies can change for AI vs humans—why assume they will be the same?
Induction from history depends on its interpretation; we have more information than 1111111111 over {bad, not-so-bad}. It just feels like, at this point, the crux between optimists and doomers is not about whether white-box access or the trained mind-space is better, but about how much it all updates you from what prior.