Akash comments on The case for ensuring that powerful AIs are controlled

Akash 25 Jan 2024 12:19 UTC
LW: 15 AF: 3
8
AF
I’m excited to see how the AI control research direction evolves.
After skimming the post, it seems like it makes a few different claims, some of which seem more intuitive than others. For example:
- Claim 1: Labs should control powerful AI systems (very intuitive, even obvious)
- Claim 2: It is (much?) easier to evaluate control than alignment (somewhat intuitive, particularly if one buys the argument that alignment evals are very hard because you always have the “what if the AI is just scheming and subverting the eval” concern.)
- Claim 3: We (Buck and Ryan) have identified some specific techniques that we believe are promising for control (not intuitive; technical arguments needed).
- Claim 4: If we succeed in controlling “transformatively useful AI”, then we may be able to stop the race toward superintelligence and get AI labs or governments to agree to not build superintelligence until it can be controlled (not intuitive; non-technical arguments needed).
I’d be excited to see more posts that specifically engage with the strongest counterpoints to claims #2-4.
Some more on #2 & #4:
The basic problem with evaluating alignment is that no matter what behaviors you observe, you have to worry that your model is just acting that way in order to make you think that it is aligned.
I think those pessimistic about control evals could say something like “the basic problem with evaluating control is that no matter what techniques your red-team uses, you have to worry that your model is better at finding attacks than your red-team.” Of course, you note in the post some reason why we should expect our red-team to have advantages over models, but also you recognize that this won’t scale toward arbitrarily powerful AIs.
In some ways, this feels analogous to the following situation:
- Alice from the Scaling Department: Look, our current “alignment” techniques are working for GPT-4. So let’s scale up more! I get that at some point we won’t be able to align the AI, but we’re not there yet, so we should be able to keep going.
- Bob from the Preparedness Team: Well, the problem is we don’t know when our alignment techniques will stop working… like, it’s quite plausible that the next AI system will be able to game our evals.
- Alice from the Scaling Department: OK, well let me know when you have clear evidence that our alignment techniques or dangerous capability evals will fail.
Here’s the analogy for control:
- Alice from the Scaling Department: Look, our current control techniques are working for GPT-4. So let’s scale up more! I get that at some point our control evals aren’t going to be able to control arbitrarily powerful AI, but we’re not there yet, so we should be able to keep going.
- Bob from the Preparedness Team: Well, the problem is we don’t know when our control evals will stop working… like, it’s quite plausible that the next AI system will be better at coming up with attacks than our red-team.
- Alice from the Scaling Department: OK, well let me know when you have clear evidence that our control evals will fail.
I’d be curious to hear more about how you’re thinking about this (and apologies if some sections of the post already deal with this– feel free to quote them if I missed them in my initial skim). Specific questions:
1. How do you react to the analogy above? Do you expect that control evals will do a better job than other types of evals at determining when we have reached the point where scaling further is (directly/imminently) dangerous? Or more simply, will it be clear when we need to stop?
2. More broadly, what do you think we should do once we’re starting to see AI systems that are powerful enough to subvert the control evals (or develop attacks that are better than those of the human red teamers)?
3. Suppose control evals do allow us to stop at “transformatively useful systems that do not cause a catastrophe.” Roughly what kinds of capabilities do you expect this system to have? (My vague MIRI sim says something like “either the system is too weak to get us out of the critical risk window or the system is so strong that it’s capable of subverting your control evals”).
- ryan_greenblatt 27 Jan 2024 5:00 UTC
  LW: 15 AF: 7
  9
  AF Parent
  Suppose control evals do allow us to stop at “transformatively useful systems that do not cause a catastrophe.” Roughly what kinds of capabilities do you expect this system to have? (My vague MIRI sim says something like “either the system is too weak to get us out of the critical risk window or the system is so strong that it’s capable of subverting your control evals”).
  
  We’re thinking about systems which are capable of autonomously doing nearly all key research tasks. In particular, speeding up R&D by 30x probaly requires heavily autonomous AIs running experiments, thinking of ideas, writing explanations, and thinking about next steps.
  
  I think ML research in particular can plausibly be accelerated by maybe 30x by only making it extremely fast and cheap to go from high level ideas to implemented experiments (rather than needing to generate these high level ideas)^[1], but this will still require the AIs to really understand the ideas and think about various tradeoffs, etc.
  
  So, the overall composed system will be as broadly capable as a quite good human researcher, though in some cases, we might be heavily using specialization or reasoning patterns which aren’t like typical human approaches. (E.g. using large amounts of search on relatively weak models to come up with ideas.)
  
  We expect this system to be built out of AIs which run considerably faster than humans in effective serial cognitive speed (perhaps 5-100x serially faster than humans depending on the exact AI and the task) and which will also be run massively in parallel. (See here for some more discussion.)
  ↩︎
  In some sense this is just an observation that implementing ML experiments correctly is hard. : )
  - Rohin Shah 27 Jan 2024 9:35 UTC
    LW: 8 AF: 6
    0
    AF Parent
    I think ML research in particular can plausibly be accelerated by maybe 30x by only making it extremely fast and cheap to go from high level ideas to implemented experiments (rather than needing to generate these high level ideas)
    Why doesn’t compute become the bottleneck well before the 30x mark? It seems like the AIs have to be superhuman at something to overcome that bottleneck (rather than just making it fast and cheap to implement experiments). Indeed the AIs make the problem somewhat worse, since you have to spend compute to run the AIs.
    - ryan_greenblatt 27 Jan 2024 17:18 UTC
      LW: 8 AF: 7
      0
      AF Parent
      I guess I’m not that sold that compute will actually be that much of a key bottleneck for experiments in the future in a way that can’t be overcome with 2x additional labor and/or a 2x slow down.
      
      Like in many cases you can spend additional labor to reduce compute usage of experiments. (E.g., first run the experiments on smaller models.) And, we’re conditioning on having really powerful AIs which correlates with a high baseline level of compute and that will help whenever we can run experiments at small scales and then use scaling laws etc. Further, the current scaling laws imply huge inference availablity if huge amounts of compute are used for training.
      
      This might depend on what type of ML research we’re talking about.
      - Rohin Shah 28 Jan 2024 16:36 UTC
        LW: 12 AF: 8
        2
        AF Parent
        I agree it helps to run experiments at small scales first, but I’d be pretty surprised if that helped to the point of enabling a 30x speedup—that means that the AI labor allows you get 30x improvement in compute needed beyond what would be done by default by humans (though the 30x can include e.g. improving utilization, it’s not limited just to making individual experiments take less time).
        I think the most plausible case for your position would be that the compute costs for ML research scale much less than quadratically with the size of the pretrained model, e.g. maybe (1) finetuning starts taking fewer data points as model size increases (sample efficiency improves with model capability), and so finetuning runs become a rounding error on compute, and (2) the vast majority of ML research progress involves nothing more expensive than finetuning runs. (Though in this world you have to wonder why we keep training bigger models instead of just investing solely in better finetuning the current biggest model.)
        Another thing that occurred to me is that latency starts looking like another major bottleneck. Currently it seems feasible to make a paper’s worth of progress in ~6 months. With a 30x speedup, you now have to do that in 6 days. At that scale, introducing additional latency via experiments at small scales is a huge cost.
        (I’m assuming here that the ideas and overall workflow are still managed by human researchers, since your hypothetical said that the AIs are just going from high level ideas to implemented experiments. If you have fully automated AI researchers then they don’t need to optimize latency as hard; they can instead get 30x speedup by having 30x as many researchers working but still producing a paper every 6 months.)
        (Another possibility is that human ML researchers get really good at multi-tasking, and so e.g. they have 5 paper-equivalents at any given time, each of which takes 30 calendar days to complete. But I don’t believe that (most) human ML researchers are that good at multitasking on research ideas, and there isn’t that much time for them to learn.)
        It also seems hard for the human researchers to have ideas good enough to turn into paper-equivalents every 6 days. Also hard for those researchers to keep on top of the literature well enough to be proposing stuff that actually makes progress rather than duplicating existing work they weren’t aware of, even given AI tools that help with understanding the literature.
        Further, the current scaling laws imply huge inference availablity if huge amounts of compute are used for training.
        Tbc the fact that running your automated ML implementers takes compute was a side point; I’d be making the same claims even if running the AIs was magically free.
        Though even at a billion token-equivalents per second it seems plausible to me that your automated ML experiment implementers end up being a significant fraction of that compute. It depends quite significantly on how capable a single forward pass is, e.g. can the AI just generate an entire human-level pull request autoregressively (i.e. producing each token of the PR one at a time, without going back to fix errors) vs does it do similar things as humans (write tests and code, test, debug, eventually submit) vs. does it do way more iteration and error correction than humans (in parallel to avoid crazy high latency), do we use best-of-N sampling or similar tricks to improve quality of generations, etc.
        ryan_greenblatt 28 Jan 2024 17:18 UTC
        LW: 7 AF: 6
        2
        AF Parent
        Overall, this has updated me to some extent and it seems less plausible to me that ML research can achieve 30x speedups while having human researchers do all of the high level ideas. (I think the picture looks importantly better when AIs are removing this bottleneck.)
        
        I think the most plausible case for your position would be that the compute costs for ML research scale much less than quadratically
        
        The situation I was imagining is where most experiments use some combination of:
        
        A relatively small amount of finetuning/inference on the biggest models (including for the actual AI researcher)
        Larger (possibly full) training runs, but at much smaller scale (e.g. GPT-3 level performance models)
        
        Then, we can in total afford ~training dataset sized amounts of finetuning/inference for the biggest models (by the inference availability argument). And GPT-3 performance experiments will be pretty cheap. So assuming our base model looks like GPT-6 with the expected compute requirement and model size, this is a huge amount of possible inference availability.
        
        So, the implicit claim is that compute costs scale much less than quadratically. It’s certainly not obvious ML research can be progressed fast enough with this little compute.
        
        Another possibility is that human ML researchers get really good at multi-tasking
        
        I was certainly imagining at least some amount of multi-tasking (e.g. 4 projects at once each of which runs 8x faster). This doesn’t feel that crazy to me, I already do a moderate amount of multi-tasking.
        
        Currently it seems feasible to make a paper’s worth of progress in ~6 months.
        
        Note that this often involves multiple people working on the same paper. In the AI case, the division of labor might look at least somewhat different. (Though I don’t think this changes the picture very much from what you’re describing because most people now aren’t the “ideas” people.)
        Rohin Shah 28 Jan 2024 18:23 UTC
        LW: 3 AF: 3
        0
        AF Parent
        Cool, that all roughly makes sense to me :)
        I was certainly imagining at least some amount of multi-tasking (e.g. 4 projects at once each of which runs 8x faster). This doesn’t feel that crazy to me, I already do a moderate amount of multi-tasking.
        Multi-tasking where you are responsible for the entire design of the project? (Designing the algorithm, choosing an experimental setting and associated metrics, knowing the related work, interpreting the results of the experiments, figuring out what the next experiment should be, …)
        Suppose today I gave you a device where you put in moderately detailed instructions for experiments, and the device returns the results^[1] with N minutes of latency and infinite throughput. Do you think you can spend 1 working day using this device to produce the same output as 4 copies of yourself working in parallel for a week (and continue to do that for months, after you’ve exhausted low-hanging fruit)?
        … Having written this hypothetical out, I am finding it more plausible than before, at least for small enough N, though it still feels quite hard at e.g. N = 60.
        ^
        The experiments can’t use too much compute. No solving the halting problem.
        ryan_greenblatt 28 Jan 2024 18:38 UTC
        LW: 4 AF: 4
        2
        AF Parent
        Probably yes for realistic values of N?
        
        Assuming the box is pretty smart at understanding instructions (and has an understanding of my typical ontology to the extent that you would get after working with me a few weeks and reading various posts) and the box will ask follow-up questions in cases where the instructions are unclear. (And we can do small diffs with reduced latency like asking the results to be plotted in a different way.)
        
        My main concern is running out of ideas after a while despite copies of myself with more thinking time having more time to generate ideas.
        Rohin Shah 28 Jan 2024 18:44 UTC
        LW: 2 AF: 2
        0
        AF Parent
        Sounds reasonable, though idk what you think realistic values of N are (my wild guess with hardly any thought is 15 minutes − 1 day).
        EDIT: Tbc in the 1 day case I’m imagining that most of the time goes towards running the experiment—it’s more a claim about what experiments we want to run. If we just talk about the time to write the code and launch the experiment I’m thinking of N in the range of 5 minutes to 1 hour.
- ryan_greenblatt 25 Jan 2024 18:02 UTC
  LW: 4 AF: 4
  1
  AF Parent
  I’ll respond to this more later, but first I’d like to clarify a narrow point about when building superintelligence is acceptable.
  
  Claim 4: If we succeed in controlling “transformatively useful AI”, then we may be able to stop the race toward superintelligence and get AI labs or governments to agree to not build superintelligence until it can be controlled (not intuitive; non-technical arguments needed).
  
  I think that purely black-box AI control is very unlikely to scale to superintelligence, but there are other plausible routes to safety (alignment, various approaches which make usage of the internals of models but which aren’t well described as alignment).
  
  So, the condition is more like “when we can be confident in the safety of superintelligence” and there are a variety of routes here. Separately, if we don’t have an absolutely air tight argument for safety, we would ideally delay building superintelligence if it wouldn’t be needed to solve pressing problems. (And indeed, we don’t think that superintelligence to solve a wide range of problems.)
  - Gerald Monroe 25 Jan 2024 20:28 UTC
    3 points
    1
    Parent
    What do you think about marginal superintelligences?
    
    For example, take the task “implement an x86 CPU as gate masks”. Humans can do this task, but no single human can do the task, and humans are forced to inefficiently subdivide the task. For example a “CPU” that did not have distinct internal buses but was some blob of gates, with the registers and cache right in the middle of the logic mess (or with cache lines descending from dies soldered above), would probably outperform all current designs.
    
    This hypothetical mess of a chip design is not something humans can create but it’s a checkable artifact.
    
    Or another task, “construct a new human kidney from cells. All measurable parameters must meet or exceed a reference kidney”. Similar argument—humans can’t quite do this, the complexity of life support during construction is where humans would fail, or human made designs wouldn’t quite work well enough.
    
    But again this is a checkable artifact. You don’t need superintelligence to validate that the output satisfies (or fails to satisfy) the goal.
    
    A marginal superintelligence would be one that is context unaware and gets assigned tasks like this. It doesn’t know when the task is real or not.
- ryan_greenblatt 27 Jan 2024 4:52 UTC
  LW: 2 AF: 2
  1
  AF Parent
  More broadly, what do you think we should do once we’re starting to see AI systems that are powerful enough to subvert the control evals (or develop attacks that are better than those of the human red teamers)?
  
  We think it’s likely that prior to finding that evals are unworkable, we find that control techniques would fail to ensure reasonable levels of safety and usefulness. That is, in order to ensure safety for typical use cases, we have to impose heavy restrictions that heavily restrict usefulness.
  
  See this comment for more discussion of what could fail.
  
  Regardless of what fails first, we can discuss what we should do.
  
  First, you’d notify the key stake holders that control doesn’t suffice for the next AIs. (And ideally produce legible demonstrations of this etc.)
  
  If we’ve already reached a sufficient level of AI output that we’re operating at pretty crazy and near maximal speeds (e.g. the Amdahl’s law style argument here), then we’re in pretty good shape from the perspective of a control plan (from a technical perspective). We’d hope that in this position we can ensure some reasonable delay prior to being forced to build and use AIs in ways that break control. We might be forced to build more powerful AIs in this situation because of other actors building insanely powerful AIs that we might need to engage in active conflict against (and we can’t just stop these AIs from being trained which might be way less risky) and it could also be that internal forces want to irrationally rush to making more powerful AIs.
  
  In the case where we’re not operating at sufficiently high output, the situation is considerably worse. If the AI lab is operating in a sensible way, then the hope would be that you do a careful tradeoff between likely exogenous risk and the risk that it seems like you would be incuring from various usages of the AI^[1]. So if exogenous risk isn’t there, we’d just stop and work on developing better approaches. If we need to proceed without further progress, than we’d just try to make the best safety usefulness tradeoffs that we can afford.
  ↩︎
  Given reasonable guesses at P(scheming) for different AIs and more detailed views on: whether various AIs might scheme but not cooperate with each other and how bad various bad outcomes would be relative to the exogenous threats (e.g. how does an AI escaping from our lab compare in badness to various exogenous risks).
- ryan_greenblatt 27 Jan 2024 4:21 UTC
  LW: 2 AF: 1
  1
  AF Parent
  
  I think those pessimistic about control evals could say something like “the basic problem with evaluating control is that no matter what techniques your red-team uses, you have to worry that your model is better at finding attacks than your red-team.” Of course, you note in the post some reason why we should expect our red-team to have advantages over models, but also you recognize that this won’t scale toward arbitrarily powerful AIs.
  
  A key hope here is that we can use various test to get a general sense for whether AIs can importantly outcompete our red team. See this other comment for more discussion.
  
  So, here’s how I would hope the actual situation with Alice and Bob goes:
  
  Alice from the Scaling Department: Look, our current control techniques are working for GPT-5. So let’s scale up more! I get that at some point our control techniques aren’t going to be able to control arbitrarily powerful AI, but we’re not there yet, so we should be able to keep going.
  
  Bob from the Control Red Team: Indeed, our current control techniques are very likely to be working. Our protocol suffices for controlling current models on the currently approved applications assuming we remain vigilant and maintain the necessary invariants (our design basis).
  
  Part of why we think control evaluations currently work for assessing safety is that we we performed a variety of capability evaluations which indicate that as long as we prevent steganography in some key places, AIs are pretty unlikely to think of better attack plans than we thought of as the red team. This involves looking at how good AIs are at generating strategies in a bunch of different contexts and then doing some analysis which allows for us to transfer this to an overall conservative guess about the AI’s abilities to create good attack plans against our actual protocol (but we can’t necessarily produce the AI’s best guesses at attack plans, we’re utilizing that capability evaluation is easier than elicitation as discussed in footnote 8). We also have rough and conservative tests for the extent to which AIs can sandbag on our current evaluations and that seems to be fine for now.
  
  From our current rough extrapolations, the first place where we’ll start running into issues with control as we build more powerful models is that we’ll need to start imposing heavy restrictions on AI affordances in domains XYZ which will really cut into the usefulness (relative to completely YOLO-ing it). The blue team is hoping that they can finish developing some countermeasures that will handle this problem, though it’s unclear when this will be ready. After that it looks like exploration hacking concerns will force us to be much more conservative in the evaluation in a few places which will probably put a considerable dent in usefulness in a few other domains. Shortly after that, building more powerful models probably won’t be worth it because the costs associated with training and sufficient containment will overwhelm the marginal improvements in usefuless we can get (given our current safety targets); the AI will only be usable in extremely restricted domains like submitting formally verified code or doing narrow mathematics. Of course, our larger scale extrapolations are crude and we’ll have a much better sense of the overall situation once we run our tests on the next few generations of model.
  
  Alice from the Scaling Department: OK, well let me know when these evaluations start indicating that training further models isn’t worth it.
  
  (Ideal) Alice from the Scaling Department: We’re already starting to pivot into making easier to control AI architectures more scalable and performant (easier to restrict and more bottlenecked on natural language reasoning). We’re also working on making inference cheaper at a fixed level of capability so that in some cases we can run dumber models for longer rather than using smarter models. This seems like it will be safer based on our current control evaluations. Of course, we’re worried about the potential overhang due to this algorithmic progress, and we’re pretty confident that competent adversaries couldn’t steal it.