I think this is an important piece of the strategic solution. The lack of control options for open-weight models is a big part of why I think there is a much lower capability threshold for catastrophic danger from an open-weight model.
I basically agree with you about transformatively powerful open-weight models making the situation much more dangerous such that control is unlikely to suffice.
But, I think there are important control-related options for open-weight models which do improve the situation to some extent.
In the open-weight regime, we realistically won’t be able to rule out AIs doing a small to moderate number of arbitrarily bad actions. (Either based on the aims of the humans running the AI or the aims of a scheming AI itself.) So, if the offense-defense balance is such that this is catastrophic, we’re absolutely in trouble. (E.g., if transformatively powerful AIs can pretty easily create catastrophically dangerous bioweapons without using much compute or needing very skilled human labor, we’re in big trouble unless we can somehow avoid this by restricting access to labs or something.)
But, we can still ensure:
If most compute is controlled by a smaller number of reasonable actors and these actors implement control (or other sufficient strategies), then due to compute limitations, AIs won’t be able to do an “arbitrary” number of bad actions.
Compute limitations might suffice for preventing rogue AIs or rogue humans from rapidly causing an intelligence explosion even if it would otherwise be possible. Thus, we might be able to delay substantially and use this time to improve the situation.
It might also suffice for avoiding the worst parts of AI-enabled cyberattacks, which might be heavily scale-dependent and not overall offense-dominant.
We can control the AIs used by this smaller number of actors, which could allow those AIs to be used in otherwise dangerous applications.
We can try to distribute easy-to-use control approaches widely and possibly we can pass legal restrictions related to these approaches.
We can try to aggregate information to figure out if the AI is scheming, and if we learn that it is, we can try to take some countermeasures. To be clear, even if the AI isn’t scheming, it’s plausible that offense-defense issues (given the lack of time for adaptation) suffice to make this situation unrecoverable.
See also the strategy-stealing assumption and “What does it take to defend the world against out-of-control AGIs?”
Overall, my guess is that if the AI is open source, scheming, and transformatively powerful (using the sort of power threshold defined in this post), and our overall alignment-related technology looks like it does today, then the probability of human loss of control or >1 billion dead is about 80%.
If we have a magic solution to misalignment which makes it easy for anyone to avoid egregious misalignment from their perspective and which scales indefinitely (but such an AI is still open source), then my probability on the bad outcome condition above is more like 20%.
My probabilities are somewhere in between, depending on the exact amount of technical solution and the default level of misalignment. (And various responses from governments also affect this, etc.)
I agree with pretty much all of this and appreciate your clear framing of the issues at hand.
It seems like where our concerns differ is around these two issues:
I believe that the offense-defense balance for AI-enabled biorisk is such that a bad actor with an open-weight fine-tuned model could kill billions of people with less than 100k.
I don’t think that the fine-tuning or inference would require more than a single server with 8x GPUs (potentially even just 8x 4090s). So unless the compute regulations are monitoring individual 4090 GPUs, you aren’t blocking inference or fine-tuning.
Training takes a bunch of servers (often millions or billions of dollars worth of hardware), and thus seems more plausible to monitor.
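For intuition on the hardware claim, here’s a rough back-of-the-envelope sketch. The model size and memory overheads are illustrative assumptions, not claims about any specific model, but they show why one 8x RTX 4090 box plausibly suffices for inference and parameter-efficient fine-tuning of a mid-sized open-weight model:

```python
# Rough VRAM arithmetic for a hypothetical ~70B-parameter open-weight model
# on a single server with 8x RTX 4090 (24 GB each). Order-of-magnitude
# assumptions only; real memory use depends on context length, batch size, etc.

GPUS = 8
VRAM_PER_GPU_GB = 24
total_vram_gb = GPUS * VRAM_PER_GPU_GB        # 192 GB across the box

params_billion = 70                           # assumed model size

# Weights alone at different precisions.
fp16_weights_gb = params_billion * 2          # ~140 GB (2 bytes/param)
int4_weights_gb = params_billion * 0.5        # ~35 GB (0.5 bytes/param)

# Parameter-efficient fine-tuning (QLoRA-style): 4-bit base weights plus small
# adapters, optimizer state, and activations; assume ~1.5x the quantized weights.
qlora_finetune_gb = int4_weights_gb * 1.5     # ~52 GB

print(f"total VRAM:            {total_vram_gb} GB")
print(f"fp16 inference:        ~{fp16_weights_gb:.0f} GB (fits, with some room for KV cache)")
print(f"4-bit inference:       ~{int4_weights_gb:.0f} GB (fits easily)")
print(f"QLoRA-style fine-tune: ~{qlora_finetune_gb:.0f} GB (fits)")
```

Under these assumptions, both inference and LoRA-style fine-tuning of a mid-sized model fit on a single consumer box, which is consistent with the point that regulators would have to monitor at the level of individual consumer GPUs to block this.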
I haven’t heard any concrete proposals for compute monitoring at the level of 8x 4090 GPUs, have you?
I think compute monitoring for literally 8x 4090 is very likely to be way too hard against a reasonably committed adversary. (The hardware is already way too easy to access and broadly distributed. Also, I think if this causes problems, you might also have issues with people’s MacBooks, which is super costly.)
My overall guess is that for realistically sized models (0.1-100 trillion parameters), we can’t prevent an ML-competent adversary from doing a small amount of fine-tuning and inference.
But, maybe there is some hope for models which are more like 100 trillion parameters? (Minimally, I think 8x 4090 isn’t going to work well for models of this size.)
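To put rough numbers on that (the same illustrative arithmetic as above, just counting the memory to hold 4-bit quantized weights and ignoring offloading tricks that trade speed for capacity):

```python
# Memory needed just to hold 4-bit quantized weights, compared against the
# ~192 GB of VRAM on an 8x RTX 4090 box. Illustrative only; ignores KV cache,
# activations, and CPU/NVMe offload.

VRAM_8X_4090_GB = 8 * 24  # 192 GB

for params_trillion in (0.1, 1, 10, 100):
    weights_gb = params_trillion * 1e12 * 0.5 / 1e9  # 0.5 bytes per parameter
    verdict = "fits" if weights_gb <= VRAM_8X_4090_GB else "does not fit"
    print(f"{params_trillion:>5} T params: ~{weights_gb:>9,.0f} GB of weights -> {verdict}")
```

So under these assumptions the low end of the 0.1-100T range runs on a consumer box, while anything approaching 100 trillion parameters needs tens of terabytes of memory, i.e. datacenter-scale hardware that is much more plausible to monitor.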
Yeah, the question of where the threshold of dangerous capabilities lies seems very important to whether we can hope to make compute governance part of restricting bad actors from using fine-tuning and inference to do bad things.
The reason this is important to try to predict is that we really don’t want to release dangerously capable open-source models which bad actors could misuse. Once those models have been released, there’s no taking them back. So if there’s a risk that a given model would be dangerous if released openly, giving bad actors a private copy to fine-tune, then the intervention point is to disallow the release of the model.
I have been interpreting control to mean “when you use the model for some task, use strategies that will cause the final output to be correct even if you suspect the model may ‘want’ to sabotage it”. Alignment would be making the model “want” to help.
In real life we can “control” prisoners who we know most likely “want” to escape, but since prisoners don’t have the tools to defeat the various defenses, this usually doesn’t happen. Also, the worst case (the prisoner escapes and commits several murders before being stopped) is an acceptable loss to a civilization so long as pEscape is low.
Suppose an escaped prisoner wants to work for you. Can you “control” them and get useful work out of them? Yes. It depends on the task and it depends on what tools you have available, but this is generally possible.
For AI specifically this strategy has been dismissed, I think because many of the people doing the dismissing were imagining somewhat generic future AI and not considering the specific details. For example, one of the strongest methods for control is to carefully filter inputs, ideally removing all unnecessary bits (what Ryan talks about when he writes “paraphrase”, though I bet we can do much better than that), so the model doesn’t know that now is the time to betray. You can also erase the model’s memory after each task and keep each task as short as possible. (A few hours of wall time total, with the model not thinking between inputs.)
This specific method is not something you can do with humans, and if you imagine AGIs as being like a human brain but smarter, then this method won’t be considered.
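Here is a minimal sketch of what that kind of control wrapper could look like. Everything in it is hypothetical: `paraphrase_with_trusted_model` and `run_untrusted_model` are placeholders for whatever trusted paraphraser and untrusted model you actually have, and the policy (paraphrase inputs, no memory carried between tasks, bounded episode length) is just one way to instantiate the idea above.

```python
import time

def paraphrase_with_trusted_model(text: str) -> str:
    """Placeholder for a weaker, trusted model that rewrites the input,
    stripping incidental details the untrusted model could use as a signal
    that 'now is the time to betray'."""
    return text  # stub

def run_untrusted_model(prompt: str) -> str:
    """Placeholder for a call to the powerful, untrusted model with a fresh
    context (no long-term memory store)."""
    return "model output"  # stub

MAX_EPISODE_SECONDS = 3 * 60 * 60  # keep each task to a few hours of wall time

def run_controlled_task(task_description: str) -> str:
    """Run one task under the control measures described above: paraphrased
    inputs, no memory across tasks, and a bounded episode length."""
    start = time.monotonic()

    # 1. Filter/paraphrase the input, removing unnecessary bits.
    sanitized = paraphrase_with_trusted_model(task_description)

    # 2. Run the untrusted model statelessly (no context from prior tasks).
    output = run_untrusted_model(sanitized)

    # 3. Check the episode length bound (a real system would also enforce
    #    this during generation, not just afterwards).
    if time.monotonic() - start > MAX_EPISODE_SECONDS:
        raise TimeoutError("episode exceeded the allowed wall-clock budget")

    # 4. Nothing is persisted; the model's "memory" ends with the task.
    return output
```

The paraphrasing and the statelessness presumably do most of the work here; the episode bound mainly limits how much within-task planning can accumulate.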
Theoretically, if most tasks that require the model to have authority are done using “controlled” models, then it’s acceptable if open models exist, so long as they are unable to send messages that will be perceived by, and used to negotiate with, other AIs under human “control”.
If all the weapons, infrastructure, etc. are in the hands of humans directly or models they “control”, open models will be limited to whatever resources whoever runs the AGI is stupid enough to give them. (Running a potentially sabotaged AI downloaded from a torrent sounds like a Darwin Award.)
Darwin awards do happen. Lots of different humans means some are dumb sometimes.
Absolutely. The main thing is that one guy in his garage who hooks up an open-source AGI to some hand-built robots that start assembling more of themselves is a totally different scale of threat than a multi-square-kilometer (or cubic-kilometer) optimized factory that is making weapons.
That’s a scale difference of billions of times. And it points to a way to control AI: so long as the humans have a very large resource advantage (ideally billions of times) that they achieved with the help of “controlled” AI models, the humans can defend themselves against superintelligence.
But you won’t get a billion-fold advantage with any AI pauses; you need a crash program to accelerate AI development. You won’t get there without AI either: we don’t have enough living humans, and besides, these large factories won’t have any concessions to humans. No handrails, probably no breathable atmosphere inside, etc.