FireStormOOO comments on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

FireStormOOO 17 Jan 2024 19:45 UTC
4 points
3
From an operational perspective, this is eye-opening in terms of how much trust is being placed in the companies that train models, and the degree to which nobody coming in later in the pipeline is going to be able to fully vouch for the behavior of the model, even if they spend time hammering on it. In particular, it seems like it took vastly less effort to sabotage those models than would be required to detect this.
That’s relevant to the models that are getting deployed today. I think the prevailing thinking among those deploying AI models in businesses today is that the supply chain is less vulnerable to quietly slipping malware into an LLM compared to traditional software. That’s not seeming like a safe assumption.
- RogerDearnaley 20 Jan 2024 20:00 UTC
  2 points
  0
  Parent
  There have been quite a few previous papers on backdooring models that have also demonstrated the feasibility of this. So anyone operating under that impression hasn’t been reading the literature.
  - FireStormOOO 20 Jan 2024 20:22 UTC
    1 point
    0
    Parent
    That is a big part of the threat here. Many of the current deployments are many steps removed from anyone reading research papers. E.g. sure, people at MS and OpenAI involved with that roll-out are presumably up on the literature. But the IT director deciding when and how to deploy copilot, what controls need to be in place, etc? Trade publications, blogs, maybe they ask around on Reddit to see what others are doing.