My model of AI causing increasing amounts of trouble in the world, eventually even existential risk for humanity, doesn’t look like a problem which is well addressed by an ‘off switch’. To me, the idea of an ‘off switch’ suggests that there will be a particular group (e.g. an AI company) running a particular set of models on a particular datacenter. Some alarm is triggered, and either the company or its government decides to shut down the company’s datacenter.
I anticipate that the large companies, despite being ahead in AI technology, will also be ahead in AI control, and thus the problems they first exhibit will likely be subtle ones like gradual manipulation of users. At what point would such behavior, if detected, lead to a government response alarmed enough to trigger the ‘off switch’ for that company? I worry that even if such subversive manipulation were detected, the slow nature of the threat would give the company time to issue an apology and announce that it was deploying a fixed version of its model. This seems much more like a difficult-to-regulate grey area than would be, for instance, the model being caught illicitly and independently controlling robots to construct weapons of war. So I do have concerns that in the longer term, if the large companies continue to be unsafe, they will eventually build an AI so smart, capable, and determined to escape that it will succeed. I just don’t expect that to be the first dangerous effect we observe.
In contrast, I expect that the less powerful open weights models are more likely to be the initial cause of catastrophic harms that clearly amount to significant crimes (e.g. financial crimes) or many deaths (e.g. aiding terrorists in acquiring weapons). The models aren’t behind an API which can filter for harmful use, and users can remove any ‘safety inclinations’ which have been trained into the model. Users can fine-tune the model to be an expert in their illegal use case. For such open weights models, there is no way for the governments of the world to monitor them or to have an off-switch. They can be run on the computers of individuals, and having monitors and off-switches for every sufficiently powerful personal computer in the world seems implausible.
Thus, I think the off-switch only addresses a subset of potential harms. I don’t think it’s a bad idea to have, but I also don’t think it should be the main focus of discussion around preventing AI harms.
My expectation is that the greatest dangers we are likely to encounter first (and thus likely to constitute our ‘warning shots’, if we get any) will probably be of one of two types:
A criminal or terrorist actor using a customized open-weights model to allow them to undertake a much more ambitious crime or attack than they could have achieved without the model.
Eager hobbyists pushing models into being self-modifying agents, either with the goal of launching a recursive self-improvement cycle or with the goal of launching a fully independent AI agent onto the internet for some dumb reason. People do dumb things sometimes. People are already trying to do both of these things. The only thing stopping this from being harmful at present is that the open source models are not yet powerful enough to effectively become independent rogue agents or to self-improve.
Certainly the big AI labs will get to the point of being able to do these things first, but I think they will be very careful not to let their expensive models escape onto the internet as rogue agents.
I do expect the large labs to work internally on recursive self-improvement, but I have some hope that they will do so cautiously enough that a sudden larger-than-expected success won’t take them unawares and let the resulting model escape before they can stop it.
So the fact that the open source hobbyist community is actively trying to do these dangerous things, and no one is even discussing regulations to shut this sort of activity down, means that we have a time bomb ticking away with an unknown fuse. How long will it be until the open source technology improves to the point that these independently run AIs cross their ‘criticality point’ and successfully start to make themselves increasingly wealthy / smart / powerful / dangerous?
Another complicating factor is that trying to plan for ‘defense from AI’ is a lot like trying to plan for ‘defense from humans’. Sufficiently advanced general AIs are intelligent agents, just as humans are. I would indeed expect an AI which has gained independence and wealth to hire and/or persuade humans to work for it (perhaps without those humans even realizing that they are working for an AI rather than a remote human boss). Such an AI might very well set up shell companies with humans playing the role of CEOs but secretly following orders from the AI. Similarly, an AI which gets very good at persuasion might be able to manipulate, radicalize, and fund terrorist groups into taking specific violent actions which secretly happen to be arranged to contribute to the AI’s schemes (without the terrorist groups even realizing that their funding and direction come from an AI).
These problems, and others like them, have been forecast by AI safety groups like MIRI. I don’t, however, think that MIRI is well placed to directly solve these problems. I think many of the needed measures are more social and legal than technical. MIRI seems to agree, which is why they’ve pivoted towards mainly trying to communicate the dangers they see to the public and to governments. I think our best hope of tackling these problems is action taken by government organizations under pressure from a concerned public.