I think this is actually wrong, because synthetic data lets us control what the AI learns and what it values. In particular, we can place honeypots that are practically indistinguishable from the real world, such that if we detect an AI trying to deceive or gain power, the AI almost certainly doesn’t know whether we tested it or whether it’s in the real world.
I agree that this seems true to me, with an important caveat. This is true for responsible actors taking sensible actions to maintain control and security.
So it seems to me we should expect a limited window of safety, if the frontier labs behave with appropriate preparation and caution, followed by the open-source community catching up sufficiently that Bobby McEdgeLord can launch ChaosGPT from his mom’s basement. Taking comfort in that window of safety therefore means planning to use it: creating really persuasive demonstrations of the danger of a bad actor unleashing a harm-directed AI agent, showing those demonstrations to key government officials (e.g. Paul Christiano, as of recently), and hoping that you and they can come up with some way to head off the impending critical danger.
If instead you pat yourself on the back and just go about making money from customers using your API to your Safe-By-Construction Agent… you’ve just delayed the critical failure point, not truly avoided it.
I agree with the claim that if open-source gets ahead of the big labs and is willing to ask their AI systems to do massive harm via misuse, things get troublesome fast even in my optimistic world.
I think the big labs will probably just be far, far ahead of open source by default due to energy and power costs, but yeah misuse is an actual problem.
I think the main way we differ is probably in how offense-biased we expect progress in tech to be, combined with how much we need to optimize for pivotal acts compared to continuous changes.
Oh, and also, to clarify a point: I agree that the big labs will get to AGI well before open-source.
Maybe 1-4 years sooner is my guess. But my point is that, once we get there, we have to use that window carefully. We need to prove to the world that dangerous technology is coming, and we need to start taking preventative measures against catastrophe while we have that window to act.
Also, I expect the rate of bad actors to be very low: well under 1% of people who have the capacity to download and run an open-weights model, just as I don’t suspect most school students of becoming school shooters. It’s just that I see ways for very bad things to come from even small numbers of poorly resourced bad actors equipped with powerful AI.
Yes, I am pretty worried about how offense-dominant current tech is. I’ve been working on AI Biorisk evals, so I’ve had a front-row seat to a lot of scary stuff recently.
I think a defensive-tech catch-up is possible once leading labs have effective AGI R&D teams, but I don’t think the push for this happens by default. I think that’d need to be part of the plan if it were to happen in time, and would likely need significant government buy-in.
I don’t think we need a pivotal act to handle this. Stuff like germ-killing-but-skin-safe far-UV lighting installed in all public spaces, and better wastewater monitoring. Not pivotal acts, just… preparing ahead of time. We need to close the worst of the vulnerabilities to cheap and easy attacks. So this fits more with the idea of continuous change, but it’s important to note that it’s too late to start making the continuous change to a safer world after the critical failure has happened. At that point prevention is off the table; you’d need something dramatic like a pivotal act to counter the critical failure.
Most of these actions are relatively cheap and could be started today, but the relevant government officials don’t believe AI will progress far enough to lower the barrier to biorisk threats, so they don’t treat this as a key issue.
Yeah, in this post I’m mostly focusing on AI misalignment, because if my points about synthetic data being very helpful for alignment, or about alignment generalizing farther than capabilities, end up right as AI grows more powerful, that would be a strong counter-argument to the view that people need to slow down AI progress.
Yeah, we’re in agreement on that point. And for me, this has been an update over the past couple years. I used to think that slowing down the top labs was a great idea. Now, having thought through the likely side-effects of that, and having thought through the implications of being able to control the training data, I have come to a position which agrees with yours on this point.
To make this explicit:
I currently believe (Sept 2024) that the best available route to safety for humanity runs through accelerating the current safety-leading lab, Anthropic, to highly capable tool-AI and/or AGI as fast as possible without degrading its safety culture and efforts.
I think we’re in a lot of danger, as a species, from bad actors deploying self-replicating weapons (bioweapons, computer worms exploiting zero days, nanotech). I think this danger is going to be greatly increased by AI progress, biotech progress, and increased integration of computers into the world economy.
I think our best hope for taking preventative action to head off disaster is to start soon, by convincing important decision makers that the threats are real and their lives are on the line. I think the focus on a pivotal act is incorrect, and we should instead focus on gradual defensive-tech development and deployment.
I worry that we might not get a warning shot which we can survive with civilization intact. The first really bad incident could devastate us in a single blow. So I think that demonstrations of danger are really important.