Here’s an admittedly-uncharitable analogy that might still illuminate what I think is lacking from this picture:
Suppose someone in 1947 wrote a paper called “How we could solve nuclear reactor design,” and it focused entirely on remotely detecting nuclear bombs, modeling piles of uranium to predict when they might be nuclear bombs, making modifications to piles of uranium to make them less like nuclear bombs, ultrasound scanning piles of uranium to see if they have bomb-like structures hidden inside, etc.
Then in a subsection, they touch on how to elicit electricity from uranium, and they say that many different designs could work (just boil water and drive a turbine or something), and that people will work out the details later if they don’t blow themselves up with a nuclear bomb. And that the nuclear reactor design solution they mean isn’t about building some maximum-efficiency product, it’s about getting any electricity at all without blowing yourself up.
This paper seems strange because all the stuff people have to figure out later is most of the nuclear reactor design. What’s the fuel composition and geometry? What’s the moderator material and design? How is radiation managed? Pressure contained? Etc.
Or to leave the analogy, it seems like building safe, useful, and super-clever AI has positive content; if you start from the inverse of an AI that does the bad takeover behavior, you still have most of the work left to do. How to initialize the AI’s world model and procedure for picking actions? What to get human feedback on, and how to get that feedback? How to update or modify the AI in response to human feedback? Maybe even how to maintain transparency to other human stakeholders, among other social problems? These decisions are the meat and potatoes of building aligned AI, and they have important consequences even if you’ve ruled out bad takeover behavior.
If you make a nuclear reactor that’s not a bomb, you can still get radiation poisoning, or contaminate the groundwater, or build a big pile of uranium that someone else can use to make a bomb, or just have little impact and get completely eclipsed by the next reactor project.
Similarly, an AI that doesn’t take over can still take actions optimized to achieve some goal humans wouldn’t like, or have a flawed notion of what it is humans want from it (e.g. generalized sycophancy), or have negative social impacts, or have components that someone will use for more-dangerous AI, or just have low impact.
Considering that one of the primary barriers to civilian nuclear power plants was, and remains, nuclear bomb proliferation risk, I’m not sure how telling this analogy is. There’s a big reason that nuclear power plants right now are associated with either avowed nuclear powers, powers that want to be nuclear powers at some point (e.g. France, South Korea, North Korea, Japan, Iran, Pakistan...), or countries closely aligned with said nuclear powers. Or rather, it seems to me that the analogy goes the opposite of how you wanted: if someone had ‘solved’ nuclear reactor design by coming up with a type of nuclear reactor which was provably impossible to abuse for nukes, that would have been a lot more useful for nuclear reactor design than fiddling with details about how exactly to boil water to drive a turbine. If you solve the latter, you have not solved the former at all; if you solve the former, someone will solve the latter. And if you don’t, ‘nuclear power plant’ suddenly becomes a design problem which includes things like ‘resistant to Israeli jet strikes’ or ‘enables manipulation of international inspectors’ or ‘relies on a trustworthy closely-allied superpower rather than an untrustworthy one for support like spent fuel reprocessing and refueling’.
I don’t think Joe is proposing we find an AI design that is impossible to abuse even by malicious humans. The point so far seems to be making sure your own AI is not going to do some specific bad stuff.
If you solve the latter, you have not solved the former at all; if you solve the former, someone will solve the latter.
Insofar as this is true in your extended analogy, I think that’s a reflection of “completely proliferation-proof reactor” being a bad thing to just assume you can solve.
The key point is that, assuming you have avoided the bad takeover behavior, the remaining problems become more like normal scientific or engineering problems, which we have a known track record of solving. In particular, it removes one of the nastiest difficulties in AI safety: that you can’t iterate on the system very much, or at all, because the AI will fight you.
So this makes us much more likely to succeed at the problems you’ve mentioned, conditional on getting controlled AI.
Well, it makes things better. But it doesn’t assure humanity’s success by any means. Basically I agree, but will just redirect you back to my analogy about why the paper “How we could solve nuclear reactor design” is strange.
The paper you describe in your comment would have a lot of its details filled in by default by the capabilities people inside an AI lab; the alignment team would outsource most of the details to the people who want to make the AI go fast.
While I don’t think it would ensure humanity’s success by any means, I do think the alignment field could mostly declare victory and stop working if we knew there were no problems resistant to iterative correction, since other people would solve them for us.