In Buck’s transcript of his talk on Cruxes for working on AI safety there’s an example of a bad argument for why we should be worried about people building spaceships without enough food on them.
Imagine if I said to you: “One day humans are going to try and take humans to Mars. And it turns out that most designs of a spaceship to Mars don’t have enough food on them for humans to not starve over the course of their three-month-long trip to Mars. We need to work on this problem. We need to work on the problem of making sure that when people build spaceships to Mars they have enough food in them for the people who are in the spaceships.”
It’s implied that you could turn this into an argument for why people are not going to build unaligned AGI. I parse it as an analogical argument. People would not build spaceships that don’t have enough food on them because they don’t want people to die. Analogously, people will not build unaligned AGI because they don’t want people to die.
So what are the disanalogies? One is that it is harder to tell whether an AGI is aligned than whether a spaceship has enough food on it. I don’t think this can do much of the work, because then people would just spend more effort on telling whether spaceships have enough food on them, or not build them. Similarly, if this were the only problem, people would just put more effort into determining whether an AGI is aligned before turning it on, or they would not build one until it got cheaper to tell. A related disanalogy is that there is more agreement about which spaceship designs have enough food, and how to tell, than there is about which AGI designs are aligned and how to tell.
Another disanalogy is that everybody knows that if you design a spaceship without enough food on it and send people to Mars with it, then those people will die. Not very many people know that if you design an AGI that is not aligned and turn it on, people will likely die (or some other bad outcome will happen).
Are these disanalogies enough to make the argument not go through? Are there other important disanalogies? If you don’t find the analogical argument convincing, how come?
The traditional arguments for why AGI could go wrong imply that AGI could go wrong even if you put an immense amount of effort into trying to patch errors. In machine learning, when we validate our models, we will ideally do so in an environment that we think matches the real world, but it’s common for the real world to turn out to be subtly different. In the extreme case, you could perform comprehensive testing and verification and still fail to properly assess the real world impact.
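To make the validation point concrete, here is a minimal toy sketch (my own illustration, not something from Buck’s talk): a classifier that looks fine on a validation set drawn from the same distribution as its training data, then degrades badly when a spurious correlation it relied on flips at deployment time. The features, the flip, and all the numbers are assumptions chosen purely for illustration.

```python
# Toy sketch: validation that matches training can look fine while the
# "real world" differs in a subtle way the tests never exercised.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_data(n, spurious_sign):
    y = rng.integers(0, 2, size=n)
    true_feature = y + rng.normal(0, 1.0, size=n)        # weakly predictive, stable
    spurious = spurious_sign * (2 * y - 1) + rng.normal(0, 0.3, size=n)  # strongly predictive, unstable
    return np.column_stack([true_feature, spurious]), y

X_train, y_train = make_data(5000, spurious_sign=+1)
X_val, y_val = make_data(2000, spurious_sign=+1)      # validation drawn like training
X_dep, y_dep = make_data(2000, spurious_sign=-1)      # deployment: the correlation flips

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))  # looks great
print("deployment accuracy:", accuracy_score(y_dep, model.predict(X_dep)))  # much worse
```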
If the cost of properly ensuring safety is arbitrarily high, there is a point at which people will begin deploying unsafe systems. This is inevitable, unless you could somehow either ban computer hardware or stop AI research insights from proliferating.
I talked about this issue with Buck in the comments (my comment, Buck’s answer).
What I pointed out was that the spaceship example had very specific features:
Both personal and economic incentives push against causing the problem.
The problem is obvious when one is confronted with the situation.
At the point where the problem becomes obvious, you can still solve it.
My intuition is that the main disanalogies with the AGI case are the first one (at least the economic incentives, which might push people to try dangerous things when the potential returns are great) and the last one, depending on your position on takeoffs.
One big difference is that “having enough food” admits a value function (“quantity of food”) that is both well understood and, for the most part, smooth and continuous over the design space, given today’s design methodology (if we try to design a ship with a particular amount of food and make a tiny mistake, it’s unlikely that the quantity of food will change much). In contrast, the “how well is it aligned” metric is very poorly understood (at least compared with “amount of food on a spaceship”) and a lot more discontinuous (with today’s techniques for designing AIs, a tiny error in alignment is almost certain to cause catastrophic failure). Basically: we do not know exactly what it means to get it right; even if we knew, we do not know what the acceptable error tolerances are; and even if we knew that, we do not know how to meet them. None of that applies to the amount of food on a spaceship.
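As a caricature of that smoothness contrast, here is a toy numeric sketch (my own, with made-up numbers): a “quantity of food” metric that barely moves under a tiny design perturbation, next to a deliberately brittle pass/fail check standing in for “is it aligned”, which flips under the same perturbation. The 10-parameter design, the 1e-3 tolerance, and the threshold check are all invented for illustration.

```python
# Toy contrast: a smooth metric vs. a threshold-like one over the same design space.
import numpy as np

rng = np.random.default_rng(1)
intended_design = np.ones(10)  # hypothetical 10-parameter "design"

def food_kg(design):
    # Smooth: total food scales roughly linearly with the design parameters.
    return 100.0 * design.sum() / design.size

def alignment_ok(design):
    # Discontinuous (assumed): any parameter off by more than a tiny tolerance
    # counts as catastrophic failure, mimicking "a tiny error causes disaster".
    return bool(np.all(np.abs(design - intended_design) < 1e-3))

for error_size in [0.0, 1e-4, 1e-2]:
    perturbed = intended_design + rng.normal(0, error_size, size=intended_design.size)
    print(f"error {error_size:g}: food = {food_kg(perturbed):.2f} kg, "
          f"aligned = {alignment_ok(perturbed)}")
```

Under this toy model the food metric stays near 100 kg for every perturbation, while the pass/fail check flips as soon as the error exceeds the tolerance, which is the shape of the asymmetry I have in mind.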