Summary
I believe that advanced AI systems will likely be aligned with the goals of their human operators, at least in a narrow sense. I’ll give three main reasons for this:
The transition to AI may happen in a way that does not give rise to the alignment problem as it’s usually conceived of.
While work on the alignment problem appears neglected at this point, it’s likely that substantial resources will be devoted to tackling it if and when it becomes apparent that alignment is a serious problem.
Even if the previous two points do not hold, we have already come up with a couple of smart approaches that seem fairly likely to lead to successful alignment.
This argument lends some support to work on non-technical interventions like moral circle expansion or improving AI-related policy, as well as work on special aspects of AI safety like decision theory or worst-case AI safety measures.
Number two is sometimes known as Adams’ Law of Slow-Moving Disasters.
In general, however, it seems like you’re believing something you want to believe and then finding justifications for that belief, because it’s more comfortable to think that things will magically work out. Eliezer wrote at length about this failure mode.
I get the same impression from the AI doomsayers. On priors, I think it is more likely to be true of the AI doomsayers, because a big part of their selling proposition to their donors is that they have some kind of special insight about the difficulty of the alignment problem that the mainstream AI research community lacks. And we all know how corrupting that kind of financial incentive can be.
The real weak point of the AI doomsayer argument is not discussed anywhere in the sequences, but Eliezer does defend it here. The big thing Eliezer seems to believe, which I don’t think any mainstream AI people believe, is that shoving a consequentialist with preferences about the real world into your optimization algorithm is gonna be the key to making it a lot more powerful. I don’t see any reason to believe this; it seems kinda anthropomorphic to be honest, and his point about “greedy local properties” is a pretty weak one IMO. We have algorithms like Bayesian optimization which don’t have these greedy local properties but still don’t have consequentialist objectives in the real world, because their “desires”, “knowledge”, “ontology”, etc. deal only with the loss surface they’re trying to optimize over. It seems weird and implausible that giving the algorithm consequentialist desires involving the “outside world” would somehow be the key to making optimization (and therefore learning) work a lot better.
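To make the Bayesian optimization point concrete, here is a minimal sketch (the objective function, the kernel, and the acquisition rule are all just illustrative assumptions, not anything Eliezer or MIRI has proposed). The optimizer’s entire “world model” is a surrogate of the loss surface it is sampling; nothing in it refers to anything outside that surface.

```python
# Minimal Bayesian-optimization sketch: a global (non-greedy) optimizer whose
# entire "world model" is a surrogate of the loss surface, not the real world.
# Assumes numpy and scikit-learn are installed; the objective f is made up.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def f(x):
    # The "real world" as far as this optimizer is concerned: a 1-D loss surface.
    return np.sin(3 * x) + 0.1 * x ** 2

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(5, 1))            # initial random evaluations
y = f(X).ravel()
grid = np.linspace(-3, 3, 500).reshape(-1, 1)

for _ in range(20):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X, y)                                # surrogate "beliefs" about the surface
    mu, sigma = gp.predict(grid, return_std=True)
    x_next = grid[np.argmin(mu - 2.0 * sigma)]  # lower-confidence-bound acquisition
    X = np.vstack([X, [x_next]])
    y = np.append(y, f(x_next))

print("best x found:", X[np.argmin(y)], "loss:", y.min())
```

It avoids greedy local search by trading off the surrogate’s mean against its uncertainty, yet nothing resembling real-world preferences ever enters the picture.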
(Note: This comment of mine feels uncomfortably like a quasi-ad-hominem attack of the kind that generates more heat than light. I’m only writing it because shminux’s comment also struck me that way, and I’m currently playing tit for tat on this. I don’t endorse writing this sort of comment in general. I don’t necessarily think you should stop donating to MIRI or stop encouraging researchers to be paranoid about AI safety. I’m just annoyed by the dismissiveness I sometimes see towards anyone who doesn’t think about AI safety the same way MIRI does, and I think it’s worth sharing that the more I think & learn about AI and ML, the more wrong MIRI seems.)
From the article you linked:
[emphasis mine]
The piece seems to be about how trying to control AI by dividing power is a bad idea, because then we’re doomed if the AIs ever figure out how to get along with each other.
Why would you put two consequentialists in your system that are optimizing for different sets of consequences? A consequentialist is a high-level component, not a low-level one. Anthropomorphic bias might lead you to believe that a “consequentialist agent” is ontologically fundamental, a conceptual atom which can’t be divided. But this doesn’t really seem to be true from a software perspective.
You can have an AI that isn’t a consequentialist. Many deep learning algorithms are pure discriminators; they are neither very dangerous nor very useful. If I want to make a robot that tidies my room, the simplest conceptual framework for this is a consequentialist with real-world goals. (I could also make a hackish patchwork of heuristics, like evolution would.) If I want the robot to deal with circumstances that I haven’t considered, most hardcoded-rules approaches fail; you need something that behaves like a consequentialist with real-world preferences.
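Here’s a toy sketch of that contrast (the room model, the item names, and the goal test are all invented for illustration). The rule list only handles the messes its author anticipated, while the goal-directed search over predicted outcomes handles a mess it was never told about:

```python
# Toy contrast between a hardcoded-rules tidier and a goal-directed planner.
# The "room" model, actions, and goal test are invented for illustration only.
from itertools import permutations

def tidy_goal(state):
    return len(state["out_of_place"]) == 0       # goal: nothing left lying around

def apply_action(state, item):
    # Action: put one out-of-place item away.
    remaining = set(state["out_of_place"]) - {item}
    return {"out_of_place": frozenset(remaining)}

def hardcoded_tidier(state):
    # Rules written for the messes the author anticipated.
    plan = []
    for item in ["sock", "book"]:                # only knows about socks and books
        if item in state["out_of_place"]:
            plan.append(item)
    return plan

def planning_tidier(state):
    # "Consequentialist" flavour: search for any action sequence whose
    # predicted end state satisfies the goal, whatever the mess looks like.
    items = list(state["out_of_place"])
    for order in permutations(items):
        s = state
        for item in order:
            s = apply_action(s, item)
        if tidy_goal(s):
            return list(order)
    return []

novel_mess = {"out_of_place": frozenset({"sock", "banana_peel"})}
print("rules  :", hardcoded_tidier(novel_mess))   # misses the banana peel
print("planner:", planning_tidier(novel_mess))    # tidies everything
```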
I’m not saying that all AIs will be real-world consequentialists, just that there are many tasks only real-world consequentialists can do. So someone will build one.
Also, they set up the community after they recognized the problem, and they could probably make more money elsewhere. So there don’t seem to be strong incentives to lie.
Nowadays people don’t use hardcoded rules; they use machine learning. Then the problem of AI safety boils down to the problem of doing really good machine learning: having models with high accuracy that generalize well. Once you’ve got a really good model for your preferences, and for what constitutes corrigible behavior, then you can hook it up to an agent if you want it to be able to do a wide range of tasks. (Note: I wouldn’t recommend a “consequentialist” agent, because consequentialism sounds like the system believes the ends justify the means, and that’s not something we want for our first AGI; see corrigibility.)
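As a rough sketch of “hook the learned models up to an agent” (the synthetic labels, the logistic-regression models, the 0.9 threshold, and the feature encoding of candidate actions are all made up for illustration), the agent just picks the candidate with the highest predicted approval among those the corrigibility model is confident about:

```python
# Sketch of wiring a learned preference model and a learned corrigibility model
# into an action selector. All data, models, and thresholds are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Pretend we have human labels: feature vectors of proposed actions -> approve?
rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 4))
y_pref = (X_train[:, 0] + 0.5 * X_train[:, 1] > 0).astype(int)   # "human approves"
y_corr = (np.abs(X_train[:, 2]) < 1.5).astype(int)               # "counts as corrigible"

preference_model = LogisticRegression().fit(X_train, y_pref)
corrigibility_model = LogisticRegression().fit(X_train, y_corr)

def choose_action(candidate_actions):
    # Agent = argmax of predicted approval, restricted to predicted-corrigible actions.
    ok = corrigibility_model.predict_proba(candidate_actions)[:, 1] > 0.9
    if not ok.any():
        return None                                              # refuse / defer to human
    scores = preference_model.predict_proba(candidate_actions)[:, 1]
    scores[~ok] = -np.inf
    return candidate_actions[np.argmax(scores)]

candidates = rng.normal(size=(10, 4))
print("chosen action features:", choose_action(candidates))
```

The design choice this illustrates is that the hard work sits in the two learned models; the agent wrapper itself is almost trivial.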
I’m not accusing them of lying, I think they are communicating their beliefs accurately. “It’s difficult to get a man to understand something, when his salary depends on his not understanding it.” MIRI has a lot invested in the idea that AI safety is a hard problem which must have a difficult solution. So there’s a sense in which the salaries of their employees depend on them not understanding how a simple solution to FAI might work. This is really unfortunate because simple solutions tend to be the most reliable & robust.
Donald Knuth
MIRI started with the opposite assumption. Insofar as I’m pessimistic about them as an organization, this is the main reason why.
Inadequate Equilibria talks about the problem of the chairman of Japan’s central bank, who doesn’t have a financial incentive to help Japan’s economy. Does it change the picture if the chairman of Japan’s central bank could make a lot more money in investment banking? Not really. He still isn’t facing a good set of incentives when he goes into work every day, meaning he is not going to do a good job. He probably cares more about local social incentives than his official goal of helping the Japanese economy. Same for MIRI employees.
That doesn’t sound correct. My understanding is that they’re looking for simple solutions, in the sense that quantum mechanics and general relativity are simple. What they’ve invested a lot in is the idea that it’s hard to even ask the right questions about how AI alignment might work. They’re biased against easy solutions, but they might also be biased in favor of simple solutions.
We value quantum mechanics and relativity because there are specific phenomena which they explain well. If I’m a Newtonian physics advocate, you can point me to solid data my theory doesn’t predict in order to motivate the development of a more sophisticated theory. We were able to advance beyond Newtonian physics because we were able to collect data which disconfirmed the theory. Similarly, if someone suggests a simple approach to FAI, you should offer a precise failure mode, in the sense of a toy agent in a toy environment which clearly exhibits undesired behavior (writing actual code to make the conversation precise and resolve disagreements as necessary), before dismissing it. This is how science advances. If you add complexity to your theory without knowing what the deficiencies of the simple theory are, you probably won’t add the right sort of complexity because your complexity isn’t well motivated.
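Here’s the sort of toy failure mode I have in mind (the gridworld, the breadth-first planner, and the vase are all invented; this is a sketch, not a critique of any particular proposal). The agent’s objective only mentions reaching the goal, so its shortest path goes straight through the vase:

```python
# A toy "precise failure mode": the environment, objective, and agent are made
# up, just to exhibit a concrete undesired behavior. The agent is rewarded only
# for reaching the goal, so it smashes the vase sitting on the shortest path,
# a negative side effect the objective never mentions.
from collections import deque

GRID = [
    "A.V.G",   # A = agent start, V = vase, G = goal, . = empty
    ".....",   # an open detour row the agent could have used
]

def find(ch):
    for r, row in enumerate(GRID):
        if ch in row:
            return (r, row.index(ch))

def shortest_path():
    start, goal = find("A"), find("G")
    # Breadth-first search; the objective is purely "reach G in as few steps
    # as possible" and says nothing about the vase.
    frontier, came_from = deque([start]), {start: None}
    while frontier:
        pos = frontier.popleft()
        if pos == goal:
            break
        r, c = pos
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < len(GRID) and 0 <= nc < len(GRID[0]) and (nr, nc) not in came_from:
                came_from[(nr, nc)] = pos
                frontier.append((nr, nc))
    path, pos = [], goal
    while pos is not None:
        path.append(pos)
        pos = came_from[pos]
    return list(reversed(path))

path = shortest_path()
print("path:", path)
print("vase smashed!" if find("V") in path else "vase intact")
```

The point is that the disagreement becomes concrete: anyone can run this, see the smashed vase, and then argue about exactly what added complexity would fix it.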
Math and algorithms are both made up of thought stuff. So these physics metaphors can only go so far. If I start writing a program to solve some problem, and I choose a really bad set of abstractions, I may get halfway through writing my program and think to myself “geez this is a really hard problem”. The problem may be very difficult to think about given the set of abstractions I’ve chosen, but it could be much easier to think about given a different set of abstractions. It’d be bad for me to get invested in the idea that the problem is hard to think about, because that could cause me to get attached to my current set of inferior abstractions. If you want the simplest solution possible, you should exhort yourself to rethink your abstractions if things get complicated. You should always be using your peripheral vision to watch out for alternative sets of abstractions you could be using.
Adams’ Law of Slow-Moving Disasters only applies when the median individual can understand the problem, and the evidence that it is a problem. We didn’t get nuclear protests or treaties until there was overwhelming evidence, in the form of actual detonations, that nukes were possible. No one was motivated to protest or sign treaties based on abstract physics arguments about what might be possible some day. Action on climate change didn’t start until the evidence became quite clear. The Outer Space Treaty wasn’t signed until 1967, five years after human spaceflight and only two years before the Moon landings.
Human morals are specific and complex (in the formal, high-information sense of the word “complexity”). They also seem hard to define. Either a strict definition of human morality or a good referent to it would count as morality here. Could you have a powerful and useful AI that didn’t have this? It would have to involve some kind of whitelisting or low-impact optimization, as a general optimization over all possible futures is a disaster without morality. Such AIs may be somewhat useful, but not nearly as useful as they would be with fewer constraints.
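For concreteness, here’s a minimal sketch of the whitelisting / low-impact idea (the state features, the candidate actions, and the penalty weight are all invented for illustration): maximize task reward minus a penalty on any change, relative to a baseline, to features outside a whitelist.

```python
# Rough sketch of low-impact / whitelisted optimization. The state encoding,
# candidate actions, and penalty weight are illustrative assumptions only.
BASELINE = {"vase": "intact", "dog": "alive", "room": "messy"}
WHITELIST = {"room"}                       # features the agent is allowed to change

ACTIONS = {
    "tidy_gently":   {"vase": "intact", "dog": "alive", "room": "tidy"},
    "tidy_by_force": {"vase": "broken", "dog": "alive", "room": "tidy"},
    "do_nothing":    {"vase": "intact", "dog": "alive", "room": "messy"},
}

def task_reward(state):
    return 1.0 if state["room"] == "tidy" else 0.0

def impact_penalty(state, weight=10.0):
    # Penalize every non-whitelisted feature that differs from the baseline.
    changed = [k for k in BASELINE if k not in WHITELIST and state[k] != BASELINE[k]]
    return weight * len(changed)

def choose(actions):
    return max(actions, key=lambda a: task_reward(actions[a]) - impact_penalty(actions[a]))

print(choose(ACTIONS))   # picks "tidy_gently": same reward, no off-whitelist impact
```

The constraint does the work that a definition of morality would otherwise have to do, which is also why it costs the agent some usefulness.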
I would make a distinction between math-first AI, like logical induction and AIXI, where we understand the AI before it is built, and code-first AI, like anything produced by an evolutionary algorithm, anything “emergent”, and most deep neural networks, where we build the AI and then see what it does. The former approach has a chance of working; a code-first ASI is almost certain doom.
I would question the phrase “becomes apparent that alignment is a serious problem”; I do not think this is going to happen. Before ASI, we will have the same abstract and technical arguments we have now for why alignment might be a problem. We will have a few more AlphaGo moments, but while some go “wow, AGI near”, others will say “Go isn’t that hard, we are a long way from this sci-fi AGI” or “superintelligence will be friendly by default”. A few more people might switch sides, but we have already had one AlphaGo moment, and that didn’t actually make a lot of difference. There is no giant neon sign flashing “ALIGNMENT NOW!”. See “There’s No Fire Alarm for Artificial General Intelligence”.
Even if we do have a couple of approaches that seem likely to work, it is still difficult to turn a rough approach into a formal technical specification, and then into working code. The code also has to have a reasonable runtime. Then the first team to develop AGI has to be using a math-first approach and implement alignment without serious errors. I admit that there are probably a few disjunctive possibilities I’ve missed, and these events aren’t independent. Conditional on friendly ASI, I would expect a large amount of talent and organizational competence to have been working on AI safety.
I want to mention that this link was also posted to the EA Forum, and I posted a number of comments there.
Typo thread:
moral circle expansionor