While writing this, I was reminded of an older (2017) conversation between Eliezer and Paul on FB. I reread it to see whether Paul seemed like he’d be making the set of mistakes this post is outlining.

It seems like Paul acknowledges the issues here, but his argument is that you can amplify humans without routing through “the hard parts” articulated in this post. That is, it seems like you can use current ML to build something that helps a human effectively “think longer”, and he thinks one can do this without routing through the dangerous plan-searchspace. I don’t know if there’s much counterargument beyond “no, if you’re building an ML system that helps you think longer about anything important, you already need to have solved the hard problem of searching through plan-space for actually helpful plans.”
But here are some comments from that thread that I found helpful to reread.
Eliezer:
I have a new oversimplified straw way to misdescribe the disagreement between myself and Paul Christiano, and I’m interested to hear what Paul thinks of it:
Paul sees “learn this function” or “learn and draw from this probability distribution” as a kind of primitive op that modern machine learning is increasingly good at and can probably give us faithful answers to. He wants to figure out how to compose these primitives into something aligned.
Eliezer thinks that what is inside the black box inexorably kills you when the black box is large enough, like how humans are cognitive daemons of natural selection (the outer optimization process operating on the black box of genes accidentally constructed a (sapient) inner optimization process inside the black box) and this is chicken-and-egg unavoidable whenever the black box is powerful enough to do something like predict complicated human judgments, since in this case the outer optimization was automatically powerful enough to consider and select among multiple hypotheses the size of humans, and the inner process is automatically as powerful as human intelligence.
Paul thinks it may be possible to do something with this anyway given further expenditures of cleverness and computation, like taking a million such black boxes and competing them to produce the best predictions.
Eliezer expects an attempt like this to come with catastrophically exponential sample-complexity costs in theory, and to always fail in practice because you are trying to corral hostile superintelligences which means you’ve already lost. E.g. we can tell stories about how inner predictors could take over AIXI-tl by using bad off-policy predictions to manipulate AIXI-tl into a position where only that predictor (or LDT-symmetrized predictor class) can predict the answer to a one-way-hash problem it set up; and this isn’t an isolated flaw, it faithfully reflects the fact that once you are trying to outwit a hostile superintelligence you’re already screwed. Plus maybe it can do the equivalent of Rowhammering you, or using a bad but “testable” prediction just once that gets somebody to let it out of the box, etcetera. Only it doesn’t do any of those things, it does something cleverer, etcetera. Eliezer thinks that once there’s a hostile superintelligence running anywhere inside the system you are almost surely screwed *in practice*, which means Eliezer thinks Paul never gets to the point of completing one primitive op of the system before the system kills him.
Paul:
I think this is a mostly correct picture of the disagreement. I *would* agree with “what is inside the black box inexorably kills you when the black box is large enough,” if we imagine humans interacting with an arbitrarily large black box. This is a key point of agreement.
I am optimistic anyway because before humanity tries to produce “aligned AI with IQ 200” we can produce “aligned AI with IQ 199.” Indeed, if we train our systems with gradient descent then the intelligence will almost necessarily increase continuously. The goal is to maintain alignment as an inductive invariant, not to abruptly inject it into an extremely powerful system. So the gap between “smartest aligned intelligence we have access to” and “AI we are currently trying to train” is always quite small. This doesn’t make the problem easy, but I do think it’s a central feature of the problem that isn’t well accounted for in your arguments for pessimism.
Buck:
My guess of Eliezer’s reply is:
If we had an IQ 199 aligned AGI, that would indeed be super handy for building the IQ 200 one. But that seems quite unlikely.
Firstly, the black box learner required to build an AI that is aligned at all (e.g. the first step of capability amplification), even if that learned AI is very dumb, must itself be a really powerful learner, powerful enough that it is susceptible to scary internal unaligned optimizers.
Secondly, building an IQ 200 aligned agent via imitation of a sequence of progressively smarter aligned agents seems quite unlikely to be competitive, so without unlikely amounts of coordination someone will just directly build the IQ 200 agent.
Paul:
The first step of capability amplification is a subhuman AI similar in kind to the AI we have today; so if this is someone’s objection then they ought to be able to stick their neck out today (e.g. by saying that we can’t solve the alignment problem for systems we build today, or by saying that systems we can build today definitely won’t be able to participate in amplification).
The AlphaGo Zero example really seems to take much of the wind out of the concerns about feasibility. It’s the most impressive example of RL to date, and it was literally trained as a sequence of increasingly good go players learning to imitate one another.
I think the worst concerns are daemons (which are part of the unreasonably-innocuous-sounding “reliability” in my breakdown) and the impossibility of alignment-preserving amplification. Setting up imitation / making informed oversight work also seems pretty hard, but I think it’s less related to Eliezer’s concerns.
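As a concrete anchor for the AlphaGo Zero analogy above, here is a minimal toy sketch of the amplify-then-distill loop it points at, where each generation is trained to imitate the previous model given extra compute; `base_model`, `amplify`, and `distill` are illustrative placeholders I'm supplying, not code from any actual proposal.

```python
import random

def base_model(question: str) -> str:
    """Stand-in for the initial weak learned model: it just guesses."""
    return random.choice(["yes", "no"])

def amplify(model, question: str, k: int = 5) -> str:
    """The amplified system: the same model given more compute (here, k samples
    and a majority vote, standing in for MCTS or a human consulting the model)."""
    votes = [model(question) for _ in range(k)]
    return max(set(votes), key=votes.count)

def distill(amplified, questions):
    """Train the next model to imitate the amplified system; 'training' here is
    just memoising the amplified answers."""
    table = {q: amplified(q) for q in questions}
    return lambda q: table.get(q, "unknown")

def iterate(model, questions, generations: int = 3):
    """Each generation imitates the previous generation given extra compute."""
    for _ in range(generations):
        current = model
        model = distill(lambda q: amplify(current, q), questions)
    return model

final = iterate(base_model, ["Is 2 + 2 = 4?", "Is the sky green?"])
print(final("Is 2 + 2 = 4?"))
```

In AlphaGo Zero the amplification step is MCTS and the distillation step is gradient descent on the MCTS outputs, but the loop shape is roughly the same.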
Buck:
My Eliezer-model says that systems we can build today definitely won’t be able to participate that helpfully in amplification. Amplification of human tasks seems like an extremely hard problem—our neural nets don’t seem like they’d be up to it without adding in a bunch of features like attention, memory, hierarchical planning systems, and so on. The daemons come in once you start adding in all of that, and if you don’t add in all that, your neural nets aren’t powerful enough to help you.
Skipping ahead a bit, to the next Paul rejoinder that felt relevant:
Paul:
I think the most important general type of amplification is “think longer.” I think breaking a question down into pieces is a relatively easy case for amplification that can probably work with existing models. MCTS is a lot easier to get working than that.
> My Eliezer-model says that systems we can build today definitely won’t be able to participate that helpfully in amplification.
To the extent that this is actually anyone’s view, it would be great if they could be much clearer about what exactly they think can’t be done.
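To make “breaking a question down into pieces” concrete, here is a toy sketch in which a human-supplied decomposition lets a weak answerer handle a question it couldn’t do in one call; `weak_answer`, `decompose`, and `think_longer` are hypothetical helpers for illustration only.

```python
def weak_answer(question: str) -> int:
    """Stand-in for a cheap model call that only handles one operation at a time."""
    a, op, b = question.split()
    return int(a) * int(b) if op == "*" else int(a) + int(b)

def decompose(question: str):
    """Stand-in for the human's contribution: peel off the final '+ n' step,
    e.g. '12 * 7 + 5' -> ('12 * 7', '5')."""
    rest, _op, last = question.rsplit(" ", 2)
    return rest, last

def think_longer(question: str) -> int:
    """Answer a question the weak model can't do in one step by recursing on an
    easier sub-question and combining the partial result."""
    if question.count(" ") == 2:   # already a single operation, e.g. "12 * 7"
        return weak_answer(question)
    rest, last = decompose(question)
    partial = think_longer(rest)   # solve the easier piece first
    return weak_answer(f"{partial} + {last}")

print(think_longer("12 * 7 + 5"))  # -> 89
```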
> The first step of capability amplification is a subhuman AI similar in kind to the AI we have today; so if this is someone’s objection then they ought to be able to stick their neck out today (e.g. by saying that we can’t solve the alignment problem for systems we build today, or by saying that systems we can build today definitely won’t be able to participate in amplification).
It seems non-obvious that the systems we have today can be aligned with human values. They certainly aren’t smart enough to model all of human morality, but they may be able to have some corrigibility properties? This suggests a couple of research directions:
1. Train a model to have corrigibility properties, as an existence proof. This also provides the opportunity to study the architecture of such a model.
2. Develop some theory relating corrigibility properties to the expressiveness of your model.
> Eliezer thinks that what is inside the black box inexorably kills you when the black box is large enough, like how humans are cognitive daemons of natural selection (the outer optimization process operating on the black box of genes accidentally constructed a (sapient) inner optimization process inside the black box) and this is chicken-and-egg unavoidable whenever the black box is powerful enough to do something like predict complicated human judgments, since in this case the outer optimization was automatically powerful enough to consider and select among multiple hypotheses the size of humans, and the inner process is automatically as powerful as human intelligence.
It might be worth pointing out that evolution seems to be doing something different from the oracle in the Original Post.
Evolution: building something piece by piece, testing those pieces (in reality), and then building new things from those tested pieces.
Oracle: wandering the space, adrift from that connection to reality, without the checking throughout.
> I don’t know if there’s much counterargument beyond “no, if you’re building an ML system that helps you think longer about anything important, you already need to have solved the hard problem of searching through plan-space for actually helpful plans.”
This is definitely a problem, but I would further say that human amplification isn’t a solution, because humans aren’t aligned.
I don’t really have a good sense of what human values are, even in an abstract English-definition sense, but I’m pretty confident that “human values” are not, and are not easily transformable from, “a human’s values.”
Though maybe that’s just most of the reason why you’d have to have your amplifier already aligned, and not a separate problem itself.