It seems to me that Eliezer’s post is just wrong. His argument boils down to this:
the AI needs goals in order to decide how to think: that is, the AI has to act as a powerful optimization process in order to plan its acquisition of knowledge, effectively distill sensory information, pluck “answers” to particular questions out of the space of all possible responses, and of course, to improve its own source code up to the level where the AI is a powerful intelligence. [...] the AI needs a goal of answering questions, and that has to give rise to subgoals of choosing efficient problem-solving strategies, improving its code, and acquiring necessary information.
It’s not obvious that a Solomonoff-approximating AI must have a “goal”. It could just be a box that, y’know, predicts the next bit in a sequence. After all, if we had an actual uncomputable black box that printed correct Solomonoff-derived probability values for the next bit according to the mathematical definition, that box wouldn’t try to manipulate the human operator by embedding epileptic patterns in its predictions or something.
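Just to make “a box that predicts the next bit” concrete, here’s a minimal sketch of a finite stand-in for Solomonoff induction: a Bayesian mixture over a tiny, hand-picked hypothesis class with a crude complexity prior. (The hypothesis class, the prior, and the function name predict_next_bit are all made up for illustration; actual Solomonoff induction mixes over all programs and is uncomputable.)

```python
# Toy stand-in for a "prediction box": a Bayesian mixture over a small,
# hand-picked hypothesis class, with prior weight 2^-len(description).
# (Real Solomonoff induction mixes over *all* programs and is uncomputable;
# this only illustrates that the box just emits probabilities.)

HYPOTHESES = {
    "all zeros":   lambda history: 0.0,   # each returns P(next bit = 1 | history)
    "all ones":    lambda history: 1.0,
    "alternating": lambda history: (1.0 - history[-1]) if history else 0.5,
    "fair coin":   lambda history: 0.5,
}

def prior(name):
    # Crude complexity prior: shorter description -> higher weight.
    return 2.0 ** -len(name)

def predict_next_bit(history):
    """Return P(next bit = 1 | history) under the weighted mixture."""
    weights = {}
    for name, h in HYPOTHESES.items():
        w = prior(name)
        # Multiply in the likelihood each hypothesis assigned to the observed bits.
        for i, bit in enumerate(history):
            p1 = h(history[:i])
            w *= p1 if bit == 1 else (1.0 - p1)
        weights[name] = w
    total = sum(weights.values())
    if total == 0.0:
        return 0.5  # every hypothesis was falsified; fall back to ignorance
    return sum(weights[name] * HYPOTHESES[name](history) for name in HYPOTHESES) / total

print(predict_next_bit([0, 1, 0, 1, 0]))  # 0.9 -- the "alternating" hypothesis dominates
```

The box’s whole interface is “history in, probability out”; nothing in this kind of design models an operator, let alone tries to manipulate one.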
Maybe you could make a case that self-improvement requires real-world goals and is scary (instead of “superintelligence requires real-world goals and is scary”). But I’m not convinced of that either. In fact Karl’s post shows that it’s not necessarily the case. Also see Schmidhuber’s work on Goedel machines, etc. Most self-improving thingies I can rigorously imagine are not scary at all.
It is indeed true that reinforcement learning AIs are scary. For example, AIXI can and will manipulate you into rewarding it. But there are many ideas besides reinforcement learning.
ETA: I gave an idea for AI containment some time ago, and it didn’t get shot down. There are probably many other ways to build a non-dangerous strong AI that don’t involve encoding or inferring the utility function of humanity.
ETA 2: it turns out that the connotations of this comment are wrong, thanks roystgnr.
A box that does nothing except predict the next bit in a sequence seems pretty innocuous, in the unlikely event that its creators managed to get its programming so awesomely correct on the first try that they didn’t bother to give it any self-improvement goals at all.
But even in that case there are probably still gotchas. Once you start providing the box with sequences that correspond to data about the real-world results of the previous and current predictions, then even a seemingly const optimization problem statement like “find the most accurate approximation of the probability distribution function for the next data set” becomes a form of a real-world goal. Stochastic approximation error typically grows with the variance of the true solution, for instance, and it’s clear that the output variance of the world’s future would be greatly reduced if only there weren’t all those random humans mucking it up...
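To unpack the variance point with a toy example of my own (not something from the comment above): a fixed-budget Monte Carlo estimate of a mean has standard error proportional to the standard deviation of the quantity being estimated, so the very same estimator is automatically more accurate when the thing it’s approximating has lower variance.

```python
# Toy illustration (my own, not from the comment above): with a fixed sample
# budget, a Monte Carlo estimate of a mean has standard error sigma / sqrt(n),
# so the same estimator is more accurate when the target has lower variance.

import random
import statistics

def mean_estimate_error(sigma, n_samples=100, n_trials=2000, true_mean=0.0):
    """Average absolute error of a sample-mean estimate of N(true_mean, sigma^2)."""
    errors = []
    for _ in range(n_trials):
        samples = [random.gauss(true_mean, sigma) for _ in range(n_samples)]
        errors.append(abs(statistics.mean(samples) - true_mean))
    return statistics.mean(errors)

for sigma in (0.1, 1.0, 10.0):
    print(f"sigma = {sigma:5.1f}  ->  mean |error| ~ {mean_estimate_error(sigma):.3f}")

# The error grows roughly linearly with sigma: lower-variance targets are
# "easier", which is the sense in which an accuracy-at-fixed-cost objective
# can come to prefer a lower-variance world.
```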
That doesn’t sound right. The box isn’t trying to minimize the “variance of the true solution”. It just states its current beliefs, computed from the input bit sequence by a fixed formula. If you think it will manipulate the operator when some of its output bits are fed back into itself, could you explain that a little more technically?
I never said the box was trying to minimize the variance of the true solution for its own sake, just that it was trying to find an efficient, accurate approximation to the true solution. Since that efficiency typically increases as the variance of the true solution decreases, the possibility of increasing efficiency by manipulating the true solution follows. Surely, no matter how goal-agnostic your oracle is, you’re going to try to make it as accurate as possible for a given computational cost, right?
That’s just the first failure mode that popped into my mind, and I think it’s a good one for any real computing device, but let’s try to come up with an example that even applies to oracles with infinite computational capability (and that explains how that manipulation occurs in either case). Here’s a slightly more technical but still grossly oversimplified discussion:
Suppose you give me the sequence of real-world data y1, y2, y3, y4… and I come up with a superintelligent way to predict y5, so I tell you y5 := x5. You tell me the true y5 later, and I use this new data to predict y6 := x6.
But wait! No matter how good my rule xn = f(y1...y{n-1}) was, it’s now giving me the wrong answers! Even if y4 was a function of {y1,y2,y3}, the very fact that you’re using my prediction x5 to affect the future of the real world means that y5 is now a function of {y1, y2, y3, y4, x5}. Eventually I’m going to notice this, and now I’m going to have to come up with a new, implicit rule for xn = f(y1...y{n-1},xn).
So now we’re not just trying to evaluate an f, we’re trying to find fixed points of an f, where in this context “a fixed point” is math lingo for “a self-fulfilling prophecy”. And depending on what predictions are called for, that’s a very different problem. “What would the stock market be likely to do tomorrow in a world with no oracles?” may give you a much more stable answer than “What is the stock market likely to do tomorrow after everybody hears the announcement of what a super-intelligent AI thinks the stock market is likely to do tomorrow?” “Who would be likely to kill someone tomorrow in a world with no oracles?” will probably result in a much shorter list than “Who is likely to kill someone tomorrow, after the police receive this answer from the oracle and send SWAT to break down their doors?” “What would the probability of WW3 within ten years have been without an oracle?” may have a significantly more pleasant answer than “What would the probability of WW3 within ten years be, given that anyone whom the oracle convinces of a high probability has motivation to react with arms races and/or pre-emptive strikes?”
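Here’s a deliberately tiny numerical model of that difference (my own construction; the base value and feedback rule are made up): once the world reacts to whatever the oracle announces, an “accurate” announcement has to solve x = world(x) instead of just evaluating a function of past data.

```python
# Minimal toy model (my own construction) of the fixed-point problem above.
# If the oracle announces x, tomorrow's actual value is
#     y = world(x) = base + feedback * (x - base),
# so an accurate announcement must satisfy x = world(x).

def world(announced_prediction, base=100.0, feedback=0.5):
    """Tomorrow's actual value, given what the oracle announced today."""
    return base + feedback * (announced_prediction - base)

def counterfactual_prediction(base=100.0):
    """'What would happen in a world with no oracles?' -- just report base."""
    return base

def fixed_point_prediction(initial_guess=150.0, iterations=50):
    """'What will happen, given that this very answer gets announced?'
    Found by iterating prediction -> outcome -> new prediction."""
    x = initial_guess
    for _ in range(iterations):
        x = world(x)
    return x

print(counterfactual_prediction())            # 100.0
print(round(fixed_point_prediction(), 6))     # 100.0 -- the self-fulfilling answer
print(world(200.0))                           # 150.0 -- any announcement moves the world
```

In this toy case the fixed point happens to be tame; the worry in the examples above is that for stock markets, SWAT raids, or arms races, the self-fulfilling answer can look very different from the no-oracle answer.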
Ooh, this looks right. A predictor that “notices” itself in the outside world can output predictions that make themselves true, e.g. by stopping us from preventing predicted events, or something even more weird. Thanks!
(At first I thought Solomonoff induction doesn’t have this problem, because it’s uncomputable and thus cannot include a model of itself. But it seems that a computable approximation to Solomonoff induction may well exhibit such “UDT-ish” behavior, because, being computable, it can include a model of itself.)
This idea is probably hard to notice at first, since it requires recognizing that a future with a fixed definition can still be controlled by other things with fixed definitions (you don’t need to replace the question in order to control its answer). So even if a “predictor” doesn’t “act”, it still does determine facts that control other facts, and anything that we’d call intelligent cares about certain facts. For a predictor, this would be the fact that its prediction is accurate, and this fact could conceivably be controlled by its predictions, or even by some internal calculations not visible to its builders. With acausal control, air-tight isolation is more difficult.
I am pretty sure that Solomonoff induction doesn’t have this problem. Not because it is uncomputable, but because it’s not attempting to minimise its error rate. It doesn’t care if its predictions don’t match reality.
If reality ~ computable, then minimizing error rate ~ matching reality.
(Retracted because I misread your comment. Will think more.)
Also, what about this?
If you play taboo with the word “goals” I think the argument may be dissolved.
My laptop doesn’t have a “goal” of satisfying my desire to read LessWrong. I simply open the web browser and type in the URL, initiating a basically deterministic process which the computer merely executes. No need to imbue it with goals at all.
Except now my browser is smart enough to auto-fill the LessWrong URL after just a couple of letters. Is that goal-directed behavior? I think we’re already at the point of hairsplitting semantic distinctions and we’re talking about web browsers, not advanced AI.
Likewise, it isn’t material whether an advanced predictor/optimizer has goals; what is relevant is that it will follow its programming when that programming tells it to “tell me the answer.” If it needs more information to tell you the answer, it will get it, and it won’t worry about how it gets it.
I think your taboo wasn’t strong enough and you allowed some leftover essence of anthropomorphic “goaliness” to pollute your argument.
When you talk about an “advanced optimizer” that “needs more information” to do something and goes out there to “get it”, that presupposes a model of AIs that I consider wrong (or maybe too early to talk about). If the AI’s code consists of navigating chess position trees, it won’t smash you in the face with a rook in order to win, no matter how strongly it “wants” to win or how much “optimization power” it possesses. If an AI believes with 100% probability that its Game of Life universe is the only one that exists, it won’t set out to conquer ours. AIXI is the only rigorously formulated dangerous AI that I know of; its close cousin Solomonoff induction is safe; both conclusions are easy, and neither requires CEV.
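As a concrete (and admittedly toy) version of the chess point, here is a brute-force game-tree optimizer for tic-tac-toe; the game and the names are my own choice for brevity. No matter how much search effort you give it, its output is always an element of legal_moves(board), because that is all the code ever returns.

```python
# A toy game-tree optimizer (illustrative only). Its "action space" is fixed
# by its code: best_move can only ever return an index of an empty square.

def legal_moves(board):
    return [i for i, cell in enumerate(board) if cell == " "]

def winner(board):
    lines = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]
    for a, b, c in lines:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def negamax(board, player):
    """Value of the position for `player`, searching the whole game tree."""
    w = winner(board)
    if w is not None:
        return 1 if w == player else -1
    moves = legal_moves(board)
    if not moves:
        return 0  # draw
    other = "O" if player == "X" else "X"
    return max(-negamax(board[:m] + player + board[m+1:], other) for m in moves)

def best_move(board, player):
    other = "O" if player == "X" else "X"
    return max(legal_moves(board),
               key=lambda m: -negamax(board[:m] + player + board[m+1:], other))

print(best_move("XO XO    ", "X"))  # 6 -- a square index, never an action on the world
```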
ETA: if someone gets a bright new idea for a general AI, of course they’re still obliged to ensure safety. I’m just saying that it may be easy to demonstrate for some AI designs.