Yeah, I think something like that would probably work for 1B, but 1B is the easy part. It’s 1C & 1D that are keeping me up at night...
Can you be crisper about why you think 1C & 1D are necessary?
Well, strategy 1 is “Keep it from thinking that it’s in an interactive environment”. Things like “don’t adjust the weights of the network while we ask questions” are a way to prevent it from thinking that it’s in an interactive environment based on first-hand experience: we’re engineering the experience to not leave traces in its knowledge. But to succeed in strategy 1, we also need to make sure that it doesn’t come to believe it’s in an interactive environment by other means besides first-hand experience, namely by abstract reasoning. More details in this comment, but basically an AGI with introspective information and world-knowledge will naturally, over time, figure out that it’s an AGI, figure out the sorts of environments that AGIs are typically in, and thus hypothesize the existence of interactions even if those interactions have never happened before and were not intended by the designer (e.g. the “Help I’m trapped in a GPU!” type interactions).
Hm, I think we’re talking past each other a bit. What I was trying to get at was: When we’re doing self-supervised learning, we’re optimizing an objective function related to the quality of the system’s internal knowledge representations. My suggestion was that this internal objective function should have a term for the accuracy with which the system is able to predict masked bits of existing knowledge, but not a term for the accuracy of hypothesized future predictions a la beam search. Then we can use the system interactively as follows:
1. Give it some data.
2. Do self-supervised learning on the data, optimizing the quality of internal knowledge representations with a “short-sighted” objective function like I described.
3. Use these knowledge representations to make predictions of interest.
4. Repeat as needed.
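To make the loop above concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than anything proposed in this thread: the `MaskedPredictor` class, the bigram-count "model", and the `interaction_loop` helper are hypothetical stand-ins. The only point is the shape of the loop: training scores predictions of existing data only, and answering a query never updates the model.

```python
from collections import Counter, defaultdict


class MaskedPredictor:
    """Toy stand-in for a self-supervised model (hypothetical, for illustration).

    It predicts a hidden token from the token to its left, using counts.
    """

    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, tokens):
        # "Short-sighted" objective: only existing data is scored.
        # No term rewards hypothesized future outputs (no beam search).
        for left, target in zip(tokens, tokens[1:]):
            self.counts[left][target] += 1

    def predict(self, left):
        # Answering a query reads the counts but never updates them.
        options = self.counts.get(left)
        if not options:
            return None
        return options.most_common(1)[0][0]


def interaction_loop(model, batches, queries):
    """Steps 1-4: give data, train, predict, repeat."""
    answers = []
    for data, query in zip(batches, queries):
        model.train(data)                     # steps 1 and 2
        answers.append(model.predict(query))  # step 3
    return answers                            # step 4 is the loop itself


model = MaskedPredictor()
answers = interaction_loop(
    model,
    batches=[["the", "cat", "sat"], ["the", "cat", "ran", "the", "cat", "ran"]],
    queries=["the", "cat"],
)
```

Note that nothing in this toy prevents the training data in a later round from reflecting the effects of earlier answers; that is exactly the channel the question about self-knowledge is asking about.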
What I’m looking for is a crisp description of why accurate self-knowledge (including knowledge of the interaction loop) is dangerous in this framework.
OK, hmm, let me try again then. This would be the section of the post entitled “A self-supervised learning algorithm in an interactive environment can become a manipulative goal-seeker”.
I’ve been assuming all along that the objective function only rewards the next word. Unfortunately, it seems that the way to achieve this objective in practice is to search for higher-level longer-term contexts that surround the next word, like when we’re watching TV and we think, “A commercial break is starting.” Knowing that a commercial break is starting is essential for predicting the very next frame on the TV screen, but it is also incidentally an (implicit) prediction about what will appear on the screen for the next few minutes. In other words, you could say that making accurate (possibly implicit) probabilistic predictions about the next many words is instrumentally useful for making accurate probabilistic predictions about the next one word, and is thus rewarded by the objective function. I expect that systems that work well will have to be designed this way (i.e. finding “contexts” that entail implicit predictions about many future words, as a step towards picking the single next word). I think this kind of thing is necessary to implement even very basic things like object permanence.
Then the next step is to suppose that the system (being highly intelligent) comes to believe that the prediction X will cause other aspects of the longer-term context to be Y. (See the “Hypothesis 1” vs “Hypothesis 2” examples in the post.) If the system was previously thinking that P(X) is high and P(Y) is low, then ideally, the realization that X implies Y will cause the system to raise P(Y), while keeping P(X) at its previous value. This is, after all, the logically correct update, based on the direction of causality!
But if the system screws up, and lowers P(X) instead of raising P(Y), then it will make a manipulative prediction—the output is being chosen partially for its downstream interactive effects. (Not all manipulative predictions are dangerous, and there might be limits to how strongly it optimizes its outputs for their downstream effects, but I suspect that this particular case can indeed lead to catastrophic outcomes, just like we generically expect from AIs with real-world human-misaligned goals.)
Why should the system screw up this way? Just because the system’s causal models will sometimes have mistakes, and sometimes have uncertainties or blank spaces (statistical-regularities-of-unknown-cause), and also because humans make this type of mistake all the time (“One man’s modus ponens is another man’s modus tollens”). I suspect it will make the right update more often than chance; I just don’t see how we can guarantee that it will never make the wrong update in the manipulative Y→X direction.
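As a small worked illustration of the two update directions: if X implies Y, then consistency requires P(Y) ≥ P(X), so beliefs like P(X) = 0.9 and P(Y) = 0.1 have to change somewhere. The sketch below is only an assumption-laden toy; the `reconcile` function, its `direction` argument, and the "move one number just far enough" rule are hypothetical and not a description of any real learning algorithm.

```python
def reconcile(p_x, p_y, direction):
    """Restore consistency with 'X implies Y', i.e. P(Y) >= P(X).

    direction="forward":  keep P(X), raise P(Y)  (follows the causal arrow)
    direction="backward": keep P(Y), lower P(X)  (the manipulative mistake)
    """
    if p_y >= p_x:
        return p_x, p_y  # already consistent, nothing to fix
    if direction == "forward":
        return p_x, p_x  # raise P(Y) up to P(X)
    if direction == "backward":
        return p_y, p_y  # lower P(X) down to P(Y)
    raise ValueError(f"unknown direction: {direction!r}")


correct = reconcile(0.9, 0.1, "forward")   # (0.9, 0.9)
mistake = reconcile(0.9, 0.1, "backward")  # (0.1, 0.1)
```

In the `backward` case the system's output X has effectively been scored by how plausible its downstream consequence Y looked beforehand, which is the sense in which the prediction becomes manipulative.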
Does that help?
Thanks for the thoughts!
This description seems rather different from your original beam search story, no? In your original story, you were describing an incentive the system had to direct the world in order to make it easier to predict. I don’t see how this incentive arises here.
I’m not entirely convinced that predictions should be made in a way that’s completely divorced from their effects on the world. For example, the prediction “You aren’t going to think about ice cream” would appear to be self-falsifying. It seems like the most useful AI system would be one whose predictions tend to remain true even after being made.
(By the way, I hope I’m not coming across as antagonistic in this thread—I’m still replying because I think this is a really important topic and I’m hoping we can hammer it out together! And I think a crisp description of a problem is frequently the first step to solving it.)
This is great, thanks again for your time and thoughtful commentary!
RE “I’m not entirely convinced that predictions should be made in a way that’s completely divorced from their effects on the world.”: My vision is to make a non-agential question-answering AGI, thus avoiding value alignment. I don’t claim that this is definitely the One Right Answer To AGI Safety (see “4. Give up, and just make an agent with value-aligned goals” in the post), but I think it is a plausible (and neglected) candidate answer. See also my post “In defense of oracle (tool) AI research” for why I think it would solve the AGI safety problem.
If an AGI applies its intelligence and world model to its own output, choosing that output partly for its downstream effects as predicted by the model, then I say it’s a goal-seeking agent. In this case, we need to solve value alignment, even if the goal is as simple as “answer my question”. (We would need to make sure that the goal is what it’s supposed to be, as opposed to a proxy goal, or a weird alien interpretation where rewiring the operator’s brain counts as “answer my question”.) Again, I’m not opposed to building agents after solving value alignment, but we haven’t solved value alignment yet, and thus it’s worth exploring the other option: build a non-agent which does not intelligently model the downstream effects of its output at all (or, if it does model them incidentally, does not do anything with that information).
Interfacing with a non-agential AGI is generally awkward. You can’t directly ask it to do things, or to find a better way to communicate. My proposal here is to ask questions like “If there were no AGIs in the world, what’s the likeliest way that a person would find a cure for Alzheimer’s?” This type of question does not require the AGI to think through the consequence of its output, and it also has other nice properties (it should give less weird and alien and human-unfriendly answers than the solutions a direct goal-seeking agent would find).
OK, that’s my grand vision and motivation, and why I’m hoping for “no reasoning about the consequences of one’s output whatsoever”, as opposed to finding self-fulfilling predictions. (Maybe very very mild optimization for the consequences of one’s outputs is OK, but I’m nervous.)
Your other question was: if a system is making manipulative predictions, towards what goal is it manipulating? Well, you noticed correctly, I’m not sure, and I keep changing my mind. The answer may also differ depending on the details of the algorithm.
My top expectation is that it will manipulate towards getting further inputs that its model thinks are typical, high-probability inputs. If X implies Y, and P(Y) is low, that might sometimes spuriously push down P(X), and thus the system will pick those X’s that result in high P(Y).
My secondary expectation is that it might manipulate towards unambiguous, low-entropy outputs. This is the expectation if the system picks out the single most likely ongoing long-term context, and outputs a prediction contingent on that. (If instead the system randomly draws from the probability distribution of all possible contexts, this wouldn’t happen, as suggested by interstice’s comments on this page.) So if X1 leads to one of 500 slightly different Y1’s (Y1a, Y1b, ...), while X2 definitely leads to only one specific Y2, then Y2 is probably the most likely single Y, even if all the Y1’s in aggregate are likelier than Y2; so X2 is at an unfair advantage.
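Here is the X1/X2 example with numbers attached, as a hedged sketch. The specific probabilities (0.6 for X1 spread evenly over its 500 continuations, 0.4 for X2) and the helper names `most_likely_single_context` and `marginal_over_x` are made up for illustration. The code compares picking the single most likely (X, Y) context against summing over all contexts that share the same X.

```python
def most_likely_single_context(joint):
    """Return the single (x, y) pair with the highest joint probability."""
    return max(joint, key=joint.get)


def marginal_over_x(joint):
    """Sum joint probabilities over y to get P(x)."""
    totals = {}
    for (x, _y), p in joint.items():
        totals[x] = totals.get(x, 0.0) + p
    return totals


# Assumed numbers: X1 has total probability 0.6, split evenly across
# 500 slightly different continuations; X2 has probability 0.4 and
# exactly one continuation.
joint = {("X1", f"Y1_{i}"): 0.6 / 500 for i in range(500)}
joint[("X2", "Y2")] = 0.4

best_single = most_likely_single_context(joint)  # ("X2", "Y2")
marginals = marginal_over_x(joint)
best_marginal = max(marginals, key=marginals.get)  # "X1"
```

Each individual Y1 continuation has probability 0.0012, far below Y2's 0.4, so the single-context rule outputs X2 even though X1 is the likelier output overall. Marginalizing over contexts (or sampling a context from the distribution, which selects X1 with probability 0.6) does not have this bias.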
Beyond those two, I suspect there can be other goals but they depend on the algorithm and its heuristics.