Example 1 basically seems to be the problem of output diversity in generative models. This can be a real problem, but there are ways around it: e.g., instead of outputting the highest-probability individual sequence, which will certainly look “manipulative” as you say, sample from the implied distribution over sequences. Then the sentence involving “pyrite” will be output with probability proportional to how likely the model thinks “pyrite” is on its own, disregarding subsequent tokens.
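To make that concrete, here is a minimal sketch (toy token names and logits, nothing from any real model) of greedy decoding versus sampling from the implied distribution:

```python
import torch

# Hypothetical next-token scores; "pyrite" is the single likeliest answer,
# but it is not certain.
vocab = ["pyrite", "gold", "quartz"]
logits = torch.tensor([1.2, 0.9, 0.3])
probs = torch.softmax(logits, dim=0)

# Greedy / highest-probability decoding: always emits "pyrite",
# no matter how uncertain the model actually is.
greedy = vocab[torch.argmax(probs).item()]

# Sampling from the implied distribution: "pyrite" is emitted only in
# proportion to its probability; other answers still appear sometimes.
sample = vocab[torch.multinomial(probs, num_samples=1).item()]

print(probs.tolist(), greedy, sample)
```

The same contrast holds token-by-token when generating whole sequences.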
For example 2, I wrote a similar post a few months ago (and in fact, this idea seems to have been proposed and forgotten a few times on LW). But for gradient descent-based learning systems, I don’t think the effect described will take place.
The reason is that gradient-descent-based systems are only updated towards what they actually observe. Let’s say we’re training a system to predict EU laws. If it predicts “The EU will pass potato laws...” but sees “The EU will pass corn laws...” the parameters will be updated to make “corn” more likely to have been output than “potato”. There is no explicit global optimization for prediction accuracy.
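As a rough illustration of how local that update is (toy vocabulary and hypothetical logits, not any particular system), the loss is computed against the observed token only, and the gradient simply shifts probability mass toward “corn”:

```python
import torch
import torch.nn.functional as F

vocab = {"corn": 0, "potato": 1, "wheat": 2}
logits = torch.tensor([0.2, 1.5, 0.1], requires_grad=True)  # currently favors "potato"

# The observed continuation was "corn"; the loss compares the prediction
# against that single observed token and nothing else.
loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([vocab["corn"]]))
loss.backward()

# Gradient = softmax(logits) - onehot("corn"): negative for "corn",
# positive for the others. No term looks ahead at downstream accuracy.
print(logits.grad)
```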
As you train to convergence, the predictions of the model will attempt to approach a fixed point: a set of predictions that imply themselves. However, due to the local nature of the update, this fixed point will not be selected to be globally minimal; it will just be the first one the model falls into. (This is different from the problems with “local minima” you may have heard about in ordinary neural network training—those go away in the infinite-capacity limit, whereas local minima among fixed points do not.) The fixed point should look something like “what I would predict if I output [what I would predict if I output [what I would predict ...]]]”, where the initial prediction is some random gibberish. This might look pretty weird, but it’s not optimizing for global prediction accuracy.
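Here is a cartoon of that fixed-point behavior, with completely made-up self-referential dynamics rather than a real predictor; the only point is that iteration settles into whichever fixed point the initialization happens to fall toward:

```python
import numpy as np

# A cartoon only: invented dynamics for
# "predict, conditional on having output my own previous prediction".
def update(p):
    return 0.5 * p + 0.5 * np.round(p)   # every integer value is a fixed point

for p0 in [0.3, 0.7]:                     # two different bits of "random gibberish"
    p = p0
    for _ in range(60):
        p = update(p)
    # Each run settles into the fixed point its initialization falls toward
    # (~0.0 vs ~1.0); nothing selects a globally "best" fixed point.
    print(p0, round(p, 6))
```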
Thank you for the links!! Sorry I missed them! I’m not sure I understand your comments though and want to clarify:
I’m going to try to rephrase what you said about example 1. Maybe the text in any individual journal article about pyrite is perplexing, but given that the system expects some article about pyrite there, it should ramp the probabilities of individual articles up or down such that the total probability of seeing a journal article about pyrite, conditional on the answer “pyrite”, is 100%. (By the same token, “The following is a random number: 2113164” is, in a sense, an unsurprising text string.) I agree with you that a system that creates a sensible, self-consistent probability distribution for text strings would not have a problem with example 1 if we sample from that distribution. (Thanks.) I am concerned that we will build a system with heuristic-guided search processes, not self-consistent probability estimates, and that this system will have a problem with example 1. After all, humans are subject to the conjunction fallacy etc., so I assume AGIs will be too, right? Unless we flag this as a critical safety requirement and invent good techniques to ensure it. (I updated the post in a couple places to clarify this point, thanks again.)
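Here is that rephrasing with made-up numbers: every specific pyrite article can be individually perplexing while the conditional probabilities still sum to 100%, so a self-consistent model gives up nothing by answering “pyrite”.

```python
# Made-up numbers for illustration only.
p_pyrite = 0.6                             # probability that the answer is "pyrite"
p_articles_given_pyrite = [0.001] * 1000   # 1000 possible articles, each unlikely

# Conditional on "pyrite", *some* article about pyrite is certain to follow.
assert abs(sum(p_articles_given_pyrite) - 1.0) < 1e-9

# Any one specific article is very surprising (high perplexity)...
p_one_article = p_pyrite * p_articles_given_pyrite[0]    # 0.0006
# ...but the model still assigns the full 0.6 to "pyrite" overall.
print(p_one_article, p_pyrite * sum(p_articles_given_pyrite))
```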
For gradient descent, yes, such systems are “only updated towards what they actually observe”, but they may “observe” high-level abstractions and not just low-level features. A system can learn about a new high-level context in which the low-level word-sequence statistics would be very different from when superficially-similar text appeared in the past. So I don’t understand how you’re ruling out example 2 on that basis.
I mostly agree with what you say about fixed points in principle, but with the additional complication that the system’s beliefs may not reflect reality, especially if the beliefs come about through abstract reasoning (in the presence of imperfect information) rather than trial-and-error. If the goal is “No manipulative answers at all ever, please just try to predict the most likely masked bits in this data-file!”—then hopefully that trial-and-error will not happen, and in this case I think fixed points become a less useful framework for thinking about what’s going on.
No worries, I also missed the earlier posts when I wrote mine. There’s lots of stuff on this website.
I endorse your rephrasing of example 1. I think my position is that it’s just not that hard to create a “self-consistent probability distribution”. For example, say you trained an RNN to predict sequences, like in this post. Despite being very simple, it already implicitly represents a probability distribution over sequences. If you train it with back-propagation on a confusing article involving pyrite, then its weights will be updated to try to model the article better. However, if “pyrite” itself was easy to predict, then the weights that led to it outputting “pyrite” will *not* be updated. The same thing holds for modern Transformer networks, which predict the next token based only on what they have seen so far. (Here is a paper with a recent example using GPT-2. Note the degeneracy of maximum-likelihood sampling, and how this becomes less of a problem when just sampling from the implied distribution.)
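A small sketch of the gradient behavior being described (hypothetical logits, no actual RNN or Transformer): with per-token cross-entropy, a token that was already predicted confidently contributes almost no gradient, so the weights behind an easy “pyrite” barely move while the confusing article text drives the update.

```python
import torch
import torch.nn.functional as F

targets = torch.tensor([0, 3])                  # ["pyrite", <confusing article token>]
logits = torch.tensor([[5.0, 0.0, 0.0, 0.0],    # "pyrite" was predicted confidently
                       [0.0, 0.0, 0.0, 0.0]],   # the article token was a toss-up
                      requires_grad=True)

per_token_loss = F.cross_entropy(logits, targets, reduction="none")
per_token_loss.sum().backward()

print(per_token_loss)                 # ~0.02 for "pyrite", ~1.39 for the article token
print(logits.grad.abs().sum(dim=1))   # gradient is tiny for the already-easy token
```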
I agree that this sort of manipulative prediction could be a problem in principle, but it does not seem to occur in recent ML systems. (Although there are some things that are somewhat like this: the earlier paper I linked and mode collapse both involve neglecting high-entropy components of the distribution. However, the most straightforward generation and training schemes do not incentivize this.)
For example 2, the point about gradient descent is this: while it might be the case that outputting “Help I’m stuck in a GPU Factory” would ultimately result in higher accuracy, the way the gradient is propagated would not encourage the agent to behave manipulatively. This is because, *locally*, “Help I’m stuck in a GPU Factory” decreases accuracy, so that behavior (or policies leading to it) will be dis-incentivized by gradient descent. It may be the case that this would result in easier predictions later, but the structure of the reward function does not lead to any optimization pressure towards such manipulative strategies. Learning taking place over high-level abstractions doesn’t change anything, because any high-level abstractions leading to locally bad behavior will likewise be dis-incentivized by gradient descent.
Thanks, that’s helpful! I’ll have to think about the “self-consistent probability distribution” issue more, and thanks for the links. (ETA: Meanwhile I also added an “Update 2” to the post, offering a different way to think about this, which might or might not be helpful.)
Let me try the gradient descent argument again (and note that I am sympathetic, and indeed I made (what I think is) that exact argument a few weeks ago, cf. Self-Supervised Learning and AGI Safety, section title “Why won’t it try to get more predictable data?”). My argument here is not assuming there’s a policy of trying to get more predictable data for its own sake, but rather that this kind of behavior arises as a side-effect of an algorithmic process, and that all the ingredients of that process are either things we would program into the algorithm ourselves or things that would be incentivized by gradient descent.
The ingredients are things like “Look for and learn patterns in all accessible data”, which includes low-level patterns in the raw data, higher-level patterns in the lower-level patterns, and (perhaps unintentionally) patterns in accessible information about its own thought process (“After I visualize the shape of an elephant tusk, I often visualize an elephant shortly thereafter”). It includes searching for transformations (cause-effect, composition, analogies, etc.) between any two patterns it already knows about (“sneakers are a type of shoe”, or more problematically, “my thought processes resemble the associative memory of an AGI”), and cataloging these transformations when they’re found. Stuff like that.
So, “make smart hypotheses about one’s own embodied situation” is definitely an unintended side-effect, and not rewarded by gradient descent as such. But as its world-model becomes more comprehensive, and as it continues to automatically search for patterns in whatever information it has access to, “make smart hypotheses about one’s own embodied situation” would just be something that happens naturally, unless we somehow prevent it (and I can’t see how to prevent it). Likewise, “model one’s own real-world causal effects on downstream data” is neither desired by us nor rewarded (as such) by gradient descent. But it can happen anyway, as a side-effect of the usually-locally-helpful rule of “search through the world-model for any patterns and relationships which may impact our beliefs about the upcoming data”. Likewise, we have the generally-helpful rule “Hypothesize possible higher-level contexts that span an extended swathe of text surrounding the next word to be predicted, pick one such context based on how surprising it would be given what the system knows about the preceding text and the world-model, and then make a prediction conditional on that context”. All these ingredients combine to produce the pathological behavior of choosing “Help I’m trapped in a GPU”. That’s my argument, anyway...