No worries, I also missed the earlier posts when I wrote mine. There’s lots of stuff on this website.
I endorse your rephrasing of example 1. I think my position is that it’s just not that hard to create a “self-consistent probability distribution”. For example, say you trained an RNN to predict sequences, like in this post. Despite being very simple, it already implicitly represents a probability distribution over sequences. If you train it with back-propagation on a confusing article involving pyrite, its weights will be updated to model the article better. However, if “pyrite” itself were easy to predict, then the weights that lead to it outputting “pyrite” will *not* be updated. The same holds for modern Transformer networks, which predict the next token based only on what they have seen so far. (Here is a paper with a recent example using GPT-2. Note the degeneracy of maximum likelihood sampling, and how this becomes less of a problem when just sampling from the implied distribution.)
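To make that concrete, here’s a minimal sketch (PyTorch, with a toy vocabulary and random tokens standing in for the article; the model and numbers are mine, not from the linked paper). The softmax over next-token logits defines p(next token | prefix), so the product over positions is an implicit distribution over sequences, and positions the model already predicts confidently contribute almost nothing to the weight update:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, emb_dim, hidden_dim = 50, 16, 32

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):              # tokens: (batch, seq_len)
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)                 # next-token logits at each position

model = TinyLM()
article = torch.randint(0, vocab_size, (1, 10))   # stand-in for the pyrite article
logits = model(article[:, :-1])                   # predict token t+1 from tokens up to t
targets = article[:, 1:]

loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
# For the target token at each position, the gradient of the loss with respect
# to its logit is (p_target - 1) up to a constant factor, so positions the
# model already predicts confidently and correctly ("pyrite" in an easy spot)
# contribute almost nothing to the update.
```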
I agree that this sort of manipulative prediction could be a problem in principle, but it does not seem to occur in recent ML systems. (Although there are some things that are somewhat like this: the earlier paper I linked and mode collapse both involve neglecting high-entropy components of the distribution. However, the most straightforward generation and training schemes do not incentivize this.)
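For the generation side of that claim, here’s a hedged sketch of the contrast, with made-up logits standing in for a trained model’s output at one decoding step:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 50
# Stand-in for a trained model's next-token logits at one decoding step; in
# practice these would come from the network's forward pass on the context.
next_logits = torch.randn(1, vocab_size)

probs = F.softmax(next_logits, dim=-1)
greedy = next_logits.argmax(dim=-1)                 # maximum-likelihood choice: the mode only
sampled = torch.multinomial(probs, num_samples=1)   # a draw from the implied distribution

# Repeating the greedy step tends to collapse generation onto low-entropy
# continuations (the degeneracy mentioned above); ancestral sampling keeps the
# high-entropy components of the distribution the model actually represents.
```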
For example 2, the point about gradient descent is this: while it might be the case that outputting “Help I’m stuck in a GPU Factory” would ultimately result in higher accuracy, the way the gradient is propagated would not encourage the agent to behave manipulatively. This is because, *locally*, “Help I’m stuck in a GPU Factory” decreases accuracy, so that behavior (or policies leading to it) will be disincentivized by gradient descent. It may be that this results in easier predictions later, but the structure of the training objective does not create any optimization pressure toward such manipulative strategies. Learning taking place over high-level abstractions doesn’t change anything, because any high-level abstractions leading to locally bad behavior will likewise be disincentivized by gradient descent.
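A small sketch of what I mean by *locally* (toy tensors standing in for a real model’s outputs; the setup is mine, not from your post): under teacher forcing, the training loss is a sum of per-position terms, each of which only compares the prediction at that position to the word that actually appears there.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, seq_len = 50, 8
logits = torch.randn(1, seq_len, vocab_size, requires_grad=True)  # stand-in for model outputs
targets = torch.randint(0, vocab_size, (1, seq_len))              # the words that actually came next

# One cross-entropy term per position, each against the observed next word.
per_position = F.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1), reduction="none"
)
per_position.mean().backward()

# Each term only compares the prediction at position t to the token observed
# at position t. An output like "Help I'm stuck in a GPU Factory" that
# mismatches the observed text raises the loss at its own position, and no
# term rewards it for making later positions easier to predict.
```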
Thanks, that’s helpful! I’ll have to think about the “self-consistent probability distribution” issue more, and thanks for the links. (ETA: Meanwhile I also added an “Update 2” to the post, offering a different way to think about this, which might or might not be helpful.)
Let me try the gradient descent argument again (and note that I am sympathetic; indeed I made (what I think is) that exact argument a few weeks ago, cf. Self-Supervised Learning and AGI Safety, section title “Why won’t it try to get more predictable data?”). My argument here does not assume a policy of trying to get more predictable data for its own sake; rather, it’s that this kind of behavior arises as a side-effect of an algorithmic process, and that all the ingredients of that process are either things we would program into the algorithm ourselves or things that would be incentivized by gradient descent.
The ingredients are things like “Look for and learn patterns in all accessible data”, which includes low-level patterns in the raw data, higher-level patterns in the lower-level patterns, and (perhaps unintentionally) patterns in accessible information about its own thought process (“After I visualize the shape of an elephant tusk, I often visualize an elephant shortly thereafter”). It also includes searching for transformations (cause-effect, composition, analogies, etc.) between any two patterns it already knows about (“sneakers are a type of shoe”, or more problematically, “my thought processes resemble the associative memory of an AGI”), and cataloging these transformations when they’re found. Stuff like that.
So, “make smart hypotheses about one’s own embodied situation” is definitely an unintended side-effect, and not rewarded by gradient descent as such. But as its world-model becomes more comprehensive, and as it continues to automatically search for patterns in whatever information it has access to, “make smart hypotheses about one’s own embodied situation” would just be something that happens naturally, unless we somehow prevent it (and I can’t see how to prevent it). Likewise, “model one’s own real-world causal effects on downstream data” is neither desired by us nor rewarded (as such) by gradient descent. But it can happen anyway, as a side-effect of the usually-locally-helpful rule “search through the world-model for any patterns and relationships which may impact our beliefs about the upcoming data”. Likewise, we have the generally-helpful rule “Hypothesize possible higher-level contexts that span an extended swathe of text surrounding the next word to be predicted, pick one such context according to how surprising it would be given what it knows about the preceding text and the world-model, and then make a prediction conditional on that context”. All these ingredients combine to produce the pathological behavior of choosing “Help I’m trapped in a GPU”. That’s my argument, anyway...