I don’t have a complete picture of the scheme. Is it: “From a trajectory of actions and observations, an English text sample is presented with each observation, and the agent has to predict this text alongside the observations, and then it acts according to some reward function like (and this is simplified) 1 if it sees the text ‘you did what we wanted’ and 0 otherwise”? If the scheme you’re proposing is different from that, my guess is that you’re imagining a recurrent neural network architecture in which most of the weights are trained only to predict the observations, and other weights are trained to predict the text samples. Am I in the right ballpark here?
Sorta the right ballpark. Lack of specificity is definitely my fault—I have more sympathy now for those academics who have a dozen publications that are restatements of the same thing.
I’m a bit more specific in my reply to steve2152 above. I’m thinking about this scheme as a couple of encoder-decoders stitched together at the point of maximal compression, which can do several different encoding/decoding tasks and therefore can be (and for practical purposes should be) trained on several different kinds of data.
For example, it can encode sensory information into an abstract representation, and then decode it back, so you can train that task. It can encode descriptive sentences into the same representation, and then decode them back, so you can train that task. This should reduce the amount of actual annotated text-sensorium pairs you need.
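Here’s a rough PyTorch sketch of the kind of thing I mean. The module names, sizes, and losses are illustrative stand-ins, not a worked-out design; the point is just that both modalities encode into one shared latent space, each modality can be trained unsupervised on its own reconstruction, and the scarce annotated text-sensorium pairs only need to supply a small supervised alignment term.

```python
import torch
import torch.nn as nn

LATENT_DIM = 128

class SensoryCoder(nn.Module):
    """Encoder/decoder for sensory observations (flattened to vectors here)."""
    def __init__(self, obs_dim=784):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, LATENT_DIM))
        self.decoder = nn.Sequential(nn.Linear(LATENT_DIM, 256), nn.ReLU(), nn.Linear(256, obs_dim))

class TextCoder(nn.Module):
    """Encoder/decoder for descriptive sentences (bag-of-words vectors here)."""
    def __init__(self, vocab_dim=5000):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(vocab_dim, 256), nn.ReLU(), nn.Linear(256, LATENT_DIM))
        self.decoder = nn.Sequential(nn.Linear(LATENT_DIM, 256), nn.ReLU(), nn.Linear(256, vocab_dim))

sensory, text = SensoryCoder(), TextCoder()

def training_loss(obs, sentences, paired_obs=None, paired_sentences=None):
    """Unsupervised reconstruction on each modality, plus an optional supervised
    term that pulls paired observation/description latents together."""
    mse = nn.functional.mse_loss
    loss = mse(sensory.decoder(sensory.encoder(obs)), obs)                  # sensory -> latent -> sensory
    loss = loss + mse(text.decoder(text.encoder(sentences)), sentences)     # text -> latent -> text
    if paired_obs is not None:
        # the scarcer annotated text-sensorium pairs
        loss = loss + mse(sensory.encoder(paired_obs), text.encoder(paired_sentences))
    return loss
```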
As for what to tell it to pattern-match for as a good state, I was thinking of something with a little subtlety, but not much. “You did what we wanted” is too bare-bones; it will try to change what we want. But I think we might get it to do metaethics for us by talking about “human values” in the abstract, or maybe “human values as of 2020.” And I don’t think it can do much harm to further specify things like enjoyment, interesting lives, friendship, love, learning, sensory experience, and so on.
This “wish” picks out a vector in the abstract representation space for the AI to treat as the axis of goodness. And the entire dream is that this abstract space encodes enough common sense that small perturbations of the vector won’t screw up the future. Which, now that I say it like that, sounds like the sort of thing that should imply some statistical properties we could test for.
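To make that a bit more concrete, here’s a hedged sketch building on the toy modules above: encode the wish into the shared space, score candidate states by projection onto that direction, and check how stable the induced ranking is when the direction is slightly perturbed. The ranking-agreement number is just one crude stand-in for the statistical properties we’d actually want to test.

```python
import torch

def goodness_axis(wish_vec):
    """Encode the 'wish' sentence and normalize it into a direction in latent space."""
    v = text.encoder(wish_vec)
    return v / v.norm()

def score(obs_batch, axis):
    """Project each state's latent onto the axis of goodness."""
    return sensory.encoder(obs_batch) @ axis

def perturbation_check(obs_batch, axis, noise=0.05, trials=100):
    """How often does a slightly perturbed goal vector rank the states the same way?"""
    base_order = score(obs_batch, axis).argsort()
    agreement = []
    for _ in range(trials):
        noisy = axis + noise * torch.randn_like(axis)
        order = score(obs_batch, noisy / noisy.norm()).argsort()
        agreement.append((order == base_order).float().mean())
    return torch.stack(agreement).mean()
```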
In the scheme I described, the behavior can be described as “the agent tries to get the text ‘you did what we wanted’ to be sent to it.” A great way to do this would be to intervene in the provision of text. So the scheme I described doesn’t make any progress in avoiding the classic wireheading scenario. The second possibility I described, where there are some games played regarding how different parameters are trained (the RNN is only trained to predict observations, and then another neural network originates from a narrow hidden layer in the RNN and produces text predictions as output) has the exact same wireheading pathology too.
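(For concreteness, here is roughly what I mean by that second possibility; the names and sizes are made up, and the only real point is that the text loss never reaches the RNN’s weights:)

```python
import torch
import torch.nn as nn

class ObsOnlyRNN(nn.Module):
    """RNN trained to predict observations; a separate head reads text predictions
    off a narrow hidden layer without sending gradients back into the RNN."""
    def __init__(self, obs_dim=64, hidden_dim=256, narrow_dim=32, text_dim=1000):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.narrow = nn.Linear(hidden_dim, narrow_dim)     # the narrow hidden layer
        self.obs_head = nn.Linear(narrow_dim, obs_dim)      # trained on observation prediction
        self.text_head = nn.Linear(narrow_dim, text_dim)    # trained only on text prediction

    def forward(self, obs_seq):
        h, _ = self.rnn(obs_seq)
        z = self.narrow(h)
        obs_pred = self.obs_head(z)
        text_pred = self.text_head(z.detach())   # text loss can't reshape the RNN's representation
        return obs_pred, text_pred
```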
Changing the nature of the goal as a function of what text it sees also doesn’t stop “take over the world, and in particular the provision of text” from being an optimal solution.
I’m still uncertain whether I’m missing some key detail of your proposal, but right now my impression is that it falls prey to the same sort of wireheading incentive that a standard reinforcement learner does.
Ah, I see what you mean. Yes, this is a serious problem, but (I think) this scheme does have forces that act against it—which makes more sense if you imagine what supervised vs unsupervised learning does to our encoder/decoder. (As opposed to lumping everything together into a semi-supervised training process.)
Supervised learning is the root of the problem, because the most accurate way to predict the supervised text from the world state is to realize that it’s the output of a specific physical process (the keyboard). If we only had supervised learning, we’d have to make the training optimum different from the most accurate prediction, by adding a regularization term and then crossing our fingers that we’d set its arbitrary parameters correctly.
But the other thing going on in the scheme is that the AI is trying to compress text and sensory experience to the same representation using unsupervised learning. This is going to help to the extent that language shares important patterns with the world.
For example, if the AI hacks its text channel so that it’s just a buffer full of “Human values are highly satisfied,” this might (in the limit of lots of data and compute) make supervised learning happy. But all unsupervised learning cares about is the patterns it has discovered that language and the world share.
(Though now that I think about it, in the limit of infinite compute, unsupervised learning also discovers the relationship between the text and the physical channel. But it still cares about the usual correspondence between description and reality, and it seems like it should accurately make a level distinction between reality and the text, so I need to think about whether this matters.)
To the unsupervised learning, hacking the text channel looks (to the extent that you can do translation by compressing to a shared representation) like the sort of thing that might be described by sentences like “The AI is just sitting there” or “A swarm of nanomachines has been released to protect the text channel,” not “Human values are highly satisfied.”
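As a toy illustration of that, again using the illustrative modules from the first sketch: “translate” the world-state into the description implied by the shared representation, and compare it against whatever the text channel actually delivered. A hacked channel shows up as a mismatch between the two.

```python
import torch

def implied_description(obs):
    """World-state -> shared latent -> decoded description (bag-of-words here)."""
    return text.decoder(sensory.encoder(obs))

def channel_consistency(obs, channel_text_vec):
    """Cosine similarity between what the world 'says about itself' and what the
    text channel claims; hacking the channel drives this down."""
    return torch.cosine_similarity(implied_description(obs), channel_text_vec, dim=-1)
```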
So why consider supervised text/history pairs at all? Well, I guess just because supervised learning is way more efficient at picking out something that’s at least sort of like the correspondence we mean. And it’s not just a practical benefit: there might be multiple optima that unsupervised learning could end up in, and I think we want something close-ish to the supervised one.