In the scheme I described, the agent’s behavior can be summarized as “try to get the text ‘you did what we wanted’ sent to it.” A great way to do this would be to intervene in the provision of text, so the scheme makes no progress in avoiding the classic wireheading scenario. The second possibility I described, where games are played with which parameters are trained on what (the RNN is trained only to predict observations, and a separate neural network originating from a narrow hidden layer in the RNN produces text predictions as output), has exactly the same wireheading pathology.
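To make that second possibility concrete, here’s a rough toy sketch (the specific layers, shapes, and names are my illustrative guesses, not a spec from the original proposal):

```python
# Toy PyTorch sketch (illustrative only): the RNN is trained solely on
# observation prediction; a separate text head reads from a narrow hidden
# layer, and detach() stops its gradients from shaping the RNN.
import torch
import torch.nn as nn

class ObservationRNN(nn.Module):
    def __init__(self, obs_dim, hidden_dim, narrow_dim, vocab_size):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.narrow = nn.Linear(hidden_dim, narrow_dim)     # narrow hidden layer
        self.obs_head = nn.Linear(narrow_dim, obs_dim)      # observation predictor
        self.text_head = nn.Linear(narrow_dim, vocab_size)  # separate text predictor

    def forward(self, obs_seq):
        h, _ = self.rnn(obs_seq)
        z = torch.tanh(self.narrow(h))
        obs_pred = self.obs_head(z)
        # The text head trains only its own weights; the RNN never sees
        # gradients from the text loss.
        text_logits = self.text_head(z.detach())
        return obs_pred, text_logits
```

The detach() is the “game played” with which parameters get trained, but it only changes what the representation optimizes for, not the agent’s incentives about the channel.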
Changing the nature of the goal as a function of what text the agent sees also doesn’t stop “take over the world, and in particular the provision of text” from being an optimal solution.
I’m still uncertain whether I’m missing some key detail in your proposal, but my current impression is that it falls prey to the same sort of wireheading incentive as a standard reinforcement learner.
Ah, I see what you mean. Yes, this is a serious problem, but (I think) this scheme does have forces that act against it—which makes more sense if you imagine what supervised vs unsupervised learning does to our encoder/decoder. (As opposed to lumping everything together into a semi-supervised training process.)
Supervised learning is the root of the problem, because the most accurate way to predict the supervised text from the world state is to realize that it’s the output of a specific physical process (the keyboard). If we only had supervised learning, we’d have to make the training optimum different from the most accurate prediction by adding a regularization term and then crossing our fingers that we’d set its arbitrary parameters correctly.
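In loss terms, the supervised-only picture is something like this minimal sketch (the particular penalty and its weight are exactly the arbitrary knobs I mean, stand-ins rather than a real proposal):

```python
import torch.nn.functional as F

# Minimal sketch: with supervised learning alone, the optimum is pure
# predictive accuracy, so we'd have to bolt on a regularizer and hand-pick
# lam. That hand-picking is the "crossing our fingers" part.
def supervised_text_loss(text_logits, text_target, latent, lam=0.1):
    prediction_loss = F.cross_entropy(text_logits, text_target)
    penalty = latent.pow(2).mean()  # placeholder regularizer (assumption)
    return prediction_loss + lam * penalty
```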
But the other thing going on in the scheme is that the AI is trying to compress text and sensory experience to the same representation using unsupervised learning. This is going to help to the extent that language shares important patterns with the world.
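Schematically, the unsupervised part looks something like this (a toy sketch with linear layers standing in for real encoders/decoders; the shapes and losses are my assumptions):

```python
import torch.nn as nn
import torch.nn.functional as F

# Toy sketch: text and observations are encoded into one shared latent space
# and reconstructed from it, so the objective rewards latents that capture
# patterns language and the world have in common.
class SharedRepresentation(nn.Module):
    def __init__(self, obs_dim, text_dim, latent_dim):
        super().__init__()
        self.enc_obs = nn.Linear(obs_dim, latent_dim)
        self.enc_text = nn.Linear(text_dim, latent_dim)
        self.dec_obs = nn.Linear(latent_dim, obs_dim)
        self.dec_text = nn.Linear(latent_dim, text_dim)

    def unsupervised_loss(self, obs, text):
        z_obs, z_text = self.enc_obs(obs), self.enc_text(text)
        return (F.mse_loss(self.dec_obs(z_obs), obs)
                + F.mse_loss(self.dec_text(z_text), text))

    def describe(self, obs):
        # "Translation" falls out of the shared latent: encode an
        # observation, decode it as text.
        return self.dec_text(self.enc_obs(obs))
```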
For example, if the AI hacks its text channel so that it’s just a buffer full of “Human values are highly satisfied,” this might (in the limit of lots of data and compute) make supervised learning happy. But all unsupervised learning cares about is the patterns it has discovered that language and the world share.
(Though now that I think about it, in the limit of infinite compute, unsupervised learning also discovers the relationship between the text and the physical channel. But it still cares about the usual correspondence between description and reality, and it seems like it should correctly maintain the level distinction between reality and the text, so I need to think about whether this matters.)
To the unsupervised learning, hacking the text channel looks (to the extent that you can do translation by compressing to a shared representation) like the sort of thing that might be described by sentences like “The AI is just sitting there” or “A swarm of nanomachines has been released to protect the text channel,” not “Human values are highly satisfied.”
So why consider supervised text/history pairs at all? Well, I guess just because supervised learning is far more efficient at picking out something that’s at least sort of like the correspondence we mean. And it’s not just a practical benefit: there might be multiple optima that unsupervised learning could end up in, and I think we want something close-ish to the supervised one.