Ah, I see what you mean. Yes, this is a serious problem, but (I think) this scheme does have forces that act against it—which makes more sense if you imagine what supervised vs unsupervised learning does to our encoder/decoder. (As opposed to lumping everything together into a semi-supervised training process.)
Supervised learning is the root of the problem, because the most accurate way to predict the supervised text from the world state is to realize that it’s the output of a specific physical process (the keyboard). If we only had supervised learning, we’d have to make the training optimum different from the most accurate prediction, by adding a regularization term and then crossing our fingers that we’d correctly set the arbitrary parameters in it.
But the other thing going on in the scheme is that the AI is trying to compress text and sensory experience to the same representation using unsupervised learning. This is going to help to the extent that language shares important patterns with the world.
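To make the two pressures concrete, here's a minimal toy sketch of what such a semi-supervised objective might look like. All of the names, dimensions, and linear maps are purely illustrative assumptions, not the actual scheme: the point is just that the supervised term predicts text from the world state, while the unsupervised term forces both modalities through one shared representation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions, chosen only to make the loss structure concrete.
D_TEXT, D_WORLD, D_SHARED = 8, 12, 4

# Linear stand-ins for the encoders/decoders mapping each modality into a
# single shared representation and back out. A real system would learn these.
enc_text = rng.normal(size=(D_SHARED, D_TEXT))
enc_world = rng.normal(size=(D_SHARED, D_WORLD))
dec_text = rng.normal(size=(D_TEXT, D_SHARED))
dec_world = rng.normal(size=(D_WORLD, D_SHARED))

def supervised_loss(text, world):
    """Predict the supervised text from the world state via the shared code."""
    pred_text = dec_text @ (enc_world @ world)
    return float(np.mean((pred_text - text) ** 2))

def unsupervised_loss(text, world):
    """Reconstruction through the shared code: both text and sensory
    experience must compress into the same representation space."""
    text_recon = dec_text @ (enc_text @ text)
    world_recon = dec_world @ (enc_world @ world)
    return float(np.mean((text_recon - text) ** 2)
                 + np.mean((world_recon - world) ** 2))

text = rng.normal(size=D_TEXT)
world = rng.normal(size=D_WORLD)

# The combined semi-supervised objective this comment is pulling apart:
total = supervised_loss(text, world) + unsupervised_loss(text, world)
```

The channel-hacking worry lives entirely in the first term: it can be satisfied by controlling what arrives at the text channel. The second term only cares whether the shared code captures the patterns that language and the world have in common.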
For example, if the AI hacks its text channel so that it’s just a buffer full of “Human values are highly satisfied,” this might (in the limit of lots of data and compute) make supervised learning happy. But all unsupervised learning cares about is the patterns it discovered that language and the world share.
(Though now that I think about it, in the limit of infinite compute, unsupervised learning also discovers the relationship between the text and the physical channel. But it still cares about the usual correspondence between description and reality, and it seems like it should accurately maintain the level distinction between reality and the text, so I need to think about whether this matters.)
To the unsupervised learning, hacking the text channel looks (to the extent that you can do translation by compressing to a shared representation) like the sort of thing that might be described by sentences like “The AI is just sitting there” or “A swarm of nanomachines has been released to protect the text channel,” not “Human values are highly satisfied.”
So why consider supervised text/history pairs at all? Well, I guess just because supervised learning is way more efficient at picking out something that’s at least sort of like the correspondence we mean. And it’s not just a practical benefit: there might be multiple optima that unsupervised learning could end up in, and I think we want one close-ish to the supervised case.