This might have a better chance of working if you give the AI strong inductive biases to perceive rewarding stimuli not as intrinsically rewarding, but rather as evidence that something of value is happening and merely generating the stimulus. We want the AI to see smiles as good things, not because upwardly curving mouths are good, but because they are generated by good mental states that we actually value.
For that, it would need a generative model (likelihood function) whose hidden states (hypotheses), rather than the sensory data (evidence) themselves, are mapped to its value function. The latter kind of feed-forward mapping, which most current deep learning amounts to, can only offer cheap heuristics, as far as I can tell.
The AI should also privilege hypotheses about the true nature of the value function in which the mental/physiological states of eudaimonic agents are the key predictors. In your happy-video example, the existence of the video player and video files is sufficient to explain its training data, but that hypothesis (R_2) does not involve other agents. R_1, on the other hand, assumes that the happiness of other agents is what predicts both its input data and its value labels, so it should be given priority. (For this to work, the AI would also need a model of human emotional expression to compare its sensory data against in the likelihood function.)
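To make the distinction concrete, here is a minimal toy sketch, not a proposal for an actual architecture: value is evaluated over the posterior of inferred hidden states rather than over the raw sensory data, and a prior that favours agent-involving hypotheses (R_1 over R_2) decides which explanation wins when both fit the data equally well. All names and numbers below are hypothetical placeholders.

```python
# Toy illustration (hypothetical): value attaches to inferred hidden states of a
# generative model, not to the raw sensory data, and the prior privileges
# hypotheses in which other agents' mental states do the predicting (R_1)
# over agent-free hypotheses (R_2).

HIDDEN_STATES = ["happy_person", "video_of_happy_person"]  # R_1, R_2

def likelihood(observation, hidden_state):
    """P(observation | hidden state). Both hypotheses explain the smile
    pixels equally well; the numbers are illustrative placeholders."""
    table = {
        ("smile_pixels", "happy_person"): 0.9,
        ("smile_pixels", "video_of_happy_person"): 0.9,
    }
    return table.get((observation, hidden_state), 0.01)

def value_of_state(hidden_state):
    """Value is defined over latent states (agents' mental states),
    never over the observation itself."""
    return {"happy_person": 1.0, "video_of_happy_person": 0.0}[hidden_state]

def expected_value(observation, prior):
    """E[value | observation] under the posterior over hidden states,
    rather than a feed-forward map from pixels to value."""
    unnorm = {s: likelihood(observation, s) * prior[s] for s in HIDDEN_STATES}
    total = sum(unnorm.values())
    return sum((p / total) * value_of_state(s) for s, p in unnorm.items())

# A flat prior leaves the two explanations tied; a prior that privileges
# agent-involving hypotheses resolves the tie in favour of R_1.
flat_prior  = {"happy_person": 0.5, "video_of_happy_person": 0.5}
agent_prior = {"happy_person": 0.8, "video_of_happy_person": 0.2}

print(expected_value("smile_pixels", flat_prior))   # 0.5: ambiguous
print(expected_value("smile_pixels", agent_prior))  # 0.8: R_1 favoured
```

The point of the sketch is only that wireheading-style explanations (R_2) can fit the sensory evidence just as well as the intended one, so the tie has to be broken by the prior over hypotheses, not by the data.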
Thanks—this feels somewhat similar to my vague idea of “define wireheading, tell AI to avoid situations like that”.