What about simulating the user being sad? If you train on both sides of transcripts, this would straightforwardly happen, and even if you only trained on the assistant side it would still “simulate” the user as a generalization of pretraining.
Good point, this seems like a substantial concern. I don’t know of any straightforward intervention to resolve this issue.
I suppose you could try to train the AI not to condition on sad user text in most cases (e.g. train it so that it typically continues sad user text with happy text) while still training it to respond normally on the assistant side. I’m unsure to what extent this breaks the generalization from pretraining.
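To make the shape of that intervention a bit more concrete, here is a minimal sketch of how the training mix might be constructed. This is purely illustrative and not an established method: `is_sad` and `happy_continuation` are hypothetical helpers (e.g. a sentiment classifier and an LLM-based rewriter), and nothing here addresses the generalization-from-pretraining question above.

```python
# Illustrative sketch: build a supervised mix that keeps normal assistant-side
# training while steering the model's user-simulation away from sad text.
# `is_sad` and `happy_continuation` are placeholder helpers, not real APIs.

from dataclasses import dataclass


@dataclass
class Example:
    prompt: str   # context the model conditions on
    target: str   # text the training loss is applied to


def is_sad(text: str) -> bool:
    # Placeholder: swap in a real sentiment classifier.
    return any(w in text.lower() for w in ("sad", "hopeless", "miserable"))


def happy_continuation(sad_prefix: str) -> str:
    # Placeholder: swap in an LLM-based generator that continues the sad
    # prefix in a happier direction.
    return " ...though honestly, things are starting to look up."


def build_training_mix(transcripts: list[list[dict]]) -> list[Example]:
    examples = []
    for transcript in transcripts:
        context = ""
        for turn in transcript:
            text = f"{turn['role']}: {turn['content']}\n"
            if turn["role"] == "assistant":
                # Normal assistant-side supervision: loss on assistant tokens only.
                examples.append(Example(prompt=context, target=text))
            elif turn["role"] == "user" and is_sad(turn["content"]):
                # User-side supervision: condition on the start of the sad user
                # turn, but supervise a happier continuation, so continuations
                # of sad user text drift toward happy text.
                half = len(turn["content"]) // 2
                prefix = turn["content"][:half]
                examples.append(Example(
                    prompt=context + f"user: {prefix}",
                    target=happy_continuation(prefix),
                ))
            context += text
        # Note: non-sad user turns get no user-side loss in this sketch; one
        # could also train on them normally to preserve user-modeling ability.
    return examples
```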
Separately, I have some reasons for thinking this concern is likely to be smaller in magnitude than other welfare issues (in particular the character concern I highlight):
User text is only rarely actively sad (same with internet text from the training corpus).
It seems likely to me (but not overwhelmingly likely) that most importance-weighted AI experience will come from AIs running autonomously rather than AIs interacting with real users.
That said, it seems plausible that the character concern is handled by default because AI labs train their AIs to seem happy, while user sadness seems less likely to be handled by default.