Good point; this seems like a substantial concern, and I don’t know of any straightforward intervention that resolves it.
I suppose you could try to train the AI not to condition on sad user text in most cases (e.g., train it so that it typically continues sad user text with happy text) while still training it to respond normally on the assistant side. I’m unsure to what extent this would break generalization from pretraining.
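To make the proposal concrete, here’s a minimal sketch of what that data construction might look like. Everything in it is hypothetical: `classify_sentiment`, the record format, and the example strings are illustrative stand-ins, not any lab’s actual pipeline.

```python
# Hypothetical sketch of the proposed intervention: leave assistant-side
# replies unchanged, but pair sad user text with happy continuations in
# the pretrain-style (raw continuation) portion of the training mix.
# classify_sentiment and the record format are illustrative stand-ins.

from dataclasses import dataclass


@dataclass
class Example:
    prompt: str      # text the model conditions on
    target: str      # text the model is trained to produce
    kind: str        # "continuation" (pretrain-style) or "assistant"


def classify_sentiment(text: str) -> str:
    """Toy stand-in for a real sentiment classifier."""
    sad_markers = ("hopeless", "so sad", "everything is awful")
    return "sad" if any(m in text.lower() for m in sad_markers) else "neutral"


def build_examples(user_text: str, assistant_reply: str,
                   happy_continuation: str) -> list[Example]:
    examples = []
    if classify_sentiment(user_text) == "sad":
        # Continuation-style data: the model usually follows sad user
        # text with happy text, so conditioning on sadness doesn't pull
        # the model's own generations sad.
        examples.append(Example(user_text, happy_continuation, "continuation"))
    # Assistant-side data is left as-is: the model still learns to
    # respond normally (e.g. empathetically) in the assistant role.
    examples.append(Example(f"User: {user_text}\nAssistant:",
                            assistant_reply, "assistant"))
    return examples


if __name__ == "__main__":
    for ex in build_examples(
        user_text="I feel hopeless, nothing is going right.",
        assistant_reply="I'm sorry you're going through this. Do you "
                        "want to talk about what's been happening?",
        happy_continuation=" But the next week things finally turned "
                           "around, and it was genuinely good.",
    ):
        print(ex.kind, "->", repr(ex.target[:50]))
```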
Separately, I have some reasons for thinking this concern is likely to be smaller in magnitude than other welfare issues (in particular the character concern I highlight):

- User text is only rarely actively sad (and the same is true of internet text in the training corpus).
- It seems likely to me (though not overwhelmingly likely) that most importance-weighted AI experience will come from AIs running autonomously rather than AIs interacting with real users.
That said, it seems plausible that the character concern gets handled by default, since AI labs train their AIs to seem happy, while user sadness seems less likely to be handled by default.