IIUC, OP is proposing that the Predictor agent be directly incentivized (trained) to predict/model the User.
I think it’s important to note that “predict the User” is very different from “optimize for the User’s values”. I think that, if one were to optimize really hard for “predict the User”, one would end up doing things like simulating the User in all kinds of situations, most of which would be pretty horrible for the User (assuming the User’s preferences/values were anywhere near as fragile/selective as human values).
I think training an AGI to predict/understand human values would likely end in something a lot worse than death. I think “predict/understand the User” should probably never be directly incentivized at all. (We’d want the AI to do that only for instrumental reasons, in its attempts to optimize for the user’s values.)
(I suspect lots of people, when imagining “simulating lots of situations in order to understand User’s values”, end up sampling those imagined situations from something like “the set of situations I’ve experienced”. But our experiences are very strongly biased towards desirable situations—we spend most of our time optimizing our lives, after all. Given that human values are rather fragile/selective, I suspect the vast majority of all possible experiences/situations would rank very very low according to “human values”; no adversarial selection needed: most actually random experiences are probably pretty unpleasant.)
Given the above, I’m confused as to why one would propose directly incentivizing “predict the User”. Do others disagree that these risks are real, think the risks are worth the benefits, or something else?
That is explicitly why the predictor is scored on how well it fulfills the user’s values, and not merely on how well it predicts them. I noted that an AI merely trained to predict would likely destroy the world and do things like dissecting human brains to better model our values.
Yep, I understood that you intended for the Predictor to also/primarily be scored on how well it fulfills the User’s values.
I’m modeling our disagreement something like this:
Aiyen: It could be a good idea to directly incentivize a powerful AI to learn to predict humans, so long as one also directly incentivizes it to optimize for human values.
rvnnt: Directly incentivizing a powerful AI to learn to predict humans would likely lead to the AI allocating at least some fraction of its (eventually vast) resources to e.g. simulating humans experiencing horrible things. Thus it would probably be a very bad idea to directly incentivize a powerful AI to learn to predict humans, even if one also incentivizes it to optimize for human values.
Does that seem roughly correct to you?
(If yes: how would you guarantee that the Predictor does not end up allocating lots of resources to some kind of mindcrime?)