Yep, I understood that you intended for the Predictor to also/primarily be scored on how well it fulfills the User’s values.
I’m modeling our disagreement something like this:
Aiyen: It could be a good idea to directly incentivize a powerful AI to learn to predict humans, so long as one also directly incentivizes it to optimize for human values.
rvnnt: Directly incentivizing a powerful AI to learn to predict humans would likely lead to the AI allocating at least some fraction of its (eventually vast) resources to e.g. simulating humans experiencing horrible things. Thus it would probably be a very bad idea to directly incentivize a powerful AI to learn to predict humans, even if one also incentivizes it to optimize for human values.
Does that seem roughly correct to you?
(If yes, I’m curious how you’d guarantee that the Predictor does not end up allocating lots of resources to some kind of mindcrime?)