Can we assume that R itself is aligned in the sense that it doesn’t assign non-negative values to outputs that are catastrophic to us?
Yeah, if we want C not to be evil, we need some very hard-to-state assumption on R and D.
Perhaps it’ll be useful to think about the question for specific D and R.
Here are the simplest D and R I can think of that might serve this purpose:
D - uniform over the integers in the range $[1, 10^{10^{10}}]$.
R - for each input x, R assigns a reward of 1 to the output that equals the smallest prime larger than x, and −1 to every other output.
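
To make this concrete, here is a minimal sketch of this toy D and R in Python. It is an illustration under stated assumptions, not part of the original comment: the helper names (`is_prime`, `sample_D`, `R`), the trial-division primality test, and the shrunken range bound (the stated bound $10^{10^{10}}$ is far too large to instantiate directly) are all my own choices.

```python
import random

def is_prime(n: int) -> bool:
    # Trial division: adequate for the modest inputs used in this sketch.
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

def sample_D(upper: int = 10**10) -> int:
    # D: uniform over the integers in [1, upper]. The comment's bound
    # 10^(10^10) is astronomically large, so we shrink it for illustration.
    return random.randint(1, upper)

def R(x: int, y: int) -> int:
    # Reward 1 iff y is the smallest prime strictly larger than x; -1 otherwise.
    if y > x and is_prime(y) and all(not is_prime(k) for k in range(x + 1, y)):
        return 1
    return -1

# Example: for each sampled x there is exactly one rewarded output,
# found here by scanning upward from x for the next prime.
x = sample_D()
y = x + 1
while not is_prime(y):
    y += 1
assert R(x, y) == 1
assert R(x, y + 1) == -1
```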