Could we approximate a naive function that agents would be attempting to maximize (for the sake of understanding)? I imagine it would include:
1. If the user were to rate this answer, where supplemental & explanatory information is allowed, what would be their expected rating?
2. How much did the actions of this agent positively or negatively affect the system’s expected corrigibility?
3. If a human were to rate the overall safety of this action, setting corrigibility aside, what would be their expected rating?
*Note: for #1 and #3, the user should perhaps be able to call HCH additional times in order to evaluate the true quality of the answer. Also, #3 is mostly a “catch-all”; it would of course be better to define it in more concrete detail, and preferably to break it up.*
A very naive answer value function would be something like:
HumanAnswerRating + CorrigibilityRating + SafetyRating
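As a rough illustrative sketch only (the class name, function name, [0, 1] scale, and unweighted sum are all assumptions made for the example, not part of any concrete proposal), this might look like:

```python
from dataclasses import dataclass

@dataclass
class ActionEvaluation:
    """Hypothetical container for the three ratings described above.
    Each is assumed to be a human-supplied score in [0, 1], possibly
    produced with the help of additional HCH calls."""
    human_answer_rating: float   # expected user rating of the answer (#1)
    corrigibility_rating: float  # expected effect on system corrigibility (#2)
    safety_rating: float         # expected catch-all safety rating (#3)

def naive_answer_value(ev: ActionEvaluation) -> float:
    """The naive value function: an unweighted sum of the three ratings."""
    return ev.human_answer_rating + ev.corrigibility_rating + ev.safety_rating

# Example: an answer the user likes, a neutral effect on corrigibility,
# and a somewhat safe action.
print(naive_answer_value(ActionEvaluation(0.75, 0.5, 0.25)))  # 1.5
```

In practice one would presumably want weights on the three terms, and a more decomposed safety term per the note above, but the unweighted sum captures the shape of the naive version.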