I’m not 100% sold on explaining actions as a solution here. It seems like the basic sorts of “attack” (exploiting human biases or limitations, sending an unintended message to the supervisor, sneaking a message to a third party that will help control the reward signal) still work fine—so long as the search process includes the explainer as part of the environment. And if it doesn’t, we run into the usual issue with such schemes: the AI predictably gets its predictions wrong, and so you need some guarantee that you can keep this AI and its descendants in this unnatural state.