Let’s consider an alternate form of whitelisting, where we instead know the specific object-level transitions per time step that would have occurred under the naive counterfactual (where the agent does nothing). Discarding the whitelist, we penalize distance from those counterfactual latent-space transitions at each time step.
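To make this concrete, here is a minimal sketch of one way such a per-step penalty could be computed. The state encoder `encode`, the use of latent-difference vectors, and the Euclidean metric are all illustrative assumptions on my part, not a settled part of the proposal:

```python
import numpy as np

def transition_penalty(encode, s_prev, s_next, s_cf_prev, s_cf_next):
    """Per-step penalty: how far the observed latent-space transition is
    from the transition under the do-nothing counterfactual.

    `encode` is an assumed state encoder returning a latent vector."""
    observed = encode(s_next) - encode(s_prev)                # observed transition
    counterfactual = encode(s_cf_next) - encode(s_cf_prev)    # counterfactual transition
    # Euclidean distance is just one candidate metric on transitions.
    return np.linalg.norm(observed - counterfactual)
```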
How would you define a distance measure on transitions? Since this would be a continuous measure of how good transitions are, rather than a discrete list of good transitions, in what sense is it a form of whitelisting?
This basically locks us into a particular world-history. While it might be manipulation- and stasis-free, it’s a different kind of clinginess: you’re essentially saying “optimize this utility the best you can without letting there be an actual impact”. That said, I hadn’t thought of this formulation before, and it’s plausible it’s even more desirable than whitelisting, as it seems to get us a low/no-impact agent semi-robustly. The trick is then allowing favorable effects to take place without sliding back into stasis/manipulation.
I expect that in complex tasks where we don’t know the exact actions we would like the agent to take, this would prevent the agent from being useful or coming up with new unforeseen solutions. I have this concern about whitelisting in general, though giving the agent the ability to query the human about non-whitelisted effects is an improvement. The distance measure on transitions could also be traded off with reward (or some other task-specific objective function), so if an action is sufficiently useful for the task, the high reward would dominate the distance penalty.
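As a rough sketch of what that trade-off could look like (the additive weighting scheme and the names below are my assumptions, not part of the proposal):

```python
def shaped_return(task_rewards, penalties, impact_weight=1.0):
    """Trade the task objective off against the transition-distance penalty.

    With a finite `impact_weight`, an action whose task reward is large
    enough can still dominate its penalty, so useful-but-unanticipated
    behavior isn't ruled out entirely."""
    return sum(r - impact_weight * p for r, p in zip(task_rewards, penalties))
```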
This would still have offsetting issues though. In the asteroid example, if the agent deflects the asteroid, then future transitions (involving human actions) are very different from default transitions (involving no human actions), so the agent would have an offsetting incentive.
You’re right, it isn’t. I should have been more precise:
“Suppose we have an impact measure that considers whitelist-esque object transitions, but doesn’t use a whitelist. Instead, it penalizes how dissimilar the observed object transitions are at a time step to those which were counterfactually expected.”
“I expect that in complex tasks where we don’t know the exact actions we would like the agent to take, this would prevent the agent from being useful or coming up with new unforeseen solutions. I have this concern about whitelisting in general, though giving the agent the ability to query the human about non-whitelisted effects is an improvement.”
I think this failure mode on its own is relatively benign, given querying.
What I find more worrying is that an intelligent agent would likely be able to hard-optimize while avoiding penalties (e.g., through the allowed transitions, by skating by on technicalities re: object recognition, and so on).
I suspect that an ideal solution will have far fewer parameters (if any).