You’re right, it isn’t. I should have been more precise:
“Suppose we have an impact measure that considers whitelist-esque object transitions, but doesn’t use a whitelist. Instead, it penalizes how dissimilar the observed object transitions at each time step are from those which were counterfactually expected.”
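For concreteness, here’s a minimal sketch of the kind of penalty I’m imagining; the per-object transition embeddings, the cosine dissimilarity, and the counterfactual “expected” rollout are illustrative assumptions rather than a concrete proposal:

```python
import numpy as np

def counterfactual_transition_penalty(observed_transitions, expected_transitions):
    """Penalize how dissimilar this step's observed object transitions are
    from the counterfactually expected ones (e.g. under a no-op policy).

    Both arguments are hypothetical: dicts mapping an object id to an
    embedding (1-D array) of that object's (before, after) transition.
    """
    penalty = 0.0
    for obj_id, observed in observed_transitions.items():
        expected = expected_transitions.get(obj_id)
        if expected is None:
            # The object changed when no change was expected: full penalty.
            penalty += 1.0
            continue
        # Cosine similarity between observed and expected transition embeddings.
        cos_sim = np.dot(observed, expected) / (
            np.linalg.norm(observed) * np.linalg.norm(expected) + 1e-8
        )
        # Penalize the dissimilarity (0 when the transition matches expectation).
        penalty += 1.0 - cos_sim
    return penalty
```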
I expect that in complex tasks where we don’t know the exact actions we would like the agent to take, this would prevent the agent from being useful or coming up with new unforeseen solutions. I have this concern about whitelisting in general, though giving the agent the ability to query the human about non-whitelisted effects is an improvement.
I think this failure mode on its own is relatively benign, given querying.
What I find more worrying is that an intelligent agent would likely be able to hard-optimize while avoiding penalties (whether through the allowed transitions, by skating by on technicalities re: object recognition, or by other means).
I suspect the/a ideal solution will have far fewer parameters (if any).