So I don’t know how we could quantify “stopping humans from having effects” as an effect without a strong offsetting incentive.
Let’s consider an alternate form of whitelisting, where we instead know the specific object-level transitions per time step that would have occurred in the naive counterfactual (where the agent does nothing). Discarding the whitelist, we instead penalize distance from the counterfactual latent-space transitions at that time step.
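As a rough sketch (not anything from the original post, and all names here are hypothetical), the per-step penalty might look something like this, assuming we have a latent-state encoder `encode` and access to the counterfactual next state under the agent doing nothing:

```python
import numpy as np

# Hypothetical sketch: penalize how far the observed latent-space transition at
# each time step is from the transition that would have occurred had the agent
# done nothing (the naive counterfactual). `encode` and the counterfactual
# states are assumed to be supplied by some world model; none of this is a
# settled design.

def transition_penalty(encode, state, next_state, counterfactual_next_state):
    """Distance between the observed and the no-op (counterfactual) transition."""
    observed = encode(next_state) - encode(state)
    counterfactual = encode(counterfactual_next_state) - encode(state)
    return np.linalg.norm(observed - counterfactual)  # e.g. Euclidean distance

def trajectory_penalty(encode, states, next_states, counterfactual_next_states):
    """Sum of the per-step penalties over a trajectory."""
    return sum(
        transition_penalty(encode, s, s_next, s_cf)
        for s, s_next, s_cf in zip(states, next_states, counterfactual_next_states)
    )
```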
This basically locks us into a particular world-history. While this might be manipulation- and stasis-free, this is a different kind of clinginess. You’re basically saying “optimize this utility the best you can without letting there be an actual impact”. However, I actually hadn’t thought of this formulation before, and it’s plausible it’s even more desirable than whitelisting, as it seems to get us a low/no-impact agent semi-robustly. The trick is then allowing favorable effects to take place without getting back to stasis/manipulation.
There’s another problem, however: “people conclude that this AI design doesn’t work and try another variant” is a pretty plausible result of this naive counterfactual. When people imagine the counterfactual, it seems they usually think about “what would happen if the agent did nothing and then people shrugged and went about their lives, forgetting about AGI”. The odds of that being the counterfactual are pretty slim. It’s even possible that any agents/variants people would make in the counterfactual would have undefined behavior… Sufficiently-similar agents would also simulate what would happen if they did nothing, got tweaked and rebooted, and then ran the same simulation… where would it bottom out, and with what conclusion? Probably with a wholly-different kind of agent being tried out.
The iterative formulation doesn’t seem to have that failure mode.
> Let’s consider an alternate form of whitelisting, where we instead know the specific object-level transitions per time step that would have occurred in the naive counterfactual (where the agent does nothing). Discarding the whitelist, we instead penalize distance from the counterfactual latent-space transitions at that time step.
How would you define a distance measure on transitions? Since this would be a continuous measure of how good transitions are, rather than a discrete list of good transitions, in what sense is it a form of whitelisting?
> This basically locks us into a particular world-history. While this might be manipulation- and stasis-free, this is a different kind of clinginess. You’re basically saying “optimize this utility the best you can without letting there be an actual impact”. However, I actually hadn’t thought of this formulation before, and it’s plausible it’s even more desirable than whitelisting, as it seems to get us a low/no-impact agent semi-robustly. The trick is then allowing favorable effects to take place without getting back to stasis/manipulation.
I expect that in complex tasks where we don’t know the exact actions we would like the agent to take, this would prevent the agent from being useful or coming up with new unforeseen solutions. I have this concern about whitelisting in general, though giving the agent the ability to query the human about non-whitelisted effects is an improvement. The distance measure on transitions could also be traded off with reward (or some other task-specific objective function), so if an action is sufficiently useful for the task, the high reward would dominate the distance penalty.
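To make the trade-off concrete, a hypothetical combined objective (again just a sketch, with `impact_weight` a free parameter we would have to choose) could simply subtract a scaled penalty from the task reward, so a sufficiently useful action can outweigh its deviation from the counterfactual:

```python
# Hypothetical sketch of trading the distance penalty off against task reward.

def penalized_return(rewards, penalties, impact_weight=1.0):
    """Task return minus the scaled per-step distance penalties."""
    return sum(r - impact_weight * p for r, p in zip(rewards, penalties))
```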
This would still have offsetting issues though. In the asteroid example, if the agent deflects the asteroid, then future transitions (involving human actions) are very different from default transitions (involving no human actions), so the agent would have an offsetting incentive.
You’re right, it isn’t. I should have been more precise:
“Suppose we have an impact measure that considers whitelist-esque object transitions, but doesn’t use a whitelist. Instead, it penalizes how dissimilar the observed object transitions are at a time step to those which were counterfactually expected.”
> I expect that in complex tasks where we don’t know the exact actions we would like the agent to take, this would prevent the agent from being useful or coming up with new unforeseen solutions. I have this concern about whitelisting in general, though giving the agent the ability to query the human about non-whitelisted effects is an improvement.
I think this failure mode on its own is relatively benign, given querying.
What I find more worrying is that an intelligent agent would likely be able to hard-optimize while avoiding penalties (either through the allowed transitions or by skating by on technicalities re: object recognition, etc.).
I suspect an ideal solution will have far fewer parameters (if any).