You leave money on the table in all the problems where the most money-efficient solution involves violating your constraint. So there’s some selection pressure against you if selection is based on money.

We can (kinda) turn this into a money pump by offering, for a fee, to violate the constraint on the agent’s behalf. Whenever the agent encounters such a situation, it pays you the fee and you do the killing.

Whether or not this counts as a money pump, I think it satisfies the reasons I actually care about money pumps, which are something like “adversarial agents can cheaply construct situations where I pay them money, but the world isn’t actually different”.
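Here’s a toy sketch of that loop (all the numbers are invented), just to show how the constrained agent keeps handing over fees that an unconstrained competitor wouldn’t:

```python
# Toy illustration with invented numbers: an adversary cheaply manufactures
# situations where the money-efficient action violates the constraint, then
# offers to take that action on the constrained agent's behalf for a fee.
GAIN = 100          # value of the constraint-violating action to the agent
FEE = 90            # what the adversary charges to do it instead
SETUP_COST = 1      # adversary's cost to construct each situation

def constrained_round():
    # The constrained agent won't act directly, so it pays the fee.
    return GAIN - FEE          # agent nets 10; adversary nets FEE - SETUP_COST

def unconstrained_round():
    # An unconstrained competitor just takes the efficient action itself.
    return GAIN

rounds = 1_000
print("constrained agent:  ", sum(constrained_round() for _ in range(rounds)))
print("unconstrained agent:", sum(unconstrained_round() for _ in range(rounds)))
print("extracted by adversary:", (FEE - SETUP_COST) * rounds)
```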
Thanks. I don’t think this is as bad as you make it sound. For one thing, you might also have a deontological constraint against paying other people to do your dirty work for you, such that less-principled competitors can’t easily benefit from your principles. For another, the benefits of having deontological constraints might outweigh the costs. For example, suppose you are deontologically constrained to never say anything you don’t believe. You can still pay other people to lie for you, though. But the fact that you are subject to this constraint makes you a pleasure to deal with; people love doing business with you because they know they can trust you (if they are careful to ask the right questions, that is). This benefit could very easily outweigh the costs, including the cost of occasionally having to pay someone else to make a false pronouncement that you can’t make yourself.
I’m not sure how to implement the rule “don’t pay people to kill people”. Say we implement it as a utility function over world-trajectories, where any trajectory containing a killing causally downstream of your actions gets MIN_UTILITY. If MIN_UTILITY is finite, the agent still makes probabilistic tradeoffs (it will accept a small enough chance of downstream murder for a big enough payoff), so it’s probably not what we want. If we use negative infinity instead, the agent can’t ever take actions in a large or uncertain world, because every action has some nonzero chance of leading to a killing. So we need to add the patch that the agent must have been aware, at the time of acting, that the action had a >ϵ chance of causing murder. And I think all of these are vulnerable to blackmail, because you could threaten to cause murders that are causally downstream of its actions.
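To make the two failure modes concrete, here’s a minimal sketch (the trajectory representation and numbers are made up by me) of why a finite MIN_UTILITY still permits probabilistic tradeoffs, while negative infinity forbids acting at all:

```python
import math

def trajectory_utility(trajectory, floor):
    """Utility over a world-trajectory: hit the floor if any killing is
    causally downstream of the agent's actions, else ordinary utility."""
    if trajectory["downstream_killing"]:
        return floor
    return trajectory["base_utility"]

def expected_utility(lottery, floor):
    """lottery: list of (probability, trajectory) pairs."""
    return sum(p * trajectory_utility(t, floor) for p, t in lottery)

# A huge gain with a tiny chance of a causally-downstream killing...
risky = [
    (0.999999, {"downstream_killing": False, "base_utility": 1e7}),
    (0.000001, {"downstream_killing": True,  "base_utility": 0.0}),
]
# ...versus a safe, modest gain.
safe = [(1.0, {"downstream_killing": False, "base_utility": 10.0})]

# Finite MIN_UTILITY: the agent still trades a small murder probability for money.
MIN_UTILITY = -1e9
print(expected_utility(risky, MIN_UTILITY) > expected_utility(safe, MIN_UTILITY))  # True

# Negative infinity: any action with nonzero murder probability scores -inf,
# and in a large, uncertain world that is every action.
print(expected_utility(risky, -math.inf))  # -inf
```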
Maybe I’m confused and you mean “actions that pattern-match to actually paying money directly for murder”, in which case the agent will just use a longer causal chain, or opaque companies that may-or-may-not-cause-murders will appear and trade with it.
If the ultimate patch is “don’t take any action that allows unprincipled agents to exploit you for having your principles”, then maybe there aren’t any edge cases. I’m confused about how to define “exploit”, though.
Yeah, I’m thinking something like that ultimate patch would be good. For now, we could implement it with a simple classifier. Somewhere in my brain there is a subcircuit that hooks up to whatever I’m thinking about and classifies it as exploitation or non-exploitation; I just need a larger subcircuit that reviews the actions I’m considering, asks whether they constitute exploitation, and only takes them if they are unlikely to.
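Something like this bare-bones review loop, where exploitation_probability is a stand-in for that brain subcircuit (the names and the threshold are placeholders I’ve made up):

```python
import random

def exploitation_probability(action, context):
    """Stand-in for the brain subcircuit: estimates the chance that taking
    `action` in `context` constitutes exploitation. A real version would be
    a learned classifier; here it's just a placeholder score in [0, 1]."""
    return random.random()

def review_then_act(candidate_actions, context, threshold=0.05):
    """The 'larger subcircuit': consider each action, estimate whether it is
    exploitation, and only permit those unlikely to be."""
    permitted = [
        a for a in candidate_actions
        if exploitation_probability(a, context) < threshold
    ]
    return permitted  # the agent then chooses among these as usual

if __name__ == "__main__":
    actions = ["pay contractor", "sign opaque deal", "donate"]
    print(review_then_act(actions, context={"counterparty": "unknown shell company"}))
```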
A superintelligence with a deep understanding of how my brain works, or a human-level intelligence with access to an upload of my brain, would probably be able to find adversarial examples to my classifier, things that I’d genuinely think are non-exploitative but which by some ‘better’ definition of exploitation would still count as exploitative.
But maybe that’s not a problem in practice, because yeah, I’m vulnerable to being hacked by a superintelligence, sue me. So are we all. Ditto for adversarial examples.
And this is just the first obvious implementation idea that comes to mind; I think there are probably better ones I could come up with if I spent an hour on it.
There’s also a similar interesting argument here, but I don’t think you get a money pump out of it either: https://rychappell.substack.com/p/a-new-paradox-of-deontology