I’m not sure how to implement the rule “don’t pay people to kill people”. Say we implement it as a utility function over world-trajectories, where any trajectory that involves a killing causally downstream of your actions gets MIN_UTILITY. If MIN_UTILITY is finite, the agent will still make probabilistic tradeoffs (accepting a small chance of murder for a big enough payoff), so that’s probably not what we want. If we use negative infinity instead, then it can’t ever take actions in a large or uncertain world, since every action carries some nonzero chance of a downstream killing. So we need to add the patch that the agent must have been aware, at the time of taking its actions, that the actions had >ϵ chance of causing murder. I think all of these variants are vulnerable to blackmail, because someone could threaten to cause murders that are causally downstream of the agent’s actions.
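To make the finite-vs-infinite penalty problem concrete, here’s a toy sketch (all numbers, the `expected_utility` helper, and the outcome format are made up for illustration, not a proposal): with a finite floor the agent happily accepts a tiny murder-probability for a large enough payoff, and with negative infinity every action that has any nonzero murder-probability scores negative infinity.

```python
import math

# Toy sketch: expected utility of an action over possible world-trajectories,
# where any trajectory containing a downstream killing is clamped to a penalty.

MIN_UTILITY = -1e9  # finite floor; swap in -math.inf to see the paralysis problem


def expected_utility(outcomes, penalty=MIN_UTILITY):
    """outcomes: list of (probability, base_utility, involves_killing)."""
    total = 0.0
    for prob, utility, involves_killing in outcomes:
        total += prob * (penalty if involves_killing else utility)
    return total


# A lucrative action with a tiny chance of a downstream murder...
risky = [(0.999999, 1e7, False), (0.000001, 1e7, True)]
# ...versus a harmless action with a modest payoff.
safe = [(1.0, 10.0, False)]

# With a finite floor, the agent still takes the probabilistic tradeoff.
print(expected_utility(risky) > expected_utility(safe))  # True

# With -inf, any action with nonzero murder-probability scores -inf,
# so in a large/uncertain world nothing is ever permissible.
print(expected_utility(risky, penalty=-math.inf))  # -inf
```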
Maybe I’m confused and you mean “actions that pattern-match to literally paying money directly for murder”, in which case it will just use a longer causal chain, or opaque companies that may-or-may-not-cause-murders will appear and trade with it.
If the ultimate patch is “don’t take any action that allows unprincipled agents to exploit you for having your principles”, then maybe there aren’t any edge cases. I’m confused about how to define “exploit”, though.
Yeah, I’m thinking something like that ultimate patch would be good. For now, we could implement it with a simple classifier. Somewhere in my brain there is a subcircuit that hooks up to whatever I’m thinking about and classifies it as exploitation or non-exploitation; I just need a larger subcircuit that reviews the actions I’m considering, thinks about whether they constitute exploitation, and only executes them if they’re unlikely to.
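A rough sketch of what that review loop might look like, where `exploitation_classifier`, `EXPLOITATION_THRESHOLD`, and `review_and_act` are all hypothetical stand-ins for whatever the actual subcircuit computes:

```python
EXPLOITATION_THRESHOLD = 0.05  # made-up risk tolerance


def exploitation_classifier(action_description: str) -> float:
    """Return an estimated probability that the action constitutes exploitation.

    Placeholder: a real version would be a trained model or an inner
    review process, not keyword matching.
    """
    suspicious_markers = ("threat", "blackmail", "pay to kill")
    text = action_description.lower()
    return 0.9 if any(marker in text for marker in suspicious_markers) else 0.01


def review_and_act(candidate_actions, execute):
    """Filter candidate actions through the classifier before acting."""
    for action in candidate_actions:
        if exploitation_classifier(action) < EXPLOITATION_THRESHOLD:
            execute(action)
        # otherwise drop the action rather than risk exploiting / being exploited
```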
A superintelligence with a deep understanding of how my brain works, or a human-level intelligence with access to an upload of my brain, would probably be able to find adversarial examples to my classifier, things that I’d genuinely think are non-exploitative but which by some ‘better’ definition of exploitation would still count as exploitative.
But maybe that’s not a problem in practice, because yeah, I’m vulnerable to being hacked by a superintelligence, sue me. So are we all. Ditto for adversarial examples.
And this is just the first obvious implementation idea that comes to mind; I think there are probably better ones I could come up with if I spent an hour on it.