Psy-Kosh: Hrm. I’d think “avoid destroying the world” itself to be an ethical injunction too.
The problem is that this is phrased as an injunction over positive consequences. Deontology does better when it’s closer to the action level and negative rather than positive.
Imagine trying to give this injunction to an AI. Then it would have to do anything that it thought would prevent the destruction of the world, without other considerations. Doesn’t sound like a good idea.
Crossman: Eliezer, can you be explicit which argument you’re making? I thought you were a utilitarian, but you’ve been sounding a bit Kantian lately.
If all I want is money, then I will one-box on Newcomb’s Problem. I don’t think that’s quite the same as being a Kantian, but it does reflect the idea that similar decision algorithms in similar epistemic states will tend to produce similar outputs.
Clay: Put more seriously, I would think that being believed to put the welfare of humanity ahead of concerns about personal integrity could have significant advantages itself.
The whole point here is that “personal integrity” doesn’t have to be about being a virtuous person. It can be about trying to save the world without any concern for your own virtue. It can be the sort of thing you’d want a pure nonsentient decision agent to do.
There seems to be a conflict here between not lying to yourself, and holding a traditional rule that suggests you ignore your rationality.
Your rationality is the sum of your full abilities, all components, including your wisdom about what you refrain from doing in the presence of what seem like good reasons.
So, I realize this is really old, but it helped trip the threshold for this idea I’m rolling between my palms.
Do we suspect that a proper AI would interpret “avoid destroying the world” as something like
avoid(prevent self from being cause of)
destroying(analysis indicates destruction threshold ~= 10% landmass remaining habitable, etc.)
the world(interpret as earth, human society...)
(like a modestly intelligent genie)
or do we have reason to suspect that it would hash out the phrase to something more like how a human would read it (given that it’s speaking English, which it learned from humans)?
This idea isn’t quite fully formed yet, but I think there might be something to it.
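To make the "modestly intelligent genie" reading concrete, here is a toy sketch (all names and the 10% threshold are hypothetical, taken from the decomposition above, not from any real system) of how a purely literal parse of "avoid destroying the world" can be satisfied by a state no human would accept:

```python
def habitable_fraction(world_state):
    """Fraction of landmass still habitable in this toy world state."""
    return world_state["habitable_landmass"] / world_state["total_landmass"]

def is_destroyed_literal(world_state, threshold=0.10):
    # Literal reading: "destroyed" iff habitable landmass falls below ~10%,
    # as in the genie-style parse of "destroying" above.
    return habitable_fraction(world_state) < threshold

def injunction_satisfied_literal(world_state):
    # "Avoid destroying the world" checked predicate-by-predicate,
    # with no appeal to what the speaker actually meant.
    return not is_destroyed_literal(world_state)

# A state a human would plainly call "the world destroyed":
grim = {"habitable_landmass": 11.0, "total_landmass": 100.0,
        "human_population": 0}

# The literal parse is satisfied (11% habitable clears the 10% threshold),
# even though every human is gone.
print(injunction_satisfied_literal(grim))  # True
```

The gap between `injunction_satisfied_literal` returning `True` and the human reading returning an emphatic "no" is exactly the worry about handing the injunction to a genie rather than to something that models speaker intent.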