And if we do discover the specific lines of code that will get an AI to perfectly care about its programmer’s True Intentions, such that it reliably self-modifies to better fit them — well, then that will just mean that we’ve solved Friendliness Theory. The clever hack that makes further Friendliness research unnecessary is Friendliness.
It’s still a lot easier to program an AI to care about the programmer’s True Intentions than it is to explicitly program in those intentions. The clever hack helps a lot.
...always assuming the programmer actually has relevant True Intentions that are coherent enough to be cared about.
It’s at least the case that some rational reconstructions are drastically Truer than others. Idealizations and approximations are pretty OK, in the cases where they don’t ruin everything forever.
I am not sure I’ve understood your point here.
I mean, yes, of course, in those cases where X is not unacceptably bad, then X is probably acceptably good, and that’s just as true for X = “an approximation of my True Intentions” as for X = “the annihilation of all life on Earth” or anything else.
And yes, of course, there’s a Vast class of propositions that are drastically worse approximations of my True Intentions than any plausible candidate would be… “Paint everything in the universe blue,” just to pick one arbitrary example.
Did you mean anything more than that? If so, can you unpack a little further?
If not… well, yes. The question is precisely whether there exists any consistent proposition that is a good enough approximation of my True Intentions to not “ruin everything forever,” or whether my True Intentions are sufficiently confused/incoherent/unstable that we prefer to ignore me altogether and listen to someone else instead.
My point is that an approximation that just creates a decently fun world, rather than a wildly overwhelmingly fun world, would still be pretty good. There’s probably no need to aim that low, but if that’s the best we can safely get by extrapolating human volition, without risking a catastrophe, then so be it. Worlds that aren’t completely valueless for most of the stelliferous era are a very small and difficult target to hit; I’m a lot more worried about whether we can hit that target at all than about whether we can hit the very center in a philosophically and neuroscientifically ideal way. If ‘OK’ is the best we can do, then that’s still OK.
I would say that the world we live in is not unacceptably bad for everyone in it. (It is not the case that everyone currently alive would be better off dead.) So there’s an existence proof that an AGI could potentially create a non-terrible circumstance for us to inhabit, relative to our preferences. The questions are (a) How do we spatially and temporally distribute the things we already have and value, so we can have a lot more of them?, and (b) How much better can we make things without endangering value altogether?
I agree that a decently fun world would still be pretty good, and if that maximizes expected value then that’s what we should choose, given a choice.
Of course, the “if/then” part of that sentence is important. In other words, to go from there to “we should therefore choose a decently but not wildly overwhelmingly fun world, given a choice” is unjustified without additional data. Opportunity costs are still costs.
I agree that the world we live in is not unacceptably bad for everyone in it. I agree that not everyone currently alive would be better off dead.
I agree that an AGI (as that term is used here) could potentially create a non-terrible circumstance for us to inhabit. (I’m not really sure what “relative to our preferences” means there. What else might “non-terrible” be relative to?)
I’m not at all sure why the two questions you list are “the” questions. I agree that the first one is worth answering. I’m not sure the second one is, though if we answer it along the way to doing something else that’s OK with me.