Possibly. I said this AGI is “safer and more aligned”, implying that it is a matter of degree, while I think most people would regard these properties as discrete: either you are aligned or you are not. But then I can just replace the phrase with “more likely to be regarded as safe, friendly and aligned”, and the argument remains the same. Moreover, my standard of comparison was CelestAI, who convinces people to upload their brains by creating a “race to the bottom” scenario (i.e., as more and more people move into the simulation, human extinction becomes more likely, until there’s nobody left to object to turning the Solar System into computronium), and who adapts their simulated minds so that they enjoy being ponies; my AGI is way “weaker” than that.
I still think it’s not inappropriate to call my AGI “Friendly”, since its goals are defined by a consistent social welfare function; and it’s at least tempting to call it “safe”, as it is law-abiding and does obey explicit commands. Plus, it is strictly maximizing the utility of the agents it is interacting with according to their own utility functions, inferred from their brain simulations; i.e., it doesn’t even require general interpersonal comparison of utility. I admit I did add a touch of perverse humor (e.g., the story of the neighbors), but that’s pretty much irrelevant to the overall argument.
But I guess arguing over semantics is beside the point, right? I was targeting people who think one can “solve alignment” without “solving value”. Thus, I concede that, after reading the story, you and I can agree that the AGI is not aligned, and so could B, in hindsight; but it’s not clear to me how this AGI could have been aligned in the first place. I believe the interesting discussion to have here is why it ends up displaying unaligned behaviour.
I suspect the problem is that B has (temporally and modally) inconsistent preferences, such that, after the brain upload, the AI can consistently disregard the desires of original-B-in-the-present (even though it still obeys original-B’s explicit commands), because they conflict with simulated-B’s preferences (which weigh more, since sim-B can produce more utility with fewer resources) and with the preferences of past-B (who freely opted for the brain upload). As I mentioned above, one way to deflect my critique is to bite the bullet: as a friend of mine replied, one can just say that one would not want to survive in the real world after a brain upload, and consistently hold that it’d be a waste of resources. Another way to avoid the specific scenario in my story would be to avoid brain simulation, or to not regard simulations as equivalent to oneself, or, finally, to somehow become robust against evidential blackmail.
I don’t think that is an original point, and I now see I was sort of inspired by things I read long ago in debates on coherent extrapolated volition. But I think people still underestimate how hard a problem value is: no one has a complete and consistent system of preferences and beliefs (except the Stoic Sages, who “are more rare than the Phoenix”), and it’s hard to see how we could extrapolate from the way we usually cope with that (e.g., through social norms and satisficing behavior) to AI alignment, as superintelligences can do way worse than Dutch books.
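For readers who haven't met the term, here is a toy sketch of the mildest version of the problem: a money pump, which is the preference analogue of a Dutch book. Everything in it is made up for illustration (the items, the fee, the function name `run_money_pump`) and none of it comes from the story; it only assumes an agent whose preferences run in a cycle and who will pay a small fee for any swap it strictly prefers.

```python
# Toy money pump. All names and numbers here are hypothetical.
# The agent's preferences are cyclic: holding A it prefers C,
# holding C it prefers B, and holding B it prefers A.

def run_money_pump(prefers, start_item, fee, rounds):
    """Trade the agent around its preference cycle, charging `fee` per trade.

    `prefers` maps each item to the item the agent would pay to swap it for.
    Returns a tuple (final_item, total_paid).
    """
    item, paid = start_item, 0
    for _ in range(rounds):
        item = prefers[item]  # the agent accepts: it strictly prefers the new item
        paid += fee           # and it pays the fee for the "upgrade"
    return item, paid


cyclic = {"A": "C", "C": "B", "B": "A"}
final_item, total_paid = run_money_pump(cyclic, "A", fee=1, rounds=3)
print(final_item, total_paid)
```

After one trip around the cycle the agent holds what it started with and has only paid fees, even though every single trade looked like an improvement from where it stood. That is the benign case; the worry in the comment is about what a much more capable optimizer can do with the same kind of inconsistency.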