And then I claim that conditional on that scenario having happened, I am very surprised that we did not notice this deception in any earlier scenario that didn't lead to extinction. And I don't really understand people's intuitions for why that would be the case. I haven't tried to figure that one out, though.
DF I feel like I believe that people notice deception early on but are plausibly wrong about whether or not they've fixed it.
RS After a few failures, you’d think we’d at least know to expect it?
DF Sure, but if your AI is also getting smarter, then that probably doesn't help you that much in detecting it, and only one person has to be wrong and deploy anyway (if actually fixing the deception takes significantly longer than sort-of-but-not-really fixing it). [this comment was written with less than usual carefulness]
RS Seems right, but in general human society / humans seem pretty good at being risk-averse (to the point where it seems to me that, on anything that isn't x-risk, the utilitarian thing is to be more risk-seeking), and I'm hopeful that the same will be true here. (Also, I'm assuming that it would take a bunch of compute, and that it's not that easy for a single person to deploy an AI, though even in that case I'd be optimistic, given that smallpox hasn't been released yet.)
DF Sorry, by 'one person' I meant 'one person in charge of a big team'.
RS The hope is that they are constrained by all the typical constraints on such people (shareholders, governments, laws, public opinion, the rest of the team, etc.). Also, this significantly decreases the number of people who can do the thing, restricts it to people who are "broadly reasonable" (e.g. no terrorists), and allows us to convince each such person individually. Also, I rarely think there is just one person: at the very least you need one person with a bunch of money and resources and another with the technical know-how, and it would be very difficult for these to be the same person.
DF Sure. I guess even with those caveats my scenario doesn’t seem that unlikely to me.
RS Sure, I don’t think this is enough to say “yup, this definitely won’t happen”. I think we do disagree on the relative likelihood of it happening, but maybe not by that much. (I’m hesitant to write a number because the scenario isn’t really fleshed out enough yet for us to agree on what we’re writing a number about.)
DF From your AI Impacts interview: