This sounds like you’re assuming that I’m trying to argue in favor of Friendly AI as the best solution...
(Responding to the whole paragraph but don’t want to quote it all) I would be interested to hear a definition of “AI risk” that does not reduce to “risk of unfriendly outcome”, which is itself defined in terms of friendliness, i.e. its relation to human morality. If, like me, you reject the idea of a consistent, discoverable morality in the first place, and therefore find friendliness to be an ill-formed, inconsistent idea, then it’s hard to say anything concrete about AI risk either. If you have a better definition that does not reduce to alignment with human morality, please provide it.
Mapping the problem starts with defining what the problem is. What is AI risk, without reference to dubious notions of human morality?
I think that you’ve left out a LOT of things that must happen a certain way in order for your AI risk outcomes to come to pass.
Would appreciate hearing more about these.
To start with, there are all the normal, benign things that happen in any large-scale software project and require human intervention. Like, say, the AGI crashes. Or the database that holds its memories becomes inconsistent. Or it deadlocks on choosing actions due to a race condition. The humanity-threatening failure modes presume that the AGI, on its first attempt at break-out, doesn’t suffer any ordinary engineering failures, or that if it does, the humans operating it just fix it and turn it back on. I’m not interested in any arguments that assume the latter, and the former is highly conjunctive.
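To make the deadlock example concrete, here’s a minimal illustrative sketch (nothing to do with any actual AGI codebase; the lock names are made up) of how a race over lock acquisition order can hang a system until a human steps in:

```python
# Illustrative only: two threads take the same two locks in opposite orders,
# so each can end up holding one lock while waiting forever for the other.
import threading

action_lock = threading.Lock()  # hypothetical: guards action selection
memory_lock = threading.Lock()  # hypothetical: guards the memory store

def choose_action():
    with action_lock:
        with memory_lock:   # blocks if update_memory already holds memory_lock
            pass            # ... read memories, pick an action ...

def update_memory():
    with memory_lock:
        with action_lock:   # blocks if choose_action already holds action_lock
            pass            # ... write back new memories ...

t1 = threading.Thread(target=choose_action)
t2 = threading.Thread(target=update_memory)
t1.start(); t2.start()
# Depending on thread timing, both threads can block forever (a deadlock),
# and the process sits there doing nothing until an operator intervenes.
```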
Isn’t that the standard way of figuring out the appropriate corrective actions? First figure out what would happen absent any intervention, then see which points seem most amenable to correction.
I may have misread your intent, and if so I apologize. The first sentence of your post here made it seem like you were countering a criticism, aka advocating for the original position. So I read your posts in that context and may have inferred too much.
If, like me, you reject the idea of a consistent, discoverable morality in the first place, and therefore find friendliness to be an ill-formed, inconsistent idea, then it’s hard to say anything concrete about AI risk either. If you have a better definition that does not reduce to alignment with human morality, please provide it.
Mapping the problem starts with defining what the problem is. What is AI risk, without reference to dubious notions of human morality?
I also reject the idea of a consistent, discoverable morality, at least to the extent that the morality is assumed to be unique. I think that moralities are not so much discovered as constructed: a morality is in a sense an adaptation to a specific environment, and it will keep evolving as the environment changes (including the social environment, so a change in the morality will itself cause further changes to the environment, which will trigger new changes to the morality, and so on). There is no reason to assume that this will produce a consistent moral system: there will be inconsistencies which will need to be resolved when they become relevant, and the order in which they are resolved seems likely to affect the final outcome.
But to answer your actual question: I don’t have a rigorous answer for exactly what the criteria for “success” are. The intuitive answer is that there are some futures I’d consider horrifying if they came true and some I’d consider fantastic, and I want to further the fantastic ones and avoid the horrifying ones. (I presume this is also the case for you, because why else would you care about the topic in the first place?)
Given that this is very much an “I can’t give you a definition, but I know it when I see it” thing, it seems hard to make sure that we avoid the horrifying outcomes without grounding the AIs in human values somehow, and making sure that they share our reaction when they see (or imagine) some particular future. (Either that, or trying to make sure that we evolve to be as powerful as the AIs, but this seems unlikely to me.)
Depending on your definitions, you could say that this still reduces to alignment with human morality, but with the note that my conception of human morality is that of a dynamic process, and that the AIs could be allowed to e.g. nudge the development of our values in a direction that made it easier to reconcile value differences between cultures, even if there was no “objective” reason why that direction was any better or worse than any other.
To start with, there are all the normal, benign things that happen in any large-scale software project and require human intervention. Like, say, the AGI crashes. Or the database that holds its memories becomes inconsistent. Or it deadlocks on choosing actions due to a race condition. The humanity-threatening failure modes presume that the AGI, on its first attempt at break-out, doesn’t suffer any ordinary engineering failures, or that if it does, the humans operating it just fix it and turn it back on. I’m not interested in any arguments that assume the latter, and the former is highly conjunctive.
Are you assuming that there will only ever be one AGI that might try to escape, that its creators never decide to release it, and that it can’t end up effectively in control even if boxed?
(Responding to the whole paragraph but don’t want to quote it all) I would be interested to hear a definition of “AI risk” that does not reduce to “risk of unfriendly outcome”, which is itself defined in terms of friendliness, i.e. its relation to human morality. If, like me, you reject the idea of a consistent, discoverable morality in the first place, and therefore find friendliness to be an ill-formed, inconsistent idea, then it’s hard to say anything concrete about AI risk either. If you have a better definition that does not reduce to alignment with human morality, please provide it.
Mapping the problem starts with defining what the problem is. What is AI risk, without reference to dubious notions of human morality?
To talk about risk you need to define “bad outcomes”. You don’t have to define them in terms of morality, but you have to define them somehow.