Hi all, writing here to introduce myself and to test the waters for a longer post.
I am an experienced software developer by trade. I have an undergraduate degree in mathematics and a graduate degree in a field of applied moral and political philosophy. I am a Rawlsian by temperament but an Olsonian by observation of how the world seems to actually work. My knowledge of real-world AI is in the neighborhood of the layman’s. I have “Learning TensorFlow.js” on my nightstand, which I promised my spouse not to start into until after the garage has been tidied.
Now for what brings me here: I have a strong suspicion that the AI Risk community is under-reporting the highly asymmetric risks from AI in favor of symmetric ones. This is not to deny that a misaligned AI that kills everyone is a scary problem that people should be worrying about. But it seems to me that the distribution of the “reward” in the risk-reward trade-off that gives rise to the misalignment problem in the first place needs more elucidation. If, for most people, the difference between AI developers “getting it right” and “getting it wrong” is being killed at the prompting of other people rather than at the self-direction of an AI, then the relative likelihood of the latter vs. the former is rather academic, is it not?
To use the “AI is like nukes” analogy: if misalignment is analogous to the risk of the whole atmosphere igniting in the first fission bomb test (in terms of a really bad outcome for everyone—yes, I know the probabilities aren’t equivalent), then successful alignment is the analogue of a large number of people getting a working-exactly-as-intended bomb dropped on their city, which, one would think, could have been predicted on July 15th, 1945 as the likely follow-up to a successful Trinity test.
The disutility of a large human population in a world where human labor is becoming fungible with automation seems to me to be the fly in the ointment for anyone hoping to enjoy the benefits of all that automation.
That’s the gist of it, but I can write a longer post. Apologies if I missed a discussion in which this concern was already falsified.
Formal alignment proposals avoid this problem by doing metaethics: mostly something like determining what a person would want if they were perfectly rational (no cognitive biases or logical errors), otherwise basically omniscient, and had an unlimited amount of time to think about it. This is called reflective equilibrium. I think this approach would work for most people, even pretty terrible people. If you extrapolated a terrorist who commits acts of violence for some supposed greater good, for example, they’d realize that the reasoning they used to conclude that those acts of violence were good was wrong.
Corrigibility, on the other hand, is more susceptible to this problem, so you’d want to get the AI to perform a pivotal act, for example destroying every GPU, to prevent other people from deploying harmful (or, for that matter, unaligned) AI.
Realistically, I think that most entities who’d want to use a superintelligent AI like a nuke would probably be too short-sighted to care about alignment, but don’t quote me on that.
Thank you. If I understand your explanation correctly, you are saying that there are alignment solutions that are rooted in more general avoidance of harm to currently living humans. If these turn out to be the only feasible solutions to the not-killing-all-humans problem, then they will produce not-killing-most-humans as a side-effect. Nuke analogy: if we cannot build/test a bomb without igniting the whole atmosphere, we’ll pass on bombs altogether and stick to peaceful nuclear energy generation.
It seems clear that such limiting approaches would be avoided by rational actors under winner-take-all dynamics, so long as other approaches remain that have not yet been falsified.
Follow-up question: does the “any meaningfully useful AI is also potentially lethal to its operator” assertion hold under the significantly different usefulness requirements of a much smaller human population? I’m imagining limited AI that can only just “get the (hard) job done” of killing most people under the direction of its operators, and then support a “good enough” future for the remaining population, the latter not being the hard part, since the Earth itself is pretty good at supporting small human populations.