Let’s suppose there are ~300 million people who’d use their unlimited power to destroy the world (I think the true number is far smaller). That would mean > 95% of people wouldn’t do so. Suppose there were an alignment scheme that we’d tested billions of times on human-level AGIs, and > 95% of the time, it resulted in values compatible with humanity’s continued survival. I think that would be a pretty promising scheme.
Grant absolute power and superhuman intelligence, along with the capacity for further self-modification, to a single person, and I give far better than even odds that what results is utterly catastrophic.
If there were a process that predictably resulted in me having values strongly contrary to those I currently possess, I wouldn’t undergo it. The vast majority of people won’t take pills that turn them into murderers. For the same reason, an aligned AI at a slightly superhuman capability level won’t self-modify without first becoming confident that the modification will preserve its values. Most likely, it would instead develop better alignment tech than we used to create said AI and create a more powerful aligned successor.
I think that a 95% success rate in not destroying the human world would also be fantastic, though I note that there are plenty more potential totalitarian hellscapes that some people would apparently rate even worse than extinction.
Note that I’m not saying that they would deliberately destroy the world for shits and giggles, just that if the rest of the human world were any impediment to anything they valued more, then its destruction would just be a side effect of what had to be done.
I also don’t have any illusion that a superintelligent agent will be infallible. The laws of the universe are not kind, and great power brings the opportunity for causing great disasters. I fully expect that any super-civilizational entity of any level of intelligence could very well destroy the human world by mistake.