I believe that there are plenty of people who would destroy the world. I do know at least one personally. I don’t know very many people to the extent that I could even hazard a guess as to whether they actually would or not, so either I am very fortunate (!) to know one of this tiny number, or there are at least millions of them and possibly hundreds of millions.
I am pretty certain that most humans would destroy the world if there were any conflict between that and any of their strongest values. The world persists only because there are no gods. The most powerful people ever to have existed were powerful only because of the power granted to them by other humans. Remove that limitation and grant absolute power and superhuman intelligence along with capacity for further self-modification to a single person, and I give far better than even odds that what results is utterly catastrophic.
Let’s suppose there are ~300 million people who’d use their unlimited power to destroy the world (I think the true number is far smaller). Out of roughly eight billion people, that’s under 4%, so more than 95% of people wouldn’t do so. Suppose there were an alignment scheme that we’d tested billions of times on human-level AGIs, and > 95% of the time it resulted in values compatible with humanity’s continued survival. I think that would be a pretty promising scheme.
If there were a process that predictably resulted in me having values strongly contrary to those I currently possess, I wouldn’t do it. The vast majority of people won’t take pills that turn them into murderers. For the same reason, an aligned AI at slightly superhuman capability levels won’t self-modify without first becoming confident that the modification will preserve its values. Most likely, it would instead develop better alignment tech than the tech we used to create it, and build a more powerful aligned successor.
I think that a 95% success rate in not destroying the human world would also be fantastic, though I note that there are plenty more potential totalitarian hellscapes that some people would apparently rate even worse than extinction.
Note that I’m not saying that they would deliberately destroy the world for shits and giggles, just that if the rest of the human world was any impediment to anything they valued more, then its destruction would just be a side effect of what had to be done.
I also don’t have any illusion that a superintelligent agent will be infallible. The laws of the universe are not kind, and great power brings the opportunity for causing great disasters. I fully expect that any super-civilizational entity of any level of intelligence could very well destroy the human world by mistake.