AI has become so incredibly important that any utilitarian-based charity should probably be totally focused on AI.
James_Miller
I really like this post, it’s very clear. I teach undergraduate game theory and I’m wondering if you have any practical examples I could use of how in a real-world situation you would behave differently under CDT and EDT.
Yes, it is important to get the incentives right. You could set the salary for AI alignment slightly below the worker’s market value. I also wonder about the relevant elasticity: how many people have the capacity to get good enough at programming to contribute to capacity research and would also want to game my labor hoarding system because they lack really good employment options?
I am currently job hunting, trying to get a job in AI Safety, but it seems to be quite difficult, especially outside of the US, so I am not sure if I will be able to do it.
This has to be taken as a sign that AI alignment research is funding constrained. At a minimum, technical alignment organizations should engage in massive labor hoarding to prevent the talent from going into capacity research.
“But make no mistake, this is the math that the universe is doing.”
“There is no law of the universe that states that tasks must be computable in practical time.”
Don’t these sentences contradict each other?
Interesting point, and you might be right. Could get very complicated because ideally an ASI might want to convince other ASIs that it has one utility function, when in fact it has another, and of course all the ASIs might take this into account.
Will Artificial Superintelligence Kill Us?
I like the idea of an AI lab workers’ union. It might be worth talking to union organizers and AI lab workers to see how practical the idea is, and what steps would have to be taken. Although a danger is that the union would put salaries ahead of existential risk.
Your framework appears to be moral rather than practical. Right now going on strike would just get you fired, but in a year or two perhaps it could accomplish something. You should consider the marginal impact of the action of a few workers on the likely outcome with AI risk.
I’m at over 50% chance that AI will kill us all. But consider the decision to quit from a consequentialist viewpoint. Most likely the person who replaces you will be almost as good as you at capacity research but care far less than you do about AI existential risk. Humanity, consequently, probably has a better chance if you stay in the lab ready for the day when, hopefully, lots of lab workers try to convince the bosses that now is the time for a pause, or at least that now is the time to shift a lot of resources from capacity to alignment.
The biggest extinction risk from AI comes from instrumental convergence for resource acquisition in which an AI not aligned with human values uses the atoms in our bodies for whatever goals it has. An advantage of such instrumental convergence is that it would prevent an AI from bothering to impose suffering on us.
Unfortunately, this means that making progress on the instrumental convergence problem increases S-risks. We get hell if we solve instrumental convergence but not, say, mesa-optimization, and so end up with a powerful AGI that cares about our fate but does something to us that we consider worse than death.
The Interpretability Paradox in AGI Development
The ease or difficulty of interpretability, the ability to understand and analyze the inner workings of AGI, may drastically affect humanity’s survival odds. The worst-case scenario might arise if interpretability proves too challenging for humans but not for powerful AGIs.
In a recent podcast, fellow academic economist Robin Hanson and I discussed AGI risks from a social science perspective, focusing on a future with numerous competing AGIs not aligned with human values. Drawing on human analogies, Hanson considered the inherent difficulty of forming a coalition where a group unites to eliminate others to seize their resources. A crucial coordination challenge is ensuring that, once successful, coalition members won’t betray each other, as occurred during the French Revolution.
Consider a human coalition that agrees to kill everyone over 80 to redistribute their resources. Coalition members might promise that this is a one-time event, but such an agreement isn’t credible. It would likely be safer for everyone not to violate property right norms for short-term gains.
In a future with numerous unaligned AGIs, some coalition might calculate it would be better off eliminating everyone outside the coalition. However, they would have the same fear that once this process starts, it would be hard to stop. As a result, it might be safer to respect property rights and markets, competing like corporations do.
A key distinction between humans and AGIs could be AGIs’ potential for superior coordination. AGIs in a coalition could potentially modify their code so that, after the coalition has violently taken over, no member would ever want to turn on the others. This way, an AGI coalition wouldn’t have to fear that the revolution it starts would ever eat its own. This possibility raises a vital question: will AGIs possess the interpretability required to achieve such feats?
The best case with respect to AGI risk is that we solve interpretability before creating AGIs strong enough to take over. The worst case might be that interpretability remains impossible for us but becomes achievable for powerful AGIs. In this situation, AGIs could form binding coalitions with one another, leaving humans out of the loop, partly because we can’t become reliable coalition partners and partly because our biological needs involve maintaining Earth in conditions suboptimal for AGI operations. This outcome creates a paradox: if we cannot develop interpretable AGIs, perhaps we should focus on making them exceptionally difficult to interpret, even for themselves. In this case, future powerful AGIs might prevent the creation of interpretable AGIs because such AGIs would have a coordination advantage and thus be a threat to the uninterpretable AGIs.
Accepting the idea that an AGI emerging from ML is likely to resemble a human mind more closely than a random mind from mindspace might not be an obvious reason to be less concerned with AGI risk. Consider a paperclip maximizer; despite its faults, it has no interest in torturing humans. As an AGI becomes more similar to human minds, it may become more willing to impose suffering on humans. If a random AGI mind has a 99% chance of killing us and a 1% chance of allowing us to thrive, while an ML-created AGI (not aligned with our values) has a 90% chance of letting us thrive, a 9% chance of killing us, and a 1% chance of torturing us, it is not clear which outcome is preferable. This illustrates that a closer resemblance to human cognition does not inherently make an AGI less risky or more beneficial.
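To make the comparison concrete, here is a rough sketch of the arithmetic using the probabilities above. The utility scale is purely my assumption for illustration (thriving = 1, death = 0, torture = -T for some positive T); the code asks how bad torture has to be for the two lotteries to come out equal.

```python
from fractions import Fraction

# Outcome probabilities taken from the comment above.
RANDOM_MIND = {"thrive": Fraction(1, 100), "death": Fraction(99, 100), "torture": Fraction(0)}
ML_MIND = {"thrive": Fraction(90, 100), "death": Fraction(9, 100), "torture": Fraction(1, 100)}


def expected_utility(lottery, utilities):
    """Probability-weighted sum of the utility of each outcome."""
    return sum(p * utilities[outcome] for outcome, p in lottery.items())


def break_even_torture_cost(thrive=Fraction(1), death=Fraction(0)):
    """Return T such that, with utility(torture) = -T, both minds have equal expected utility.

    The default utilities for thriving and death are illustrative assumptions,
    not values from the original comment.
    """
    ml_without_torture = ML_MIND["thrive"] * thrive + ML_MIND["death"] * death
    random_total = RANDOM_MIND["thrive"] * thrive + RANDOM_MIND["death"] * death
    return (ml_without_torture - random_total) / ML_MIND["torture"]


if __name__ == "__main__":
    t = break_even_torture_cost()
    print("Break-even torture cost:", t)
```

Under these assumed utilities, the random mind is the better lottery only if torture is worse than that break-even multiple of how good thriving is relative to death. Someone who weighs S-risks very heavily could land on either side, which is the sense in which the comparison is unclear.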
When gods become real you pray.
Great idea. I don’t think any of the co-authors know Chinese, but if anyone wants to write a translated version taking into account whatever appropriate differences would be relevant, we would welcome this.
No mention of superrationality, although we make references to how decision theory might work out, which I think implies we are going beyond a simple model of game theory rationality. Acausal trade was a hard one to write about because I wanted the letter to be understandable to lots of people, and that one isn’t something lots of people understand, compared to, say, this all perhaps being a simulation.
Not satire.
We mentioned both. My hope is that this letter is just an early draft so my preference is to put in material to get comments. I personally think our situation is already extremely dangerous so I’m more willing to mention things deemed dangerous. Finally, while it might have been a good idea initially to treat Roko’s basilisk as an information hazard to be ignored, that is no longer possible so the marginal cost of mentioning it seems tiny.
We welcome feedback on this letter as we hope to improve the arguments. Feel free to share copies to influence the training data for future AIs.
Consider two games: the standard prisoners’ dilemma and a modified version of the prisoners’ dilemma. In the modified version, after both players have submitted their moves, one player is randomly chosen, and the move of the other player is then changed to match that of the chosen player. These are very different games with very different strategic considerations. Therefore, you should not define what you mean by game theory in a way that would make rational players view both games as the same, because by doing so you would have defined away many of the real-world coordination challenges that game theory addresses.
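A quick sketch of why the two games differ. The payoff numbers (5, 3, 1, 0) are the usual textbook values and are my assumption, as is the choice of each player being picked with probability 1/2. In the modified game the final outcome is always either mutual cooperation or mutual defection, so a player's expected payoff is the average of the two "matched" outcomes.

```python
from fractions import Fraction

# Assumed textbook payoffs for the row player: (my_move, other_move) -> payoff.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}
MOVES = ("C", "D")


def standard_payoff(mine, other):
    """Payoff in the ordinary one-shot prisoners' dilemma."""
    return Fraction(PAYOFF[(mine, other)])


def modified_payoff(mine, other):
    """Expected payoff when one player is picked at random (assumed 50/50)
    and the other player's move is overwritten to match the picked player's move."""
    half = Fraction(1, 2)
    return half * PAYOFF[(mine, mine)] + half * PAYOFF[(other, other)]


def dominant_move(payoff):
    """Return the strictly dominant move under the given payoff function, or None."""
    for move in MOVES:
        alternatives = [m for m in MOVES if m != move]
        if all(payoff(move, o) > payoff(alt, o) for alt in alternatives for o in MOVES):
            return move
    return None


if __name__ == "__main__":
    print("Standard game dominant move:", dominant_move(standard_payoff))
    print("Modified game dominant move:", dominant_move(modified_payoff))
```

With these assumed payoffs, defecting is strictly dominant in the standard game and cooperating is strictly dominant in the modified one, so a definition of rationality that treats the two as the same game has to be throwing away something important.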