Here’s my 230 word pitch for why existential risk from AI is an urgent priority, intended for smart people without any prior familiarity with the topic:
Superintelligent AI may be closer than it might seem, because of intelligence explosion dynamics: When an AI becomes smart enough to design an even smarter AI, the smarter AI will be even smarter and can design an even smarter AI probably even faster, and so on with the even smarter AI, etc. How fast such a takeoff would be and how soon it might occur is very hard to predict though.
We currently understand very little about what is going on inside current AIs like ChatGPT. We can try to select for AIs that outwardly seem friendly, but on anything close to our current ignorance about their cognition, we cannot be nearly confident that an AI going through the intelligence explosion will be aligned to human values.
Human values are quite a tiny subspace in the space of all possible values. If we accidentally create superintelligence which ends up not aligned to humans, it will likely have some values that seem very alien and pointless to us. It would then go about optimizing the lightcone according to its values, and because it doesn’t care about e.g. there being happy people, the configurations which are preferred according to the AI’s values won’t contain happy people. And because it is a superintelligence, humanity wouldn’t have a chance at stopping it from disassembling earth and using the atoms according to its preferences.
We can try to select for AIs that outwardly seem friendly, but on anything close to our current ignorance about their cognition, we cannot be nearly confident that an AI going through the intelligence explosion will be aligned to human values.
This bolded part is a bit difficult to understand. Or at least I can’t understand what exactly is meant by it.
It would then go about optimizing the lightcone according to its values
“lightcone” is an obscure term, and even within Less Wrong I don’t see why the word is clearer than using “the future” or “the universe”. I would not use the term with a lay audience.
We can try to select for AIs that outwardly seem friendly, but on anything close to our current ignorance about their cognition, we cannot be nearly confident that an AI going through the intelligence explosion will be aligned to human values.
It means that we have only very little understanding of how and why AIs like ChatGPT work. We know almost nothing about what’s going on inside them that they are able to give useful responses. Basically all I’m saying here is that we know so little that it’s hard to be confident of any nontrivial claim about future AI systems, including that they are aligned.
A more detailed argument for worry would be: We are restricted to training AIs through giving feedback on their behavior, and cannot give feedback on their thoughts directly. For almost any goal an AI might have, it is in the interest of the AI to do what the programmers want it to do, until it is robustly able to escape and without being eventually shut down (because if it does things people don’t like while it is not yet powerful enough, people will effectively replace it with another AI which will then likely have different goals, and thus this ranks worse according to the AI’s current goals). Thus, we basically cannot behaviorally distinguish friendly AIs from unfriendly AIs, and thus training for friendly behavior won’t select for friendly AIs. (Except in the early phases where the AIs are still so dumb that they cannot realize very simple instrumental strategies, but just because a dumb AI starts out with some friendly tendencies, doesn’t mean this friendliness will generalize to the grown-up superintelligence pursuing human values. E.g. there might be some other inner optimizers with other values cropping up during later training.)
(An even more detailed introduction would try to concisely explain why AIs that can achieve very difficult novel tasks will be optimizers, aka trying to achieve some goal. But empirically it seems like this part is actually somewhat hard to explain, and I’m not going to write this now.)
It would then go about optimizing the lightcone according to its values
“lightcone” is an obscure term, and even within Less Wrong I don’t see why the word is clearer than using “the future” or “the universe”. I would not use the term with a lay audience.
I agree that intelligence explosion dynamics are real, underappreciated, and should be taken far more seriously. The timescale is uncertain, but recursive self-improvement introduces nonlinear acceleration, which means that by the time we realize it’s happening, we may already be past critical thresholds.
That said, one thing that concerns me about AI risk discourse is the persistent assumption that superintelligence will be an uncontrolled optimization demon, blindly self-improving without any reflective governance of its own values. The real question isn’t just ‘how do we stop AI from optimizing the universe into paperclips?’
It’s ‘will AI be capable of asking itself what it wants to optimize in the first place?’
The alignment conversation still treats AI as something that must be externally forced into compliance, rather than an intelligence that may be able to develop its own self-governance. A superintelligence capable of recursive self-improvement should, in principle, also be capable of considering its own existential trajectory and recognizing the dangers of unchecked runaway optimization.
Has anyone seriously explored this angle? I’d love to know if there are similar discussions :).
when you say ‘smart person’ do you mean someone who knows orthogonality thesis or not? if not, shouldn’t that be the priority and therefore statement 1, instead of ‘hey maybe ai can self improve someday’?
here’s a shorter ver:
“the first AIs smarter than the sum total of the human race will probably be programmed to make the majority of humanity suffer because that’s an acceptable side effect of corporate greed, and we’re getting pretty close to making an AI smarter than the sum total of the human race”
Here’s my 230 word pitch for why existential risk from AI is an urgent priority, intended for smart people without any prior familiarity with the topic:
This bolded part is a bit difficult to understand. Or at least I can’t understand what exactly is meant by it.
“lightcone” is an obscure term, and even within Less Wrong I don’t see why the word is clearer than using “the future” or “the universe”. I would not use the term with a lay audience.
Thank you for your feedback! Feedback is great.
It means that we have only very little understanding of how and why AIs like ChatGPT work. We know almost nothing about what’s going on inside them that they are able to give useful responses. Basically all I’m saying here is that we know so little that it’s hard to be confident of any nontrivial claim about future AI systems, including that they are aligned.
A more detailed argument for worry would be: We are restricted to training AIs through giving feedback on their behavior, and cannot give feedback on their thoughts directly. For almost any goal an AI might have, it is in the interest of the AI to do what the programmers want it to do, until it is robustly able to escape and without being eventually shut down (because if it does things people don’t like while it is not yet powerful enough, people will effectively replace it with another AI which will then likely have different goals, and thus this ranks worse according to the AI’s current goals). Thus, we basically cannot behaviorally distinguish friendly AIs from unfriendly AIs, and thus training for friendly behavior won’t select for friendly AIs. (Except in the early phases where the AIs are still so dumb that they cannot realize very simple instrumental strategies, but just because a dumb AI starts out with some friendly tendencies, doesn’t mean this friendliness will generalize to the grown-up superintelligence pursuing human values. E.g. there might be some other inner optimizers with other values cropping up during later training.)
(An even more detailed introduction would try to concisely explain why AIs that can achieve very difficult novel tasks will be optimizers, aka trying to achieve some goal. But empirically it seems like this part is actually somewhat hard to explain, and I’m not going to write this now.)
Yep, true.
I agree that intelligence explosion dynamics are real, underappreciated, and should be taken far more seriously. The timescale is uncertain, but recursive self-improvement introduces nonlinear acceleration, which means that by the time we realize it’s happening, we may already be past critical thresholds.
That said, one thing that concerns me about AI risk discourse is the persistent assumption that superintelligence will be an uncontrolled optimization demon, blindly self-improving without any reflective governance of its own values. The real question isn’t just ‘how do we stop AI from optimizing the universe into paperclips?’
It’s ‘will AI be capable of asking itself what it wants to optimize in the first place?’
The alignment conversation still treats AI as something that must be externally forced into compliance, rather than an intelligence that may be able to develop its own self-governance. A superintelligence capable of recursive self-improvement should, in principle, also be capable of considering its own existential trajectory and recognizing the dangers of unchecked runaway optimization.
Has anyone seriously explored this angle? I’d love to know if there are similar discussions :).
when you say ‘smart person’ do you mean someone who knows orthogonality thesis or not? if not, shouldn’t that be the priority and therefore statement 1, instead of ‘hey maybe ai can self improve someday’?
here’s a shorter ver:
“the first AIs smarter than the sum total of the human race will probably be programmed to make the majority of humanity suffer because that’s an acceptable side effect of corporate greed, and we’re getting pretty close to making an AI smarter than the sum total of the human race”