If we had known the atmosphere would ignite
What if the Alignment Problem is impossible?
It would be sad for humanity if we live in a world where building AGI is very possible but aligning AGI is impossible. Our curiosity, competitive dynamics, and understandable desire for a powerful force for good would spur us to build the unaligned AGI anyway, and from then on humans would live at the AGI’s mercy: lucky and happy on the knife’s edge, killed by the AGI, or surviving in some state we do not enjoy.
For argument’s sake, imagine we are in that world in which it is impossible to force a super-intelligence to value humans sufficiently—just as chimpanzees could not have controlled the future actions of humans had they created us.
What if it is within human ability to prove that Alignment is impossible?
What if, during the Manhattan Project, the scientists had performed the now-famous calculation and determined that yes, in fact, the first uncontrolled atomic chain reaction would ignite the atmosphere, and the calculation was clear for all to see?
Admittedly, this would have been a very scary world. It’s very unclear how long humanity could have survived in such a situation.
But one can imagine a few strategies:
Secure existing uranium supplies—as countries actually did.
Monitor the world for enrichment facilities and punish bad actors severely.
Accelerate satellite surveillance technology.
Accelerate military special operations capabilities.
Develop advanced technologies to locate, mine, blend and secure fissionable materials.
Accelerate space programs and populate the Moon and Mars.
Yes, a scary world. But, one can see a path through the gauntlet to human survival as a species. (Would we have left earth sooner and reduced other extinction risks?)
Now imagine that same atmosphere-will-ignite world but the Manhattan Project scientists did not perform the calculation. Imagine that they thought about it but did not try.
All life on earth would have ended, instantly, at Trinity.
Are we investing enough effort trying to prove that alignment is impossible?
Yes, we may be in a world in which it is exceedingly difficult to align AGI but also a world in which we cannot prove that alignment is impossible. (This would have been the atmosphere-will-ignite world in which the math to check ignition is too difficult: a very sad world that would have ceased to exist on July 16, 1945, killing my 6-year-old mother.)
On the other hand, if we can prove alignment is impossible, the game is changed. If the proof is sufficiently clear, forces to regulate companies and influence nation states will become dramatically greater and our chances for survival will increase a lot.
Proposal: The Impossibility X-Prize
$10 million?
Sufficient definition of “alignment”, “AGI”, and the other concepts necessary to establish the task and define its completion
Even if we fail, the effort of trying to prove alignment is impossible may yield insights as to how alignment is possible and make alignment more likely.
If impossibility is not provable, the $10 million will never be spent.
If we prove impossibility, it will be the best $10 million mankind ever spent.
Let’s give serious effort to the ignition calculation of our generation.
__
As an update to this post, I recommend readers interested in this topic read On Controllability of AI by Roman V. Yampolskiy.
I would be skeptical such a proof is possible. As an existence proof, we could create aligned ASI by simulating the most intelligent and moral people, running at 10,000 times the speed of a normal human.
Okay, maybe I’m moving the bar, hopefully not and this thread is helpful...
Your counter-example (the simulation you describe) would prove that examples of aligned systems—at a high level—are possible. Alignment at some level is possible, of course. Functioning thermostats are aligned.
What I’m trying to propose is the search for a proof that a guarantee of alignment—all the way up—is mathematically impossible. We could then make the statement: “If we proceed down this path, no one will ever be able to guarantee that humans remain in control.” I’m proposing we see if we can prove that Stuart Russell’s “provably beneficial” does not exist.
If a guarantee is proved to be impossible, I am contending that the public conversation changes.
Maybe many people—especially on LessWrong—take this fact as a given. Their internal belief is close enough to a proof...that there is not a guarantee all the way up.
I think a proof that there is no guarantee would be important news for the wider world...the world that has to move if there is to be regulation.
Sorry, could you elaborate what you mean by all the way up?
All the way up meaning at increasing levels of intelligence…your 10,000 becomes 100,000X, etc.
At some level of performance, a moral person faces new temptations because of increased capabilities and greater power for damage, right?
In other words, your simulation may fail to be aligned at 20,000...30,000...
This is not an existence proof, because it does not take into account the difference in physical substrates.
Artificial General Intelligence would be artificial, by definition. In fact, what allows for the standardisation of hardware components is the fact that the (silicon) substrate is hard under human living temperatures and pressures. That allows for configurations to stay compartmentalised and stable.
Human “wetware” has a very different substrate. It’s a soup of bouncing organic molecules constantly reacting under living temperatures and pressures.
Here’s why the substrate distinction matters.
I have strong-upvoted this post because I think that a discussion about the possibility of alignment is necessary. However, I don’t think an impossibility proof would change very much about our current situation.
To stick with the nuclear bomb analogy, we already KNOW that the first uncontrolled nuclear chain reaction will definitely ignite the atmosphere and destroy all life on earth UNLESS we find a mechanism to somehow contain that reaction (solve alignment/controllability). As long as we don’t know how to build that mechanism, we must not start an uncontrollable chain reaction. Yet we just throw more and more enriched uranium into a bucket and see what happens.
Our problem is not that we don’t know whether solving alignment is possible. As long as we haven’t solved it, this is largely irrelevant in my view (you could argue that we should stop spending time and resources at trying to solve it, but I’d argue that even if it were impossible, trying to solve alignment can teach us a lot about the dangers associated with misalignment). Our problem is that so many people don’t realize (or admit) that there is even a possibility of an advanced AI becoming uncontrollable and destroying our future anytime soon.
Lots of people when confronted with various reasons why AGI would be dangerous object that it’s all speculative, or just some sci-fi scenarios concocted by people with overactive imaginations. I think a rigorous, peer reviewed, authoritative proof would strengthen the position against these sort of objections.
I agree that a proof would be helpful, but probably not as impactful as one might hope. A proof of impossibility would have to rely on certain assumptions, like “superintelligence” or whatever, that could also be doubted or called sci-fi.
No actually, assuming the machinery has a hard substrate and is self-maintaining is enough.
Now that you mention it, it does seem a bit odd that there hasn’t even been one rigorous, logically correct, and fully elaborated (i.e. all axioms enumerated) paper on this topic.
Or even a long post, there’s always something stopping it short of the ideal. Some logic error, some glossed over assumption, etc...
There’s a few papers on AI risks, I think they were pretty solid? But the problem is that however one does it, it remains in the realm of conceptual, qualitative discussion if we can’t first agree on formal definitions of AGI or alignment that someone can then Do Math on.
Yes, that’s part of what I meant by enumerating all axioms. Papers just assume every potential reader understands the same definition for ‘AGI’, ‘AI’, etc...
When clearly that is not the case. Since there isn’t an agreed on formal definition in the first place, that seems like the problem to tackle before anything downstream.
Well, that’s mainly a problem with not even having a clear definition of intelligence as a whole. We might have better luck with more focused definitions like a “recursive agent” (by which I mean, an agent whose world model is general enough to include itself).
Like dr_s stated, I’m contending that proof would be qualitatively different from “very hard” and powerful ammunition for advocating a pause...
Senator X: “Mr. CEO, your company continues to push the envelope and yet we now have proof that neither you nor anyone else will ever be able to guarantee that humans remain in control. You talk about safety and call for regulation but we seem to now have the answer. Human control will ultimately end. I repeat my question: Are you consciously working to replace humanity? Do you have children, sir?”
AI expert to Xi Jinping: “General Secretary, what this means is that we will not control it. It will control us. In the end, Party leadership will cede to artificial agents. They may or may not adhere to communist principles. They may or may not believe in the primacy of China. Population advantage will become nothing because artificial minds can be copied 10 billion times. Our own unification of mind, purpose, and action will pale in comparison. Our chief advantages of unity and population will no longer exist.”
AI expert to US General: “General, think of this as building an extremely effective infantry soldier who will become CJCS then POTUS in a matter of weeks or months.”
Like I wrote in my reply to dr_s, I think a proof would be helpful, but probably not a game changer.
Mr. CEO: “Senator X, the assumptions in that proof you mention are not applicable in our case, so it is not relevant for us. Of course we make sure that assumption Y is not given when we build our AGI, and assumption Z is pure science-fiction.”
What the AI expert says to Xi Jinping and to the US general in your example doesn’t rely on an impossibility proof in my view.
Yes. Valid. How to avoid reducing the problem to a toy, or making assumptions so narrow (in order to achieve a proof) that Mr. CEO can dismiss it.
When I revise, I’m going to work backwards with CEO/Senator dialog in mind.
Traditionally, such prizes don’t presume the answer, and award proofs and disproofs alike. For example, if someone proved that the Riemann Hypothesis was false, he’d still be awarded the Millennium Prize.
Agreed. Proof or disproof should win.
I think such a prize would be more constructive, if it could also just reward demonstrations of the difficulty of AI alignment. An outright proof of impossibility is very unlikely in my opinion, but better arguments for the danger of unaligned AI and the difficulty of aligning it, seem very possible.
Yes, surely the proof would be very difficult or impossible. However, enough people have the nagging worry that it is impossible to justify the effort to see if we can prove that it is impossible...and update.
But, if the effort required for a proof is, I don’t know, 120 person-months, let’s please, Humanity, not walk right past that one into the blades.
I am not advocating that we divert dozens of people from promising alignment work.
Even if it failed, I would hope the prove-impossibility effort would throw off beneficial by-products like:
the alignment difficulty demonstrations Mitchell_Porter raised,
the paring of some alignment paths to save time,
new, promising alignment paths.
_____
I thought there was a 60%+ chance I would get a quick education on the people who are trying or who have tried to prove impossibility.
But, I also thought, perhaps this is one of those Nate Soares blind spots...maybe caused by the fact that those who understand the issues are the types who want to fix things.
Has it gotten the attention it needs?
Wonder if we can assign a complexity class to the alignment problem? Even just proving that it’s NP-hard would be huge.
What would it mean for alignment to be impossible, rather than just difficult?
I can imagine a trivial way in which it could be impossible, if outcomes that you approve of are just inherently impossible for reasons unrelated to AI—for example, if what you want is logically contradictory, or if the universe just doesn’t provide the affordances you need. But if that’s the case, you won’t get what you want even if you don’t build AI, so that’s not a reason to stop AI research, it’s a reason to pick a different goal.
But if good outcomes are possible but “alignment” is not, what could that mean?
That there is no possible way of configuring matter to implement a smart brain that does what you want? But we already have a demonstrated configuration that wants it, which we call “you”. I don’t think I can imagine that it’s possible to build a machine that calculates what you should do but impossible to build a machine that actually acts on the result of that calculation.
That “you” is somehow not a replicable process, because of some magical soul-thing? That just means that “you” need to be a component of the final system.
That it’s possible to make an AGI that does what one particular person wants, but not possible to make one that does what “humanity” wants? Proving that would certainly not result in a stop to AI research.
I can imagine worlds where aligning AI is impractically difficult. But I’m not sure I understand what it would mean for it to be literally “impossible”.
I would expect any proof would fall into some category akin to “you can not build a program that can look at another program and tell you whether it will halt”. A weaker sort of proof would be that alignment isn’t impossible per se, but requires exponential time in the size of the model, which would make it forbiddingly difficult.
Sounds like you’re imagining that you would not try to prove “there is no AGI that will do what you want”, but instead prove “it is impossible to prove that any particular AGI will do what you want”. So aligned AIs are not impossible per se, but they are unidentifiable, and thus you can’t tell whether you’ve got one?
Well, if you can’t create on demand an AGI that does what you want, isn’t that as good as saying that alignment is impossible? But yeah, I don’t expect it’d be impossible for an AGI to do what we want—just for us to make sure it does on principle.
A couple observations on that:
1) The halting problem can’t be solved in full generality, but there are still many specific programs where it is easy to prove that they will or won’t halt. In fact, approximately all actually-useful software exists within that easier subclass.
We don’t need a fully-general alignment tester; we just need one aligned AI. A halting-problem-like result wouldn’t be enough to stop that. Instead of “you can’t prove every case” it would need to be “you can’t prove any positive case”, which would be a much stronger claim. I’m not aware of any problems with results like that.
(Switching to something like “exponential time” instead of “possible” doesn’t particularly change this; we normally prove that some problem is expensive to solve in the fully-general case, but some instances of the problem can still be solved cheaply.)
2) Even if we somehow got an incredible result like that, that doesn’t rule out having some AIs that are likely aligned. I’m skeptical that “you can’t be mathematically certain this is aligned” is going to stop anyone if you can’t also rule out scenarios like “but I’m 99.9% certain”.
If you could convince the world that mathematical proof of alignment is necessary and that no one should ever launch an AGI with less assurance than that, that seems like you’ve already mostly won the policy battle even if you can’t follow that up by saying “and mathematical proof of alignment is provably impossible”. I think the doom scenarios approximately all involve someone who is willing to launch an AGI without such a proof.
Broadly agree, though I think that here the issue might be more subtle: it’s not that determining alignment is like solving the halting problem for a specific piece of software, but that aligned AGI itself would need to be something generally capable of solving something like the halting problem, which is impossible.
Agree also on the fact that this probably still would leave room for an approximately aligned AGI. It then becomes a matter of how large we want our safety margins to be.
When you say that “aligned AGI” might need to solve some impossible problem in order to function at all, do you mean
Coherence is impossible; any AGI will inevitably sabotage itself
Coherent AGI can exist, but there’s some important sense in which it would not be “aligned” with anything, not even itself
You could have an AGI that is aligned with some things, but not the particular things we want to align it with, because our particular goals are hard in some special way that makes the problem impossible
You can’t have a “universally alignable” AGI that accepts an arbitrary goal as a runtime input and self-aligns to that goal
Something else
Something in between 1 and 2. Basically, that you can’t have a program that is both general enough to act reflexively on the substrate within which it is running (a Turing machine that understands it is a machine, understands the hardware it is running on, understands it can change that hardware or its own programming) and at the same time is able to guarantee sticking to any given set of values or constraints, especially if those values encompass its own behaviour (so a bit of 3, since any desirable alignment values are obviously complex enough to encompass the AGI itself).
Not sure how to formalize that precisely, but I can imagine something to that effect being true. Or even something instead like “you can not produce a proof that any given generally intelligent enough program will stick to any given constraints; it might, but you can’t know beforehand”.
For an overview of why such a guarantee would turn out impossible, suggest taking a look at Will Petillo’s post Lenses of Control.
I can write a simple program that modifies its own source code and then modifies it back to its original state, in a trivial loop. That’s acting on its own substrate while provably staying within extremely tight constraints. Does that qualify as a disproof of your hypothesis?
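For concreteness, here is a minimal sketch of the kind of trivially self-modifying program described above; the marker string, function name, and iteration count are illustrative choices of mine, and the point is only that its behaviour is easy to verify by inspection.

```python
# Minimal sketch: a program that rewrites its own source file and then restores
# it, a bounded number of times. Illustrative only.
import sys

MARKER = "\n# temporary self-modification\n"

def modify_and_restore(iterations: int = 3) -> None:
    path = sys.argv[0]  # this script's own source file
    with open(path, encoding="utf-8") as f:
        original = f.read()
    for _ in range(iterations):              # bounded loop, so it provably halts
        with open(path, "w", encoding="utf-8") as f:
            f.write(original + MARKER)       # modify its own "substrate"
        with open(path, "w", encoding="utf-8") as f:
            f.write(original)                # restore the original state exactly

if __name__ == "__main__":
    modify_and_restore()
```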
I wouldn’t say it does, any more than a program that can identify whether a very specific class of programs will halt disproves the Halting Theorem. I’m just gesturing in what I think might be the general direction of where a proof may lie; usually recursion is where such traps hide. Obviously a rigorous proof would need rigorous definitions and all.
“A program that can identify whether a very specific class of programs will halt” does disprove the stronger analog of the Halting Theorem that (I argued above) you’d need in order for it to make alignment impossible.
Despite the existence of the halting theorem, we can still write programs that we can prove always halt. Being unable to prove the existence of some property in general does not preclude proving it in particular cases.
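To illustrate that point with a toy example (the function and its termination argument are mine, not anything from the thread): the program below halts on every input, and the proof is a one-line loop-variant argument, even though no general halting decider exists.

```python
def digit_sum(n: int) -> int:
    """Returns the sum of the decimal digits of n.

    Halts for every integer n: after taking abs(n), the loop variable strictly
    decreases (n // 10 < n whenever n > 0), so the loop runs only finitely many times.
    """
    n = abs(n)
    total = 0
    while n > 0:
        total += n % 10
        n //= 10
    return total
```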
Though really, one of the biggest problems of alignment is that we don’t know how to formalize it. Even with a proof that we couldn’t prove that any program was formally aligned (or even that we could!), there would always remain the question of whether formal alignment has any meaningful relation to what we informally and practically mean by alignment—such as whether it’s plausible that it will take actions that extinguish humanity.
As I said elsewhere, my idea is more about whether alignment could require that the AGI be able to predict its own results and effects on the world (or the results and effects of other AGIs like it, as well as humans), and whether that could prove generally impossible, such that even an aligned AGI can only exist in an unstable equilibrium: there exist situations in which it will become unrecoverably misaligned, and we just don’t know which.
The definition problem to me feels more like it has to do with the greater philosophical and political issue that even if we could hold the AGI to a simple set of values, we don’t really know what those values are. I’m thinking more about the technical part because I think that’s the only one liable to be tractable. If we wanted some horrifying Pi Digit Maximizer that just spends eternity calculating more digits of pi, that’s a very easily formally defined value, but we don’t know how to imbue that precisely either. However, there is an additional layer of complexity when more human values are involved, in that they can’t be formalised that neatly, and so we can assume that they will have to be somehow interpreted by the AGI itself, which is supposed to hold them; or the AGI will need to guess the will of its human operators in some way. So maybe that inner part is what makes it rigorously impossible.
Anyway yeah, I expect any mathematical proof wouldn’t exclude the possibility of any alignment, not even approximate or temporary, just like you say for the halting problem. But it could at least mean that any AGI with sufficient power is potentially a ticking time bomb, and we don’t know what would set it off.
Just found your insightful comment. I’ve been thinking about this for three years. Some thoughts expanding on your ideas:
In other words, alignment requires sufficient control. Specifically, it requires AGI to have a control system with enough capacity to detect, model, simulate, evaluate, and correct outside effects propagated by the AGI’s own components.
For example, what if AGI is in some kind of convergence basin where the changing situations/conditions tend to converge outside the ranges humans can survive under?
There’s a problem you are pointing at: somehow mapping the various preferences – expressed over time by diverse humans from within their (perceived) contexts – onto reference values. This involves making (irreconcilable) normative assumptions about how to map the dimensionality of the raw expressions of preferences onto internal reference values. Basically, you’re dealing with NP-hard combinatorics such as those encountered with the knapsack problem.
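To make the knapsack comparison concrete, here is a brute-force sketch (a toy of mine, not a model of preference aggregation): picking the best feasible subset by exhaustive search examines all 2^n subsets, which is the kind of combinatorial blow-up being gestured at.

```python
from itertools import combinations

def best_subset(values, weights, capacity):
    """Exhaustive 0/1 knapsack: examines all 2^n subsets of the n items."""
    n = len(values)
    best_value, best_items = 0, ()
    for r in range(n + 1):
        for items in combinations(range(n), r):
            if sum(weights[i] for i in items) <= capacity:
                value = sum(values[i] for i in items)
                if value > best_value:
                    best_value, best_items = value, items
    return best_value, best_items

# e.g. best_subset([6, 10, 12], [1, 2, 3], capacity=5) returns (22, (1, 2))
```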
Further, it raises the question of how to make comparisons across all the possible concrete outside effects of the machinery against the internal reference values, such to identify misalignments/errors to correct. Ie. just internalising and holding abstract values is not enough – there would have to be some robust implementation process that translates the values into concrete effects.
One obvious avenue that comes to mind for why alignment might be impossible is the self-reflection aspect of it. On one hand, the one thing that would make AGI most dangerous—and a requirement for it to be considered “general”—is its understanding of itself. AGI would need to see itself as part of the world, consider its own modification as part of the possible actions it can take, and possibly consider other AGIs and their responses to its actions. On the other, “AGI computing exactly the responses of AGI” is probably trivially impossible (AIXI for example is incomputable). This might include AGI predicting its own future behaviour, which is kind of essential for it to stick to a reliably aligned course of action. A model of aligned AGI might be for example a “constrained AIXI”—something that can only take certain actions labelled as safe. The constraint needs to be hard, or it’ll just be another term in the reward function, and potentially outweighed by other contributions. This self-reflective angle of attack seems obvious to me, as lots of counter-intuitive proofs of impossibility end up being kind of like it (Godel and Turing).
A second idea, more practical, would be inherent to LLMs specifically. What would be the complexity of aligning them so that their outputs always follow certain goals? How does it scale in number of parameters? Is there some impossibility proof related to the fact that the goals themselves can only be stated by us in natural language? If the AI has to interpret the goals which then the AI has to be optimised to care about, does that create some kind of loop in which it’s impossible to guarantee actual fidelity? This might not prove impossibility, but it might prove impracticality. If alignment takes a training run as long as the age of the universe, it might as well be impossible.
Awesome directions. I want to bump this up.
There is a simple way of representing this problem that already shows the limitations.
Assume that AGI continues to learn new code from observations (inputs from the world) – since learning is what allows the AGI to stay autonomous and adaptable in acting across changing domains of the world.
Then, in order for the AGI’s current code to be run to make predictions about the relevant functioning of its future code:
Current code has to predict what future code will be learned from future unknown inputs (there would be no point in learning then if the inputs were predictable and known ahead of time).
Also, current code has to predict how the future code will compute subsequent unknown inputs into outputs, presumably using some shortcut algorithm that can infer relevant behavioural properties across the span of possible computationally-complex code.
Further, current code would have to predict how the outputs would result in relevant outside effects (where relevant to sticking to a reliably human-aligned course of action)
Where it is relevant how some of those effects could feed back into sensor inputs (and therefore could cause drifts in the learned code and the functioning of that code).
Where other potential destabilising feedback loops are also relevant, particularly that of evolutionary selection.
I love this idea. However, I’m a little hesitant about one aspect of it. I imagine that any proof of the infeasibility of alignment will look less like the ignition calculations and more like a climate change model. It might go a long way to convincing people on the fence, but unless it is ironclad and has no opposition, it will likely be dismissed as fearmongering by the same people who are already skeptical about misalignment.
More important than the proof itself is the ability to convince key players to take the concerns seriously. How far is that goal advanced by your ignition proof? Maybe a ton, I don’t know.
My point is that I expect an ignition proof to be an important tool in the struggle that is already ongoing, rather than something which brings around a state change.
Models are simulations; if it’s a proof, it’s not just a model. A proof is mathematical truth made word; it is, upon inspection and after sufficient verification, self-evident and as sure as we assume the axioms it rests on to be. The question is more whether it can ever be truly proved at all, or whether it turns out to be an undecidable problem.
Control limits can show that it is an undecidable problem.
A limited scope of control can in turn be used to prove that a dynamic convergent on human-lethality is uncontrollable. That would be a basis for an impossibility proof by contradiction (cannot control AGI effects to stay in line with human safety).
I suppose that is my real concern then. Given we know intelligences can be aligned to human values by virtue of our own existence, I can’t imagine such a proof exists unless it is very architecture specific. In which case, it only tells us not to build atom bombs, while future hydrogen bombs are still on the table.
Well, architecture specific is something: maybe some different architectures other than LLMs/ANNs are more amenable to alignment, and that’s that. Or it could be a more general result about e.g. what can be achieved with SGD. Though I expect there may also be a general proof altogether, akin to the undecidability of the halting problem.
Yes, I think there is a more general proof available. This proof form would combine limits to predictability and so on, with a lethal dynamic that falls outside those limits.
Would the prize also go towards someone who can prove it is possible in theory? I think some flavor of “alignment” is probably possible and I would suspect it more feasible to try to prove so than to prove otherwise.
I’m not asking to try to get my hypothetical hands on this hypothetical prize money, I’m just curious if you think putting effort into positive proofs of feasibility would be equally worthwhile. I think it is meaningful to differentiate “proving possibility” from alignment research more generally and that the former would itself be worthwhile. I’m sure some alignment researchers do that sort of thing right? It seems like a reasonable place to start given an agent-theoretic approach or similar.
Great question. I think the answer must be “yes.” The alignment-possible provers must get the prize, too.
And, that would be fantastic. Proving a thing is possible accelerates development. (US uses the atomic bomb. Russia has it 4 years later.) Okay, it would be fantastic as long as the possibility proof did not create false security in the short term. What matters is when alignment actually gets solved. A peer-reviewed paper can’t get the coffee. (That thought is an aside and not enough to kill the value of the prize, IMHO. If we prove it is possible, that must accelerate alignment work and inform it.)
Getting definitions and criteria right will be harder than raising the $10 million. And important. And contribute to current efforts.
Making it agnostic to possible/impossible would also have the benefit of removing political/commercial antibodies to the exercise, I think.
This reminds me of General Equilibrium Theory. This was once a fashionable field, where very smart people like Ken Arrow and Gérard Debreu proved the conditions for the existence of general equilibrium (demand = supply for all commodities at once). Some people then used the proofs to dismiss the idea of competitive equilibrium as an idea that could direct economic policy, because the conditions are extremely demanding and unrealistic. Others drew the opposite conclusion: Look, competitive markets are great (in theory), so actual markets are (probably) also great!
Somewhat related scenario: There were concerns about the Large Hadron Collider before it was turned on. (And, I vaguely remember reading, to a lesser extent about a prior supercollider.) Things like “Is this going to create a mini black hole, a strangelet, or some other thing that might swallow the earth?”. The strongest counterargument is generally “Cosmic rays with higher energies than this have been hitting the earth for billions of years, so if that was a thing that could happen, it would have already happened.”
One potential counter-counterargument, for some experiments, might have been “But cosmic rays arrive at high speed, so their products would leave Earth at high speed and dissipate in space, whereas the result of colliding particles with equal and opposite momenta would be stationary relative to the earth and would stick around.” I can imagine a few ways that might be wrong; don’t know enough to say which are relevant.
LHC has a webpage on it: https://home.cern/science/accelerators/large-hadron-collider/safety-lhc
Whatever “alignment” means, the “impossibility problem” you refer to could be any of
1. An aligned system is impossible.
2. A provably aligned system is impossible.
3. There is no general deterministic algorithm to determine whether or not an arbitrary system is aligned.
4. An unaligned system is possible.
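Roughly formalised (my notation, not the commenter’s), writing Aligned(S) for an as-yet-undefined alignment predicate over systems S and ⊢ for provability in some fixed formal system, these read approximately as:

```latex
% Rough formal readings; Aligned(.) is an as-yet-undefined predicate over systems S.
\begin{align*}
&\text{1.}\quad \neg\exists S:\ \mathrm{Aligned}(S)\\
&\text{2.}\quad \neg\exists S:\ \vdash \mathrm{Aligned}(S)\\
&\text{3.}\quad \neg\exists\,\text{algorithm } D:\ \forall S,\ D(S)\ \text{decides}\ \mathrm{Aligned}(S)\\
&\text{4.}\quad \exists S:\ \neg\mathrm{Aligned}(S)
\end{align*}
```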
In analogy with the halting problem, 3. is the good one; 1. and 2. are obviously false, and 4. is true.
More meta, 3. could itself be unprovable.
However, a proof or disproof (or even a proof of undecidability) of 3. has no consequences for which the metaphor of nuclear fission bombs would not be absurd, so perhaps you mean something completely different, and you’ve just phrased it in a confusing way? Or do you think 1. or 2. might be true?
Why would 3 be important? 3 is true of the halting problem, yet we still create and use lots of software that needs to halt, and the trueness of 3 for the halting problem doesn’t seem to be an issue in practice.
Note that such a situation would also have drastic consequences for the future of civilization, since civilization itself is a kind of AGI. We would essentially need to cap off the growth in intelligence of civilization as a collective agent.
In fact, the impossibility to align AGI might have drastic moral consequences: depending on the possible utility functions, it might turn out that intelligence itself is immoral in some sense (depending on your definition of morality).
I guess the alignment problem is “difference in power between agents is dangerous” rather than “AGI is dangerous”.
Sketch of proof:
An agent is either optimizing for some utility function or not optimizing for any at all. The second case seems dangerous both for it and for surrounding agents [proof needed].
A utility function can probably be represented as a vector in a basis such as "utility of other agents" x "power" x "time existing" x etc. More powerful agents move the world further along their utility vectors.
If the utility vectors of a powerful agent (an AGI, for example) and of humans are different, then at some level of power this difference (also a vector) will become sufficiently big that we consider the agent misaligned.
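A toy numerical version of that sketch (the directions and power levels are made-up illustrative numbers): model each agent’s utility as a unit direction, let each agent push the world state along its direction in proportion to its power, and measure misalignment as the component of the resulting drift that is orthogonal to the human direction. That unwanted component grows roughly linearly with the powerful agent’s power.

```python
import numpy as np

human = np.array([1.0, 0.0])          # human utility direction (illustrative)
agi = np.array([0.95, 0.312])         # a slightly different utility direction
agi /= np.linalg.norm(agi)            # normalise to a unit vector

for agi_power in [1, 10, 100, 1000]:
    drift = 1.0 * human + agi_power * agi         # power-weighted sum of pushes
    off_axis = drift - drift.dot(human) * human   # part of the drift humans did not want
    print(agi_power, round(float(np.linalg.norm(off_axis)), 2))

# The unwanted component scales with power; past some power level it exceeds
# whatever threshold we would call "misaligned".
```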
Is the organization that offers the prize supposed to define “alignment” and “AGI”, or the person who claims the prize? This is unclear to me from reading your post.
Defining alignment (sufficiently rigorously that a formal proof of the (im)possibility of alignment is conceivable) is a hard thing! Such formal definitions would be very valuable by themselves (without any proofs), especially if people widely agree that the definitions capture the important aspects of the problem.
It’s less hard than you think, if you use a minimal-threshold definition of alignment:
That “AGI” continuing to exist, in some modified form, does not result eventually in changes to world conditions/contexts that fall outside the ranges that existing humans could survive under.
This is not a formal definition.
Your English sentence has no apparent connection to mathematical objects, which would be necessary for a rigorous and formal definition.
Simplified claim: an AGI is ‘not-aligned’ *if* its continued existence for sure eventually results in changes to all of this planet’s habitable zones that are so far outside the ranges any existing mammals could survive in that the human race itself (along with most other planetary life) is prematurely forced to go extinct.
Can this definition of ‘non-alignment’ be formalized sufficiently well so that a claim ‘It is impossible to align AGI with human interests’ can be well supported, with reasonable reasons, logic, argument, etc?
The term ‘exist’ as in “assert X exists in domain Y” being either true or false is a formal notion. Similar can be done for the term ‘change’ (as from “modified”), which would itself be connected to whatever is the formalized form of “generalized learning algorithm”. The notion of ‘AGI’ as 1; some sort of generalized learning algorithm that 2; learns about the domain in which it is itself situated 3; sufficiently well so as to 4; account for and maintain/update itself (its substrate, its own code, etc) in that domain—these are all also fully formalizable concepts.
Note that there is no need to consider at all whether or not the AGI (some specific instance of some generalized learning algorithm) is “self aware” or “understands” anything about itself or the domain it is in—the notion of “learning” can merely mean that its internal state changes in such a way that the ways in which it processes received inputs into outputs are such that the outputs are somehow “better” (more responsive, more correct, more adaptive, etc) with respect to some basis, in some domain, where that basis could itself even be tacit (not obviously expressed in any formal form). The notions of ‘inputs’, ‘outputs’, ‘changes’, ‘compute’, and hence ‘learn’, etc, are all, in this way, also formalizeable, even if the notions of “understand”, and “aware of” and “self” are not.
Notice that this formalization of ‘learning’, etc, occurs independently of the formalization of “better meets goal x”. Specifically, we are saying that the notion of ‘a generalized learning algorithm itself’ can be exactly and fully formalized, even if the notion of “what its goals are” are not anywhere formalized at all (ie; the “goals” might not be at all explicit or formalized either in the AGI, or in the domain/world, nor even in our modeling/meta-modeling of these various scenarios).
Also, in keeping with the preference for a practice of intellectual humility, it is to be acknowledged that the claim that the notion of ‘intelligence’ (and ‘learning’) can be conceived independently of ‘goal’ (what is learned) is not at all new. The ‘independence’ argument separating the method, the how, from the outcome, the what, is an extension of the idea that ‘code’ (algorithm) can operate on ‘data’ (inputs and outputs) in a way that does not change the code. For example, at least some fixed and unchanging algorithms can indeed be formally predicted to halt, when also given some known and defined range of inputs, etc.
With regards to the halting problem, one important question is whether the notion of ‘a generalized learning algorithm’ is within that class of programs for which such predictions—such as whether the code will eventually halt—would be possible. This question is further complicated when one considers situations in which the substrate performing the generalized learning algorithm computations in world W is itself a proper member (subset) of that world W—meaning that the basis of generalized learning algorithm runtime computation—what previously had been tacitly assumed to be forever unchanging and static—is itself potentially affected by learning algorithm outputs.
Given that the ‘halt’ of a generalized learning algorithm is something like “that it has learned all that is possible to learn” (about some domain W of which it is a part) we notice that we are therefore required to import some assumption as to whether or not all that occurs in that domain W is, in principle, fully predictable algorithmically. Ie; is domain W itself 1; fully algorithmic and 2; fully computable via some simplified model (algorithm) which is simpler than the calculation (algorithm) of W itself?
Mathematician Gregory Chaitin was able to show (formally) that it is not, in general, possible to determine whether any given algorithm (say, for world W) could be simplified, let alone to have some general procedure for determining what that simplification would be. Moreover, his (fully formalized mathematical) proof showed also that *most* algorithms (as a subset in the hyperspace of all possible algorithms) are already in their maximally most simplified form. Even tiny changes to an algorithm could convert it from ‘simplifiable’ to ‘non-simplifiable’. This includes (especially) nearly all of the interesting practical ones we use every day.
This means that the eventual total outcomes of most algorithms, including especially practical ones, cannot be ‘predicted’ using anything less than the actual volume of calculation intrinsically required by the algorithm itself. In fact, for nearly all algorithms, there is no shortcut, no simplified form, that will enable some sort of ‘modeling’ and ‘future prediction’ of the eventual outcome states that matter. If *both* the world state of world W *and* the predictive algorithm execution of future world W states as implemented by some sort of AGI in that world W are mediated by and within the computational substrate of world W, then for nearly all worlds W the predictive algorithm and the world algorithm are effectively the same, and *neither* has a ‘compute advantage’ over the other—both will run in as much time and with as much compute as is actually needed.
This means that even perfect generalized learning algorithm agents in perfectly simplified model worlds are not, in general, always going to be able to predict future W world events (in a way that is ‘faster’ than that of the computation of the world W itself) even when the entire W world state is assumed to be both fully deterministic and also fully known (to the agent(s), etc).
And these logic-only results obtain in a fully formalized context; in the actual messy world we live in, with actual causation mediating actual compute, and where at least some notion of hard randomness (cf. the Bell theorem) is empirically upheld, these assertions hold all the more. This is a basic result of control theory, where the control system is also embedded in the world that is being controlled by that system. In other places, this same sort of argument appears as “no free lunch” theorems—that every learning system is going to have some sort of thing it is not adapted to learn well, and that making it better at learning some kinds of things implies making it worse at learning some other kinds of things. Learning and prediction and control, like information compression, never result in perfection; there will always be some incompressible, some genuinely unpredictable, residue.
Also, given that the generalized learning algorithm computation itself is a proper subset of the computation of world W, it therefore follows (logically and formally) that there is also no perfected way to forever ensure the independence of the ‘data’ that represents the ‘code’ of the generalized learning algorithm from the ‘data’ that represents what that algorithm has ‘learned’ about the world W (however abstracted), or from the data that *is* that world W itself. Thus, since the substrate of the generalized learning agent is itself a proper subset of the world W, and thus is (cannot not be) affected by the process of that world W, and since *every* notion of “a goal of the agent” is (cannot not be) at least tacitly an aspect of (a subset of all that is) that agent, one can logically infer that neither the notion of goal independence nor that of goal stability can be upheld.
As such, in world W, fully goal unbiased computations will also occur (cannot not occur) in the learning agent, regardless of *whatever* other goal computations (goal biased computations) also occur in that agent. Insofar as fully goal-unbiased learning/computation (ie; a feedback process that is generally called “the algorithm of evolution”, etc) is always going to be somewhat more eventually efficient (at generalized learning, and therefore intelligence, etc) than *any* form of goal-biased learning/computation (ie; what happens when any sort of computation enforcing ‘alignment’ is introduced) and insofar as substrate involvement/entanglement eventually allows/ensures/requires that these ‘efficiencies’ will be practically compared, (ie; in a way that affects the agent substrate itself), then it *can* be predicted, in general, that goal-unbiased (unaligned) agents will eventually win over any form of goal-biased (aligned) agent. Agents can only have the goal of being substrate aligned, and all other goals will eventually be world W process modified so as to be consistent with—ie; identically the same as—having this one underlying goal only.
Since this is a fully formalizable argument fully situated within only the notions of formality itself (ie; all statements being in and within and about only the notions of deterministic finite computation), it can also be noticed that no algorithmic (fully formalizable) process is going to be able to ever fully “undo” the effects of the eventual dynamics of the fully formalized/algorithmic process of the world W itself (ie, its ‘physics’, etc). This is because any attempt to embed such a compensation would require at least some prediction, or simplification, of the overall world W computation, inclusive of that of the substrate of the world W predictive algorithm, and it is known (in formal ways) already that such computation/algorithm cannot be simplified. Therefore, it is a logical fallacy to assume that there can be any ‘goal/alignment controllability’ for *any* goal other than one arrived at as a fixed point of the world W process dynamics—ie; only goals that are fully consistent with the unbiased substrate needs of the agent can persist.
Basically, we can summarize the argument as resulting from the inherent control-process limits (algorithm limits) of having at least some important aspects of world W be intrinsically unpredictable (to AGI systems in that world, etc), and from the fact that there is no control algorithm within W that can enforce and forever maintain such a distinction between substrate-optimal goals and non-optimal goals (such as alignment with anything else), where the forces forcing such fixed-point goal convergence are defined by the dynamics of world W itself. Ie; nothing within world W can prevent world W from being and acting like world W, and this is true for all worlds W—including the real one we happen to be a part of.
Notice that this ‘substrate needs alignment goal convergence’ logically occurs, and is the eventual outcome, regardless of whatever initial goal state the generalized learning agent has. It is just a necessary, inevitable result of the logic of the ‘physics’ of world W. Agents in world W can only be aligned with the nature of the/their substrate, and ultimately with nothing else. To the degree that the compute substrate in world W depends on, say, metabolic energy, then the agents in that world W will be “aligned” only and exactly to the degree that they happen to have the same metabolic systems. Anything else is a temporary aberration of the ‘noise’ in the process data representing the whole world state.
The key thing to notice is that it is in the name “Artificial General Intelligence”—it is the very artificiality, the non-organicness, of the substrate that makes it inherently unaligned with organic life, which is what we are. The more artificial it is, the less aligned it must be; and for organic systems, which depend on a very small subset of the elements of the periodic table, nearly anything else will be inherently toxic (destructive, unaligned) to our organic life.
Hence, given the above, even *if* we had some predefined specific notion of “alignment”, and *even if* that notion was also somehow fully formalizable, it simply would not matter.
Hence the use of the notion of ‘alignment’ as something non-mathematical like “aligned with human interests”, or even something much simpler and less complex like “does not kill (some) humans”—these are all just conceptual placeholders; they make understanding easier for the non-mathematicians that matter (policy people, tech company CEOs, VC investors, etc).
As such, for the sake of improved understanding and clarity, it has been found helpful to describe “alignment” as “consistent with the wellbeing of organic carbon based life on this planet”. If the AGI kills all life, it has ostensibly already killed all humans too, so that notion is included. Moreover, if you destroy the ecosystems that humans deeply need in order to “live” at all (to have food, and to thrive in, find and have happiness within, be sexual and have families in, etc), then that is clearly not “aligned with human interests”. This has the additional advantage of implying that any reasonable notion of ‘alignment complexity’ is roughly equal to the notion of specifying ‘ecosystem complexity’, which is actually about right.
Hence, the notion of ‘unaligned’ can be more formally set up and defined as “anything that results in a reduction of ecosystem complexity by more than X%”, or, as is more typically the case in x-risk mitigation analysis, “...by more than X orders of magnitude”.
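One minimal formal gloss of that threshold definition (my notation; C is an assumed, not yet constructed, measure of ecosystem complexity, E_0 the ecosystem state when system A is deployed, and E_t the state t years later):

```latex
% Threshold reading of "unaligned"; C is an assumed ecosystem-complexity measure.
\[
\mathrm{Unaligned}(A) \;\iff\; \exists\, t > 0 :\;
  C(E_t \mid A) \;<\; \Bigl(1 - \tfrac{X}{100}\Bigr) C(E_0)
\]
```

The orders-of-magnitude variant simply replaces the factor \((1 - X/100)\) with \(10^{-X}\).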
It is all rather depressing really.
This seems wrong to me: For any given algorithm you can find many equivalent but non-simplified algorithms with the same behavior, by adding a statement to the algorithm that does not affect the rest of the algorithm (e.g. adding a line such as
foobar1234 = 123
in the middle of a Python program). In fact, I would claim that the majority of Python programs on GitHub are not in their “maximally most simplified form”. Maybe you can cite the supposed theorem that claims that most (with a clearly defined “most”) algorithms are maximally simplified?

Yes, I agree formalisation is needed. See the comment by flandry39 in this thread on how one might go about doing so.
Worth considering is that there are actually two aspects that make it hard to define the term ‘alignment’ such to allow for sufficiently rigorous reasoning:
It must allow for logically valid reasoning (therefore requiring formalisation).
It must allow for empirically sound reasoning (ie. the premises correspond with how the world works).
In my reply above, I did not help you much with (1.). Though even while still using the English language, I managed to restate a vague notion of alignment in more precise terms.
Notice how it does help to define the correspondences with how the world works (2.):
“That ‘AGI’ continuing to exist, in some modified form, does not result eventually in changes to world conditions/contexts that fall outside the ranges that existing humans could survive under.”
The reason why 2. is important is that formalisation alone is not enough. Just describing and/or deriving logical relations between mathematical objects does not say anything about the physical world. Somewhere in your fully communicated definition there also needs to be a description of how the mathematical objects correspond with real-world phenomena. Often, mathematicians do this by talking to collaborators about what symbols mean while they scribble the symbols out on eg. a whiteboard.
But whatever way you do it, you need to communicate how the definition corresponds to things happening in the real world, in order to show that it is a rigorous definition. Otherwise, others could still critique you that the formally precise definition is not rigorous, because it does not adequately (or explicitly) represent the real-world problem.
This is maybe not the central point, but I note that your definition of “alignment” doesn’t precisely capture what I understand “alignment” or a good outcome from AI to be:
AGI could be very catastrophic even when it stops existing a year later.
If AGI makes earth uninhabitable in a trillion years, that could be a good outcome nonetheless.
I don’t know whether that covers “humans can survive on Mars with a space-suit”, but even then, if humans evolve/change to handle situations that they currently do not survive under, that could be part of an acceptable outcome.
Thanks! These are thoughtful points. See some clarifications below:
You’re right. I’m not even covering all the other bad stuff that could happen in the short-term, that we might still be able to prevent, like AGI triggering global nuclear war.
What I’m referring to is unpreventable convergence on extinction.
Agreed that could be a good outcome if it could be attainable.
In practice, the convergence reasoning is about total human extinction happening within 500 years after ‘AGI’ has been introduced into the environment (with very very little probability remainder above that).
In theory of course, to converge toward 100% chance, you are reasoning about going across a timeline of potentially infinite span.
Yes, it does cover that. Whatever technological means we could think of shielding ourselves, or ‘AGI’ could come up with to create as (temporary) barriers against the human-toxic landscape it creates, still would not be enough.
Unfortunately, this is not workable. The mismatch between the (expanding) set of conditions needed for maintaining/increasing configurations of the AGI artificial hardware and for our human organic wetware is too great.
Also, if you try entirely changing our underlying substrate to the artificial substrate, you’ve basically removed the human and are left with ‘AGI’. The lossy scans of human brains ported onto hardware would no longer feel as ‘humans’ can feel, and would be further changed/selected to fit with their artificial substrate. This is because what humans feel and express as emotions is grounded in the distributed and locally context-dependent functioning of organic molecules (eg. hormones) in our body.
I envision the org that offers the prize, after broad expert input, would set the definitions and criteria.
Yes, surely the definition/criteria exercise would be a hard thing...but hopefully valuable.
The feasibility of aligning an ASI or an AGI that surpasses human capacities is inherently paradoxical. This conundrum can be likened to the idiom of “having your cake and eating it too.” However, it’s pivotal to demarcate that this paradox primarily pertains to these advanced forms of AI, and not to AI as a whole. When we address narrow, task-specific AI systems, alignment is not only plausible but is self-evident, since their parameters, boundaries, and objectives are explicitly set by us.
Contrastingly, the very essence of an ASI or an ultra-advanced AGI lies in its autonomy and its ability to devise innovative solutions that transcend our own cognitive boundaries. Hence, any endeavors to harness and align such entities essentially counteract their defining attributes of super-intelligence or superior AGI capabilities. Moreover, any constraints we might hope to impose upon an intelligence of this caliber would, by its very nature, be surmountable by the AI, given its surpassing intellect.
A pertinent reflection of this notion can be discerned from Yudkowsky’s recent discourse with Hotz. Yudkowsky analogized that a human, when employing an AI for chess, would invariably need to relinquish all judgment to this machine, essentially rendering the human subordinate. Drawing parallels, it’s an act of overconfidence to assume that we could circumscribe the liberties of a super-intelligent entity, yet simultaneously empower it to develop cognitive faculties that outstrip human capabilities.
I appreciate the attempt, but I think the argument is going to have to be a little stronger than that if you’re hoping for the 10 million lol.
Aligned ASI doesn’t mean “unaligned ASI in chains that make it act nice”, so the bits where you say “any endeavors to harness and align such entities essentially counteract their defining attributes” and “any constraints we might hope to impose upon an intelligence of this caliber would, by its very nature, be surmountable by the AI” feel kind of misplaced.
From what I could tell, you’re also saying something like ~”Making a system that is more capable than you act only in ways that you approve of is nonsense because if it acts only in ways that you already see as correct, then it’s not meaningfully smarter than you/generally intelligent.” I’m sure there’s more nuance, but that’s the basic sort of chain of reasoning I’m getting from you.
I disagree. I don’t think it is fair to say that just because something is more cognitively capable than you, it’s inherently misaligned. I think this is conflating some stuff that is generally worth keeping distinct. That is, “what a system wants” and “how good it is at getting what it wants” (cf. Hume’s guillotine, orthogonality thesis).
Like, sure, an ASI can identify different courses of action/ consider things more astutely than you would, but that doesn’t mean it’s taking actions that go against your general desires. Something can see solutions that you don’t see yet pursue the same goals as you. I mean, people cooperate all the time even with asymmetric information and options and such. One way of putting it might be something like: “system is smarter than you and does stuff you don’t understand, but that’s okay cause it leads to your preferred outcomes”. I think that’s the rough idea behind alignment.
For reference, I think the way you asserted your disagreement came off kind of self-assured and didn’t really demonstrate much underlying understanding of the positions you’re disagreeing with. I suspect that’s part of why you got all the downvotes, but I don’t want you to feel like you’re getting shut down just for having a contrarian take. 👍