Research coordinator of Stop/Pause area at AI Safety Camp.
See explainer on why AGI could not be controlled enough to stay safe:
lesswrong.com/posts/xp6n2MG5vQkPpFEBH/the-control-problem-unsolved-or-unsolvable
Thanks! These are thoughtful points. See some clarifications below:
AGI could be very catastrophic even when it stops existing a year later.
You’re right. I’m not even covering all the other bad things that could happen in the short term, which we might still be able to prevent, like AGI triggering global nuclear war.
What I’m referring to is unpreventable convergence on extinction.
If AGI makes earth uninhabitable in a trillion years, that could be a good outcome nonetheless.
Agreed, that could be a good outcome, if it were attainable.
In practice, the convergence reasoning is about total human extinction happening within 500 years after ‘AGI’ has been introduced into the environment (with only a very small residual probability beyond that).
In theory, of course, converging toward a 100% chance means reasoning over a timeline of potentially infinite span.
I don’t know whether that covers “humans can survive on mars with a space-suit”,
Yes, it does cover that. Whatever technological means we could think of for shielding ourselves, or that ‘AGI’ could come up with as (temporary) barriers against the human-toxic landscape it creates, still would not be enough.
if humans evolve/change to handle situations that they currently do not survive under
Unfortunately, this is not workable. The mismatch between the (expanding) set of conditions needed to maintain and grow configurations of the AGI’s artificial hardware and the set needed by our organic human wetware is too great.
Also, if you try entirely swapping our underlying substrate for the artificial substrate, you have basically removed the human and are left with ‘AGI’. Lossy scans of human brains ported onto hardware would no longer feel as ‘humans’ can feel, and would be further changed/selected to fit their artificial substrate. This is because what humans feel and express as emotions is grounded in the distributed and locally context-dependent functioning of organic molecules (eg. hormones) in our bodies.
Update: reverting my forecast back to an 80% likelihood, for the reasons below.
I’m also feeling less “optimistic” about an AI crash given:
The election result involving a bunch of tech investors and execs pushing for influence through Trump’s campaign (with a stated intention to deregulate tech).
A military veteran saying that the military could be holding up the AI industry like “Atlas holding the globe”, and an AI PhD saying that hyperscale data centers, deep learning, etc., could be super useful for war.
I will revise my previous forecast back to 80%+ chance.
Yes, I agree formalisation is needed. See the comment by flandry39 in this thread on how one might go about doing so.
Worth considering is that there are actually two aspects that make it hard to define the term ‘alignment’ so as to allow for sufficiently rigorous reasoning:
It must allow for logically valid reasoning (therefore requiring formalisation).
It must allow for empirically sound reasoning (ie. the premises correspond with how the world works).
In my reply above, I did not help you much with (1). Still, even using plain English, I managed to restate a vague notion of alignment in more precise terms.
Notice how it does help to define the correspondences with how the world works (2):
“That ‘AGI’ continuing to exist, in some modified form, does not result eventually in changes to world conditions/contexts that fall outside the ranges that existing humans could survive under.”
The reason why (2) is important is that formalisation alone is not enough. Just describing and/or deriving logical relations between mathematical objects does not by itself say anything about the physical world. Somewhere in your fully communicated definition there also needs to be a description of how the mathematical objects correspond with real-world phenomena. Often, mathematicians do this by talking with collaborators about what the symbols mean while they scribble them out on eg. a whiteboard.
But whichever way you do it, you need to communicate how the definition corresponds to things happening in the real world in order to show that it is a rigorous definition. Otherwise, others could still object that the formally precise definition is not rigorous, because it does not adequately (or explicitly) represent the real-world problem.
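As a toy rendering of what the formalised side could look like (a sketch of my own, with symbols I am introducing here for illustration, not a settled formalisation):

```latex
% Toy formalisation sketch (symbols introduced here for illustration only).
% W(t): world conditions/contexts at time t after 'AGI' is introduced.
% S:    the ranges of conditions that existing humans can survive under.
\text{aligned}(\mathrm{AGI}) \;\Longrightarrow\; \forall\, t \ge 0 :\; W(t) \in S
```

The correspondence work of (2) is then the claim that W(t) and S actually track real-world conditions and real human survival ranges – which is exactly the part that cannot be skipped.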
For an overview of why such a guarantee would turn out impossible, I suggest taking a look at Will Petillo’s post Lenses of Control.
Defining alignment (sufficiently rigorous so that a formal proof of (im)possibility of alignment is conceivable) is a hard thing!
It’s less hard than you think, if you use a minimal-threshold definition of alignment:
That “AGI” continuing to exist, in some modified form, does not result eventually in changes to world conditions/contexts that fall outside the ranges that existing humans could survive under.
Yes, I think there is a more general proof available. This proof form would combine limits to predictability (and related limits) with a lethal dynamic that falls outside those limits.
The question is more if it can ever be truly proved at all, or if it doesn’t turn out to be an undecidable problem.
Control limits can show that it is an undecidable problem.
A limited scope of control can in turn be used to prove that a dynamic convergent on human lethality is uncontrollable. That would be a basis for an impossibility proof by contradiction (ie. AGI effects cannot be controlled to stay in line with human safety).
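A rough skeleton of that contradiction structure (my own illustrative rendering, with symbols introduced here, not the worked-out argument):

```latex
% Illustrative proof-by-contradiction skeleton (not the worked-out argument).
% C: the set of outside effects the AGI's control system can detect, model,
%    evaluate and correct in time.
% L: the set of effects of a dynamic convergent on human lethality.
\begin{align*}
&\text{Safety assumption:} && L \subseteq C && \text{(all lethal effects stay controllable)}\\
&\text{Control limits:}    && \exists\, e \in L : e \notin C && \text{(some lethal effects fall outside control)}\\
&\text{Therefore:}         && \bot && \text{(contradiction; the safety assumption cannot hold)}
\end{align*}
```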
Awesome directions. I want to bump this up.
This might include AGI predicting its own future behaviour, which is kind of essential for it to stick to a reliably aligned course of action.
There is a simple way of representing this problem that already shows the limitations.
Assume that AGI continues to learn new code from observations (inputs from the world) – since learning is what allows the AGI to stay autonomous and adaptable in acting across changing domains of the world.
Then, for the AGI’s current code to make predictions about the relevant functioning of its future code (see the toy sketch after this list):
Current code has to predict what future code will be learned from future, unknown inputs (there would be no point to learning if the inputs were predictable and known ahead of time).
Also, current code has to predict how the future code will compute subsequent unknown inputs into outputs, presumably using some shortcut algorithm that can infer relevant behavioural properties across the span of possible computationally complex code.
Further, current code would have to predict how the outputs would result in relevant outside effects (relevant to sticking to a reliably human-aligned course of action).
Where it is relevant how some of those effects could feed back into sensor inputs (and therefore could cause drifts in the learned code and the functioning of that code).
Where other potential destabilising feedback loops are also relevant, particularly that of evolutionary selection.
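To make the first point concrete, here is a toy sketch (hypothetical names, my own illustration, not a model of any actual design): a system that keeps learning from inputs it has not yet observed cannot, from its current code alone, compute what its future learned code will output.

```python
# Toy sketch (hypothetical names, illustration only): current code cannot
# compute its own future outputs, because the future code depends on
# observations that are unknown at prediction time.
import random
from typing import Callable

Policy = Callable[[float], float]

def learn(policy: Policy, observation: float) -> Policy:
    """Return updated 'code'; future behaviour now depends on the observation."""
    return lambda x: policy(x) + observation

def predict_own_future_output(current_policy: Policy, steps: int, x: float) -> float:
    """What the current code would need: its own output after `steps` more
    rounds of learning. The inputs that will shape that future code are not
    available in advance, so there is nothing to compute from."""
    raise NotImplementedError("future inputs are unknown at prediction time")

# What actually happens: behaviour is determined only by running through the
# observations as they arrive.
policy: Policy = lambda x: x
for _ in range(3):
    observation = random.uniform(-1.0, 1.0)  # unknown ahead of time
    policy = learn(policy, observation)      # the learned code changes
print(policy(1.0))  # knowable only after the fact
```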
Just found your insightful comment. I’ve been thinking about this for three years. Some thoughts expanding on your ideas:
my idea is more about whether alignment could require that the AGI is able to predict its own results and effects on the world (or the results and effects of other AGIs like it, as well as humans)...
In other words, alignment requires sufficient control. Specifically, it requires AGI to have a control system with enough capacity to detect, model, simulate, evaluate, and correct outside effects propagated by the AGI’s own components.
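A minimal sketch of the loop this would require (hypothetical names, my own illustration; the open question is whether any such loop could ever have enough capacity in practice):

```python
# Minimal control-loop sketch (hypothetical names, illustration only):
# detect, model/simulate, evaluate against reference values, correct.
from typing import Iterable, List

def detect(environment: dict) -> List[str]:
    """Sense outside effects propagated by the system's own components."""
    return environment.get("observed_effects", [])

def model_and_simulate(effects: Iterable[str]) -> List[str]:
    """Project how detected effects would unfold (stand-in for modelling)."""
    return [f"projected:{e}" for e in effects]

def evaluate(projected: Iterable[str], reference_values: set) -> List[str]:
    """Identify projected effects that fall outside the reference values."""
    return [p for p in projected if p not in reference_values]

def correct(misalignments: Iterable[str]) -> None:
    """Apply corrections for the identified misalignments (stand-in)."""
    for m in misalignments:
        print("correcting:", m)

def control_step(environment: dict, reference_values: set) -> None:
    effects = detect(environment)
    projected = model_and_simulate(effects)
    misalignments = evaluate(projected, reference_values)
    correct(misalignments)

control_step({"observed_effects": ["heat", "waste"]},
             reference_values={"projected:heat"})
```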
… and that proved generally impossible such that even an aligned AGI can only exist in an unstable equilibrium state in which there exist situations in which it will become unrecoverably misaligned, and we just don’t know which.
For example, what if AGI is in some kind of convergence basin where the changing situations/conditions tend to converge outside the ranges humans can survive under?
so we can assume that they will have to be somehow interpreted by the AGI itself who is supposed to hold them
There’s a problem you are pointing at: somehow mapping the various preferences – expressed over time by diverse humans from within their (perceived) contexts – onto reference values. This involves making (irreconcilable) normative assumptions about how to map the dimensionality of the raw expressions of preferences onto internal reference values. Basically, you’re dealing with NP-hard combinatorial optimisation, as encountered with the knapsack problem (see the toy example below).
Further, it raises the question of how to compare all the possible concrete outside effects of the machinery against the internal reference values, so as to identify misalignments/errors to correct. Ie. just internalising and holding abstract values is not enough – there would have to be some robust implementation process that translates the values into concrete effects.
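As a toy example of the kind of combinatorics meant here (my own illustration with made-up numbers): even the simplified task of choosing which expressed preferences to satisfy under a shared resource budget is already a knapsack-style optimisation problem.

```python
# Toy knapsack instance (made-up numbers, analogy only): choosing which
# preferences to satisfy under a resource budget is a 0/1 knapsack problem,
# whose decision version is NP-complete.

def knapsack(values, costs, budget):
    """0/1 knapsack via dynamic programming: maximise total value within budget."""
    best = [0] * (budget + 1)
    for value, cost in zip(values, costs):
        for b in range(budget, cost - 1, -1):
            best[b] = max(best[b], best[b - cost] + value)
    return best[budget]

# 'values' stand in for how much weight each preference is given;
# 'costs' for the resources needed to satisfy it.
print(knapsack(values=[6, 10, 12], costs=[1, 2, 3], budget=5))  # -> 22
```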
No actually, assuming the machinery has a hard substrate and is self-maintaining is enough.
we could create aligned ASI by simulating the most intelligent and moral people
This is not an existence proof, because it does not take into account the difference in physical substrates.
Artificial General Intelligence would be artificial, by definition. In fact, what allows for the standardisation of hardware components is that the (silicon) substrate stays hard at the temperatures and pressures humans live under. That allows configurations to stay compartmentalised and stable.
Human “wetware” has a very different substrate. It’s a soup of organic molecules constantly bouncing around and reacting at living temperatures and pressures.
Just found a podcast on OpenAI’s bad financial situation.
It’s hosted by someone in AI Safety (Jacob Haimes) and an AI post-doc (Igor Krawzcuk).
https://kairos.fm/posts/muckraiker-episodes/muckraiker-episode-004/
Noticing no response here after we addressed superficial critiques and moved to discussing the actual argument.
For those few interested in the questions raised above, Forrest wrote some responses: http://69.27.64.19/ai_alignment_1/d_241016_recap_gen.html
The claims made will feel unfamiliar, and so will the reasoning paths. I suggest (again) taking the time to consider what is meant. If a conclusion looks intuitively wrong from some AI Safety perspective, it may be valuable to explicitly consider the argumentation and premises behind it.
BTW if anyone does want to get into the argument, Will Petillo’s Lenses of Control post is a good entry point.
It’s concise and correct – a difficult combination to achieve here.
Resonating with you here! Yes, I think autonomous corporations (and other organisations) would result in society-wide extraction, destabilisation and totalitarianism.
Fair question. You can assume it is AoE.
Research leads are not going to be too picky about the exact hour you send your application in, so there is no need to worry about the precise deadline. Even if you send in your application the next day, that probably won’t significantly affect your chances of being picked up by your desired project(s).
Sooner is better, since many research leads will begin composing their teams after the 17th, but there is no hard cut-off point.