Research coordinator of Stop/Pause area at AI Safety Camp.
See explainer on why AGI could not be controlled enough to stay safe:
lesswrong.com/posts/xp6n2MG5vQkPpFEBH/the-control-problem-unsolved-or-unsolvable
I kinda appreciate you being honest here.
Your response is also emblematic of what I find concerning here, which is that you are not offering a clear argument of why something does not make sense to you before writing ‘crank’.
Writing that you do not find something convincing is not an argument – it’s a statement of conviction, which could just as well reflect a poor understanding of the argument or not taking the time to question one’s own premises. Because it’s not transparent about one’s thinking, yet still comes across as if there must be legitimate thinking underneath, it can be used as a deflection tactic (I don’t think you are doing that, but others who did not engage much ended the discussion on that note). Frankly, I can’t convince someone if they’re not open to the possibility of being convinced.
I explained above why I think your opinion is flawed – the opinion that ASI would be so powerful that it could cancel out all evolutionary selection across its constituent components (or at least across anything that, through some pathway, could build up to lethality).
I similarly found Quintin’s counter-arguments (eg. hinging on modelling AGI as trackable internal agents) to be premised on assumptions that, when considered comprehensively, looked very shaky.
I understand why discussing this feels draining for you. But that does not justify writing ‘crank’ when you have not had the time to examine the actual argumentation (note: you introduced the word ‘crank’ in this thread; Oliver wrote something else).
Overall, this is bad for community epistemics. It’s better if you can write what you thought was unsound about my thinking, and I can write what I found unsound about yours. Barring that exchange, some humility that you might be missing stuff is well-placed.
Besides this point, the respect is mutual.
Lucius, the text exchanges I remember us having during AISC6 were about the question of whether ‘ASI’ could comprehensively control for the evolutionary pressures it would be subjected to. You and I were commenting on a GDoc with Forrest. I was taking your counterarguments against his arguments seriously – continuing to investigate those counterarguments after you had bowed out.
You held the notion that ASI would be so powerful that it could control for any of its downstream effects that evolution could select for. This is a common opinion held in the community. But I’ve looked into this opinion and people’s justifications for it enough to consider it an unsound opinion.[1]
I respect you as a thinker, and generally think you’re a nice person. It’s disappointing that you wrote me off as a crank in one sentence. I expect more care, including that you also question your own assumptions.
A shortcut way of thinking about this:
The more you increase ‘intelligence’ (as a capacity for transforming patterns in data), the more you have to increase the number of underlying information-processing components. But the corresponding increase in the degrees of freedom those components have – in their interactions with each other and with their larger surroundings – grows faster.
This results in a strict inequality between:
1. the space of possible downstream effects that evolution can select across; and
2. the subspace of effects that the ‘ASI’ (or any control system connected with/in the ASI) could detect, model, simulate, evaluate, and correct for.
The hashiness model is a toy model for demonstrating this inequality (incl. how the mismatch between 1. and 2. grows over time). Anders Sandberg and two mathematicians are working on formalising that model at AISC.
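To make the shape of that inequality concrete, here is a rough toy illustration in Python – not the hashiness model itself, just made-up growth assumptions: the space of possible interaction effects grows combinatorially with the number of components, while the detect-model-simulate-evaluate-correct capacity is assumed to grow only polynomially.

```python
from math import comb

def effect_space(n_components: int, max_order: int = 3) -> int:
    """Toy stand-in for the space of possible downstream effects:
    count interactions between components up to a given order."""
    return sum(comb(n_components, k) for k in range(2, max_order + 1))

def control_capacity(n_components: int, exponent: float = 2.0) -> int:
    """Toy assumption: the subspace of effects a control system can
    detect, model, simulate, evaluate, and correct grows polynomially."""
    return int(n_components ** exponent)

for n in (10, 100, 1_000, 10_000):
    total = effect_space(n)
    covered = control_capacity(n)
    print(f"components={n:>6}  effects={total:.2e}  "
          f"controllable={covered:.2e}  fraction covered={covered / total:.2e}")
```

Under these illustrative assumptions, the fraction of effects covered shrinks as the system scales – which is the kind of growing mismatch the hashiness model is meant to demonstrate rigorously.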
There’s more that can be discussed in terms of why and how this fully autonomous machinery is subjected to evolutionary pressures. But that’s a longer discussion, and often the researchers I talked with lacked the bandwidth.
I agree that Remmelt seems kind of like he has gone off the deep end
Could you be specific here?
You are sharing a negative impression (“gone off the deep end”), but not what it is based on. This puts me and others in a position of not knowing whether you are e.g. reacting with a quick, broad-strokes impression, and/or pointing to specific instances of dialogue that I handled poorly and could improve on, and/or revealing a fundamental disagreement between us.
For example, is it because on Twitter I spoke up against generative AI models that harm communities, and this seems somehow strategically bad? Do you not like the intensity of my messaging? Or do you intuitively disagree with my arguments about AGI being insufficiently controllable?
As is, this is dissatisfying. On this forum, I’d hope[1] there is a willingness to discuss differences in views first, before moving to broadcasting subjective judgements[2] about someone.
Even though that would be my hope, it’s no longer my expectation. There’s an unhealthy dynamic on this forum, where 3+ times I noticed people moving to sideline someone with unpopular ideas, without much care.
To give a clear example, someone else listed vaguely dismissive claims about research I support. Their comment lacked factual grounding but still got upvotes. When I replied to point out things they were missing, my reply got downvoted into the negative.
I guess this is a normal social response on most forums. It is naive of me to hope that on LessWrong it would be different.
This particularly needs to be done with care if the judgement is given by someone seen as having authority (because others will take it at face value), and if the judgement is guarding default notions held in the community (because that supports an ideological filter bubble).
For example, it might be the case that, for some reason, alignment would have been solved if and only if Abraham Lincoln hadn’t been assassinated in 1865. That would mean that humans in 2024 in our world (where Lincoln was assassinated in 1865) will not be able to solve alignment, despite it being solvable in principle.
With this example, you might still assert that “possible worlds” are world states reachable through physics from past states of the world. Ie. you could still assert that alignment possibility is path-dependent from historical world states.
But you seem to mean something broader with “possible worlds”. Something like “in theory, there is a physically possible arrangement of atoms/energy states that would result in an ‘aligned’ AGI, even if that arrangement of states might not be reachable from our current or even a past world”.
–> Am I interpreting you correctly?
Alignment is a broad word, and I don’t really have the authority to interpret stranger’s words in a specific way without accidentally misrepresenting them.
Your saying this shows the ambiguity involved in trying to understand what different people mean. One researcher can make a technical claim about the possibility/tractability of “alignment” that is worded similarly to a technical claim others made. Yet their meaning of “alignment” could be quite different.
It’s hard then to have a well-argued discussion, because you don’t know whether people are equivocating (ie. switching between different meanings of the term).
one article managed to find six distinct interpretations of the word:
That’s a good summary list! I like the inclusion of “long-term outcomes” in P6. In contrast, P4 could just entail short-term problems that were specified by a designer or user who did not give much thought to long-term repercussions.
The way I deal with the wildly varying uses of the term “alignment” is to use a minimum definition that most of those six interpretations are consistent with – one where (almost) everyone would agree that an AGI not meeting it would be clearly unaligned:
Alignment is, at the minimum, the control of the AGI’s components (as modified over time) such that, with probability above some guaranteeable high floor, they do not propagate effects that cause the extinction of humans.
Thanks!
With ‘possible worlds’, do you mean ‘possible to be reached from our current world state’?
And what do you mean by ‘alignment’? I know that can sound like an unnecessary question. But if it’s not specified, how can people soundly assess whether it is technically solvable?
Thanks! When you say “in the space of possible mathematical things”, do you mean “hypothetically possible in physics” or “possible in the physical world we live in”?
Here’s how I specify terms in the claim:
AGI is a set of artificial components, connected physically and/or by information signals over time, that in aggregate sense and act autonomously over many domains.
‘artificial’ as configured out of a (hard) substrate that can be standardised to process inputs into outputs consistently (vs. what our organic parts can do).
‘autonomously’ as continuing to operate without needing humans (or any other species that share a common ancestor with humans).
Alignment is, at the minimum, the control of the AGI’s components (as modified over time) such that, with probability above some guaranteeable high floor, they do not propagate effects that cause the extinction of humans.
Control is the implementation of (a) feedback loop(s) through which the AGI’s effects are detected, modelled, simulated, compared to a reference, and corrected.
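For illustration only (none of this is from the definitions above, and the function names are placeholders), the control definition can be read as a single feedback-loop structure. A minimal Python sketch:

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable

@dataclass
class ControlLoop:
    """Minimal sketch of the control definition above:
    one feedback loop over the AGI's propagated effects."""
    detect: Callable[[], Iterable[Any]]      # sense effects in the environment
    model: Callable[[Iterable[Any]], Any]    # model the detected effects
    simulate: Callable[[Any], Any]           # project the modelled effects forward
    compare: Callable[[Any, Any], Any]       # compare projection against a safety reference
    correct: Callable[[Any], None]           # apply corrections to the AGI's components

    def step(self, reference: Any) -> None:
        effects = self.detect()
        modelled = self.model(effects)
        projected = self.simulate(modelled)
        deviation = self.compare(projected, reference)
        self.correct(deviation)
```

The impossibility arguments are then about whether any such loop (or set of loops) could cover enough of the space of effects, comprehensively and fast enough.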
Good to know. I also quoted your more detailed remark on AI Standards Lab at the top of this post.
I have made so many connections that have been instrumental to my research.
I didn’t know this yet, and glad to hear! Thank you for the kind words, Nell.
Fair question. You can assume it is AoE.
Research leads are not going to be too picky about what hour you send the application in, so there is no need to worry about the exact deadline. Even if you send in your application on the next day, that probably won’t significantly impact your chances of getting picked up by your desired project(s).
Sooner is better, since many research leads will begin composing their teams after the 17th, but there is no hard cut-off point.
Thanks! These are thoughtful points. See some clarifications below:
AGI could be very catastrophic even when it stops existing a year later.
You’re right. I’m not even covering all the other bad stuff that could happen in the short term – harms we might still be able to prevent, like AGI triggering global nuclear war.
What I’m referring to is unpreventable convergence on extinction.
If AGI makes earth uninhabitable in a trillion years, that could be a good outcome nonetheless.
Agreed – that could be a good outcome, if it were attainable.
In practice, the convergence reasoning is about total human extinction happening within 500 years after ‘AGI’ has been introduced into the environment (with only a very small remainder of probability beyond that).
In theory, of course, to converge toward a 100% chance you are reasoning across a timeline of potentially infinite span.
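One simplified way to see that convergence (my own sketch, under the strong assumption of a constant lower bound on per-period risk, which is not argued for here):

```latex
% Assume each period carries at least probability \epsilon > 0 of an
% uncorrected lethal effect while the fully autonomous machinery persists. Then
P(\text{no extinction after } n \text{ periods}) \;\le\; (1 - \epsilon)^n \;\xrightarrow{\; n \to \infty \;}\; 0 .
```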
I don’t know whether that covers “humans can survive on mars with a space-suit”,
Yes, it does cover that. Whatever technological means we could think of for shielding ourselves – or that ‘AGI’ could come up with as (temporary) barriers against the human-toxic landscape it creates – would still not be enough.
if humans evolve/change to handle situations that they currently do not survive under
Unfortunately, this is not workable. The mismatch between the (expanding) set of conditions needed for maintaining/increasing configurations of the AGI’s artificial hardware and the set needed by our human organic wetware is too great.
Also, if you try entirely changing our underlying substrate to the artificial substrate, you’ve basically removed the human and are left with ‘AGI’. The lossy scans of human brains ported onto hardware would no longer feel as ‘humans’ can feel, and would be further changed/selected to fit with their artificial substrate. This is because what humans feel and express as emotions is grounded in the distributed and locally context-dependent functioning of organic molecules (eg. hormones) in our bodies.
Update: reverting my forecast back to an 80% likelihood for these reasons.
I’m also feeling less “optimistic” about an AI crash given:
The election result involving a bunch of tech investors and execs pushing for influence through Trump’s campaign (with a stated intention to deregulate tech).
A military veteran saying that the military could be holding up the AI industry like “Atlas holding the globe”, and an AI PhD saying that hyperscaled data centers, deep learning, etc, could be super useful for war.
I will revise my previous forecast back to 80%+ chance.
Yes, I agree formalisation is needed. See comment by flandry39 in this thread on how one might go about doing so.
Worth considering is that there are actually two aspects that make it hard to define the term ‘alignment’ in a way that allows for sufficiently rigorous reasoning:
1. It must allow for logically valid reasoning (therefore requiring formalisation).
2. It must allow for empirically sound reasoning (ie. the premises correspond with how the world works).
In my reply above, I did not help you much with (1.). Though even while still using the English language, I managed to restate a vague notion of alignment in more precise terms.
Notice how it does help to define the correspondences with how the world works (2.):
“That ‘AGI’ continuing to exist, in some modified form, does not result eventually in changes to world conditions/contexts that fall outside the ranges that existing humans could survive under.”
The reason why 2. is important is that formalisation alone is not enough. Just describing and/or deriving logical relations between mathematical objects does not say anything about the physical world. Somewhere in your fully communicated definition there also needs to be a description of how the mathematical objects correspond with real-world phenomena. Often, mathematicians do this by talking to collaborators about what the symbols mean while they scribble the symbols out on eg. a whiteboard.
But whatever way you do it, you need to communicate how the definition corresponds to things happening in the real world, in order to show that it is a rigorous definition. Otherwise, others could still critique you that the formally precise definition is not rigorous, because it does not adequately (or explicitly) represent the real-world problem.
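As a gesture at 1., here is one way the minimal-threshold definition could start to be written down formally. This is only a sketch, with symbols I am introducing here, and it still needs all the correspondence work described under 2.:

```latex
% W_t    : world conditions/contexts at time t
% H      : the range of world conditions under which existing humans can survive
% A_t    : the AGI's components, as modified up to time t
% \delta : the (small) probability remainder below the guaranteeable floor
%
% Minimal-threshold alignment (sketch):
\forall t \ge t_0 :\quad
\big(\, A_t \text{ continues to exist, in some modified form} \,\big)
\;\Rightarrow\;
P\big( W_t \in H \big) \;\ge\; 1 - \delta .
```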
For an overview of why such a guarantee would turn out impossible, suggest taking a look at Will Petillo’s post Lenses of Control.
Defining alignment (sufficiently rigorous so that a formal proof of (im)possibility of alignment is conceivable) is a hard thing!
It’s less hard than you think, if you use a minimal-threshold definition of alignment:
That “AGI” continuing to exist, in some modified form, does not result eventually in changes to world conditions/contexts that fall outside the ranges that existing humans could survive under.
Yes, I think there is a more general proof available. This proof form would combine limits to predictability (and so on) with a lethal dynamic that falls outside those limits.
The question is more if it can ever be truly proved at all, or if it doesn’t turn out to be an undecidable problem.
Control limits can show that it is an undecidable problem.
A limited scope of control can in turn be used to prove that a dynamic convergent on human lethality is uncontrollable. That would be a basis for an impossibility proof by contradiction (ie. AGI effects cannot be controlled to stay in line with human safety).
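In schematic form (my own restatement of that proof form, using sets I am introducing here only for illustration):

```latex
% E : space of possible downstream effects of the AGI
% C : subset of effects its control loops can detect, model, simulate, evaluate, and correct
% L : dynamic of effects convergent on human lethality
%
% Premise (control limits):            C \subsetneq E
% Premise (lethal dynamic uncovered):  L \not\subseteq C
% Controlling AGI effects to stay in line with human safety would require
% every effect in L to be correctable, i.e.
L \subseteq C ,
% which contradicts the second premise. Hence such control is unavailable.
```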
Awesome directions. I want to bump this up.
This might include AGI predicting its own future behaviour, which is kind of essential for it to stick to a reliably aligned course of action.
There is a simple way of representing this problem that already shows the limitations.
Assume that AGI continues to learn new code from observations (inputs from the world) – since learning is what allows the AGI to stay autonomous and adaptable in acting across changing domains of the world.
Then, in order for the AGI’s current code to make predictions about the relevant functioning of its future code (see the sketch after this list):
Current code has to predict what future code will be learned from future unknown inputs (there would be no point in learning then if the inputs were predictable and known ahead of time).
Also, current code has to predict how the future code will compute subsequent unknown inputs into outputs, presumably using some shortcut algorithm that can infer relevant behavioural properties across the span of possible computationally-complex code.
Further, current code would have to predict how the outputs would result in relevant outside effects (where relevant to sticking to a reliably human-aligned course of action)
Where it is relevant how some of those effects could feed back into sensor inputs (and therefore could cause drifts in the learned code and the functioning of that code).
Where other potential destabilising feedback loops are also relevant, particularly that of evolutionary selection.
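A minimal structural sketch of those three steps in Python. All names here are hypothetical stand-ins introduced for illustration; the point is only where the unknowns sit, not any particular implementation:

```python
from typing import Callable, List

Code = Callable[[List[float]], List[float]]   # hypothetical stand-in for runnable learned code

def predict_future_behaviour(
    current_code: Code,
    learn: Callable[[Code, List[float]], Code],
    future_inputs: List[float],        # unknown by definition: if these were known ahead
                                       # of time, there would be no point in learning
    later_inputs: List[float],         # also unknown: inputs the future code will process
    propagate: Callable[[List[float]], List[float]],  # how outputs become outside effects
) -> List[float]:
    """What the current code would have to compute to predict the relevant
    functioning and effects of its future, yet-to-be-learned code."""
    future_code = learn(current_code, future_inputs)    # step 1: predict the learned code
    future_outputs = future_code(later_inputs)          # step 2: predict its computations
    return propagate(future_outputs)                    # step 3: predict the outside effects,
                                                        # some of which feed back into future inputs
```

Each argument marked as unknown would have to be predicted (or bounded) before it is available, which is where the limits enter.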
Good to know that this is why you think AI Safety Camp is not worth funding.
Once a core part of the AGI non-safety argument is put into maths, so that it is comprehensible for people in your circle, it’d be interesting to see how you respond.