In practice, engineers know that complex architectures interacting with the surrounding world end up having functional failures (because of unexpected interactive effects, or noisy interference). With AGI, we are talking about an architecture that would replace all our jobs and move on to managing conditions across our environment. If AGI continues to persist in some form over time, failures will occur and build up toward lethality at some unknown rate. Over a long enough period, this repeated potential for uncontrolled failures pushes the risk of human extinction above 99%.
This part is invalid, I think.
My understanding of this argument is: 1) There is an extremely powerful agent, so powerful that if it wanted to it could cause human extinction. 2) There is some risk of its goal-related systems breaking, and this risk doesn’t rapidly decrease over time. Therefore the risk adds up over time and converges toward 1.
This argument doesn’t work because the two premises won’t hold. For 2) An obvious consideration for any reflective agent is to find ways to reduce the risk of goal-related failure. For 1) Decentralizing away from a single point of failure is another obvious step that one would take in a post-ASI world.
So the risk of everyone dying should only come from a relatively short period after an agent (or agents) become powerful enough that killing everyone is an ~easy option.
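To make the disagreement over premise 2 concrete, here is a minimal sketch (an illustration added here, not taken from the post) of how cumulative risk behaves under two assumed per-year failure probabilities. The function name `cumulative_risk`, the 1% starting rate, the halving schedule, and the independence between years are all hypothetical choices.

```python
from typing import Callable


def cumulative_risk(hazard: Callable[[int], float], years: int) -> float:
    """Probability of at least one failure within `years`,
    assuming each year's failure is independent of the others."""
    survival = 1.0
    for year in range(years):
        survival *= 1.0 - hazard(year)
    return 1.0 - survival


# Assumption: a constant 1% chance of lethal failure per year.
def constant(year: int) -> float:
    return 0.01


# Assumption: the same 1% start, but halved every year.
def halving(year: int) -> float:
    return 0.01 * 0.5 ** year


if __name__ == "__main__":
    for years in (10, 100, 1000):
        print(years, cumulative_risk(constant, years), cumulative_risk(halving, years))
```

Under the constant rate the cumulative risk passes 99% well before 1,000 years; under the halving schedule it stays below 2% however long the loop runs. Which of these regimes applies is what the two sides dispute.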
I’ve reread and my understanding of point 3 remains the same. I wasn’t trying to summarize points 1-5, to be clear. And by “goal-related systems” I just meant whatever is keeping track of the outcomes being optimized for.
Perhaps you could point me to my misunderstanding?
An obvious consideration for any reflective agent is to find ways to reduce the risk of goal-related failure.
…
by “goal-related systems” I just meant whatever is keeping track of the outcomes being optimized for.
So the argument for 3. is that just by AGI continuing to operate and maintain its components as adapted to a changing environment, the machinery can accidentally end up causing destabilising effects that were untracked or otherwise insufficiently corrected for.
You could call this a failure of the AGI’s goal-related systems if by that you mean that the machinery failed to control its external effects in line with internally represented goals.
But this would be a problem with the control process itself.
An obvious consideration for any reflective agent is to find ways to reduce the risk of goal-related failure.
Unfortunately, there are fundamental limits that cap the extent to which the machinery can improve its own control process.
Any of the machinery’s external downstream effects that its internal control process cannot track (i.e. detect, model, simulate, and identify as a “goal-related failure”), that process cannot correct for.
For further explanation, please see links under point 4.
Decentralizing away from a single point of failure is another obvious step that one would take in a post-ASI world.
The problem here is that (a) we are talking about not just a complicated machine product but self-modifying machinery, and (b) at the scale this machinery would be operating at, it cannot account for most of the potential human-lethal failures that could result.
For (a), notice how easily feedback processes can become unsimulatable for such unfixed open-ended architectures.
E.g. How can AGI code predict how its future code learned from unknown inputs will function in processing subsequent unknown inputs? What if future inputs are changing as a result of effects propagated across the larger environment from previous AGI outputs? And those outputs were changing as a result of previous new code that was processing signals in connection with other code running across the machinery? And so on.
For (b), engineering decentralised redundancy can help especially at the microscale.
E.g. correcting for bit errors.
But what does it mean to correct for failures at the level of local software (bugs, viruses, etc)? What does it mean to correct for failures across some decentralised server network? What does it mean to correct for failures at the level of an entire machine ecosystem (which AGI effectively becomes)?
~
Scaling up the connected components exponentially increases their degrees of freedom of interaction. And as those components change in feedback with surrounding contexts of the environment (as they must for AGI to autonomously adapt), an increasing portion of the possible human-lethal failures cannot be adequately controlled for by the system itself.
You could call this a failure of the AGI’s goal-related systems if by that you mean that the machinery failed to control its external effects in line with internally represented goals.
But this would be a problem with the control process itself.
So it’s the AI being incompetent?
Unfortunately, there are fundamental limits that cap the extent to which the machinery can improve its own control process.
Yeah, I think that would be a good response to my argument against premise 2). I’ve had a quick look at the list of theorems in the paper. I don’t know most of them, but the ones I do know don’t seem to support the point you’re making. So I don’t buy it. Could you walk me through how one of these theorems is relevant to capping self-improvement of reliability?
For (a), notice how easily feedback processes can become unsimulatable for such unfixed open-ended architectures.
You don’t have to simulate something to reason about it.
E.g. How can AGI code predict how its future code learned from unknown inputs will function in processing subsequent unknown inputs?
Garrabrant induction shows one way of doing self-referential reasoning.
But what does it mean to correct for failures at the level of local software (bugs, viruses, etc)? What does it mean to correct for failures across some decentralised server network? What does it mean to correct for failures at the level of an entire machine ecosystem (which AGI effectively becomes)?
As an analogy: Use something more like democracy than like dictatorship, such that any one person going crazy can’t destroy the world/country, as a crazy dictator would.
Yes, but in the sense that there are limits to the AGI’s capacity to sense, model, simulate, evaluate, and correct its own component effects propagating through a larger environment.
You don’t have to simulate something to reason about it.
If you can’t simulate (and therefore predict) that a failure mode that by default is likely to happen would happen, then you cannot counterfactually act to prevent the failure mode.
Could you walk me through how one of these theorems is relevant to capping self-improvement of reliability?
Maybe take a look at the hashiness model of AGI uncontainability. That’s an elegant way of representing the problem (instead of pointing at lots of examples of theorems that show limits to control).
This is not put into mathematical notation yet though. Anders Sandberg is working on it, but also somewhat distracted. Would value your contribution/thinking here, but I also get if you don’t want to read through the long transcripts of explanation at this stage. See project here.
Anders’ summary: “A key issue is the thesis that AGI will be uncontrollable in the sense that there is no control mechanism that can guarantee aligned behavior, since the more complex and abstract the target behavior is, the amount of resources and forcing ability needed become unattainable.
In order to analyse this better a sufficiently general toy model is needed for how controllable systems of different complexity can be, that ideally can be analysed rigorously.
One such model is to study families of binary functions parametrized by their circuit complexity and their “hashiness” (how much they mix information) as an analog for the AGI and the alignment model, and the limits to finding predicates that can keep the alignment system making the AGI analog producing a desired output.”
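Since the model is not yet formalised, the following is only a toy illustration of what "how much a binary function mixes information" could mean. It is not Anders' model. It uses average single-bit-flip sensitivity as an assumed proxy for hashiness, and the two predicates (`threshold`, `hashed`) are hypothetical examples chosen for contrast.

```python
import hashlib
from itertools import product


def sensitivity(predicate, n_bits: int) -> float:
    """Fraction of (input, bit) pairs where flipping that one bit
    flips the predicate's output. Used here as a crude, assumed
    stand-in for "how much the function mixes information"."""
    flips = 0
    total = 0
    for bits in product((0, 1), repeat=n_bits):
        base = predicate(bits)
        for i in range(n_bits):
            flipped = bits[:i] + (1 - bits[i],) + bits[i + 1:]
            flips += predicate(flipped) != base
            total += 1
    return flips / total


def threshold(bits) -> bool:
    """Low-mixing predicate: true when at least 3/4 of the bits are set."""
    return 4 * sum(bits) >= 3 * len(bits)


def hashed(bits) -> bool:
    """High-mixing predicate: lowest bit of a SHA-256 digest of the input."""
    return (hashlib.sha256(bytes(bits)).digest()[0] & 1) == 1


if __name__ == "__main__":
    print("threshold:", sensitivity(threshold, 10))
    print("hashed:   ", sensitivity(hashed, 10))
```

For the threshold predicate only a small share of single-bit flips change the output, so a simple check on the bit count tracks it. For the hashed predicate roughly half of all flips do, so neighbouring inputs tell you little about each other.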
Garrabrant induction shows one way of doing self-referential reasoning.
We’re talking about learning from inputs received from a more complex environment (through which AGI outputs also propagate as changed effects of which some are received as inputs).
Does Garrabrant take that into account in his self-referential reasoning?
As an analogy: Use something more like democracy than like dictatorship, such that any one person going crazy can’t destroy the world/country, as a crazy dictator would.
A human democracy is composed out of humans with similar needs. This turns out to be an essential difference.
How about I assume there is some epsilon such that the probability of an agent going off the rails is greater than epsilon in any given year. Why can’t the agent split into multiple ~uncorrelated agents and have them each control some fraction of resources (maybe space) such that one off-the-rails agent can easily be fought and controlled by the others? This should reduce the risk to some fraction of epsilon, right?
(I’m gonna try and stay focused on a single point, specifically the argument that leads up to >99%, because that part seems wrong for quite simple reasons).
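A rough sketch of the arithmetic behind this proposal, under assumptions added purely for illustration: `n` agents each fail independently with probability `epsilon` per year, and catastrophe requires a strict majority to fail in the same year (so that the rest cannot contain them). The names `majority_failure_prob`, `n`, and `epsilon` are hypothetical.

```python
from math import comb


def majority_failure_prob(n: int, epsilon: float) -> float:
    """Chance that more than half of n agents fail in the same year,
    assuming failures are fully independent (a strong assumption)."""
    threshold = n // 2 + 1  # smallest strict majority
    return sum(
        comb(n, k) * epsilon**k * (1.0 - epsilon) ** (n - k)
        for k in range(threshold, n + 1)
    )


if __name__ == "__main__":
    epsilon = 0.01  # assumed per-agent, per-year failure probability
    for n in (1, 3, 5, 9):
        print(n, majority_failure_prob(n, epsilon))
```

With independent failures the per-year figure drops steeply as `n` grows. The result leans entirely on the independence assumption; correlated failures would shrink the benefit.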
How about I assume there is some epsilon such that the probability of an agent going off the rails
Got it. So we are both assuming that there would be some accumulative failure rate [per point 3.].
Why can’t the agent split into multiple ~uncorrelated agents and have them each control some fraction of resources (maybe space) such that one off-the-rails agent can easily be fought and controlled by the others?
I tried to adopt this ~uncorrelated agents framing, and then argue from within that. But I ran up against some problems with this framing:
It assumes there are stable boundaries between “agents” that allow us to mark them as separate entities. This kinda works for us as physically bounded and communication-bottlenecked humans. But in practice it wouldn’t really work to define “agent” separations within a larger machine network maintaining its own existence in the environment. (Also, it is not clear to me how failures of those defined “agent” subsets would necessarily be sufficiently uncorrelated – as an example, if the failure involves one subset hijacking the functioning of another subset, their failures become correlated.)
It assumes that if any (physical or functional) subset of this adaptive machinery happens to gain any edge in influencing the distributed flows of atoms and energy back towards its own growth, the other machinery subsets can robustly “control” for that.
It assumes a macroscale-explanation of physical processes that build up from the microscale. Agreed that the concept of agents owning and directing the allocation of “resources” is a useful abstraction, but it also involves holding a leaky representation of what’s going on. Any argument for control using that representation can turn out not to capture crucial aspects.
It raises the question of what “off-the-rails” means here. This gets us into the hashiness model: Consider the space of possible machinery output sequences over time. How large is the subset of output sequences that in their propagation as (cascading) environmental effects would end up lethally disrupting the bodily functioning of humans? How is the accumulative probability of human extinction distributed across the entire output possibility space (or simplified: how mixed are the adjoining lethal and non-lethal possibility subspaces)? Can any necessarily less complex control system connected with/in this machinery actually keep tracking whether possible machinery outputs fall into the lethal sub-space or the non-lethal sub-space?
→ Do those problems make sense to you as stated? Do you notice anything missing there?
To sum it up, you and I are still talking about a control system [per point 4.]:
However you define the autonomous “agents”, they are still running through code running across connected hardware.
There are limits to the capacity of this aggregate machinery to sense, model, simulate, evaluate, and correct its own component effects propagating through a larger environment.
I’m gonna try and stay focused on a single point, specifically the argument that leads up to >99%
I’m also for now leaving aside substrate-needs convergence [point 5]:
That the entire population of nested/connected machine components would be pulled toward a human-lethal attractor state.
Ah, that’s actually not the argument.
Could you try reading points 1-5 again?
I’ve reread and my understanding of point 3 remains the same. I wasn’t trying to summarize points 1-5, to be clear. And by “goal-related systems” I just meant whatever is keeping track of the outcomes being optimized for.
Perhaps you could point me to my misunderstanding?
Appreciating your openness.
(Just making dinner – will get back to this when I’m behind my laptop in around an hour).
I appreciate that you tried. If words are failing us to this extent, I’m going to give up.