Yes, but in the sense that there are limits to the AGI’s capacity to sense, model, simulate, evaluate, and correct its own components’ effects as they propagate through a larger environment.
You don’t have to simulate something to reason about it.
If you can’t simulate (and therefore predict) that a failure mode which is likely to happen by default would in fact happen, then you cannot counterfactually act to prevent that failure mode.
You could walk me through how one of these theorems is relevant to capping self-improvement of reliability?
Maybe take a look at the hashiness model of AGI uncontainability. That’s an elegant way of representing the problem (instead of pointing at lots of examples of theorems that show limits to control).
This has not been put into mathematical notation yet, though. Anders Sandberg is working on it, but is also somewhat distracted. I would value your contribution/thinking here, but I also get it if you don’t want to read through the long transcripts of explanation at this stage. See the project here.
Anders’ summary: “A key issue is the thesis that AGI will be uncontrollable, in the sense that there is no control mechanism that can guarantee aligned behavior, since the more complex and abstract the target behavior is, the more unattainable the amount of resources and forcing ability needed becomes.
In order to analyse this better, a sufficiently general toy model is needed for how controllable systems of different complexity can be, one that ideally can be analysed rigorously.
One such model is to study families of binary functions parametrized by their circuit complexity and their “hashiness” (how much they mix information) as an analog for the AGI and the alignment model, and the limits to finding predicates that allow the alignment system to keep the AGI analog producing a desired output.”
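To give a feel for what such a toy model could look like, here is a rough, throwaway sketch I put together myself – it is not Anders’ formalization (which, as said, doesn’t exist in notation yet). The function names, the bit widths, and the sha256-based “mixing” knob are my own stand-ins: a boolean function mixes some number of its input bits into one output bit, and a “simple controller” is only allowed to clamp a few input bits while trying to force the output to a desired value.

```python
# Throwaway sketch of the kind of toy model described above -- NOT Anders'
# formalization. The names, bit widths, and the sha256-based "mixing" knob
# are my own stand-ins.
import hashlib
import itertools
import random
from functools import lru_cache

N_BITS = 12  # input width of the toy "AGI analog" (single output bit)

def make_toy_function(n_mixed_bits: int):
    """Boolean function whose output bit hashes only the lowest n_mixed_bits
    of the input -- a crude knob for how much information gets mixed."""
    mask = (1 << n_mixed_bits) - 1

    @lru_cache(maxsize=None)
    def core(y: int) -> int:
        return hashlib.sha256(y.to_bytes(4, "little")).digest()[0] & 1

    return lambda x: core(x & mask)

def avalanche(f, samples: int = 2000) -> float:
    """Estimate P(output flips | one random input bit flips): a rough hashiness proxy."""
    rng = random.Random(0)
    flips = 0
    for _ in range(samples):
        x = rng.getrandbits(N_BITS)
        flips += f(x) != f(x ^ (1 << rng.randrange(N_BITS)))
    return flips / samples

def best_k_bit_clamp(f, k: int = 3) -> float:
    """Best reliability a 'simple controller' gets by clamping only k input bits:
    max over bit positions and clamp values of P(f(x) = 1 | clamp), computed exactly."""
    best = 0.0
    for positions in itertools.combinations(range(N_BITS), k):
        free = [p for p in range(N_BITS) if p not in positions]
        completions = [sum(((a >> i) & 1) << p for i, p in enumerate(free))
                       for a in range(1 << len(free))]
        for values in itertools.product([0, 1], repeat=k):
            clamp = sum(v << p for p, v in zip(positions, values))
            hits = sum(f(clamp | y) for y in completions)
            best = max(best, hits / len(completions))
    return best

# Typically: little mixing -> some small clamp pins the output to the desired
# value; full mixing -> no small clamp does much better than chance.
for n_mixed in (3, N_BITS):
    f = make_toy_function(n_mixed)
    print(f"mixed bits={n_mixed:2d}  avalanche~{avalanche(f):.2f}  "
          f"best 3-bit clamp={best_k_bit_clamp(f):.2f}")
```

The pattern this is meant to gesture at: the more of its input the function mixes into the output, the less any small, fixed predicate can reliably pin that output – which is the kind of limit on controlling predicates the summary points to.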
Garrabrant induction shows one way of doing self-referential reasoning.
We’re talking about learning from inputs received from a more complex environment (through which AGI outputs also propagate as changing effects, some of which are received back as inputs).
Does Garrabrant take that into account in his self-referential reasoning?
As an analogy: Use something more like democracy than like dictatorship, such that any one person going crazy can’t destroy the world/country, as a crazy dictator would.
A human democracy is composed of humans with similar needs. This turns out to be an essential difference.
How about I assume there is some epsilon such that the probability of an agent going off the rails is greater than epsilon in any given year. Why can’t the agent split into multiple ~uncorrelated agents and have them each control some fraction of resources (maybe space) such that one off-the-rails agent can easily be fought and controlled by the others? This should reduce the risk to some fraction of epsilon, right?
(I’m gonna try and stay focused on a single point, specifically the argument that leads up to >99%, because that part seems wrong for quite simple reasons).
How about I assume there is some epsilon such that the probability of an agent going off the rails
Got it. So we are both assuming that there would be some accumulative failure rate [per point 3.].
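To make that explicit, here is a minimal sketch of the arithmetic I take us both to be gesturing at – the epsilon value, the time horizon, and the independence assumptions are placeholders, not claims:

```python
# Minimal sketch of the accumulation argument; epsilon, the horizon, and the
# independence assumptions are placeholders, not claims.
epsilon = 0.01  # assumed lower bound on P(an agent goes off the rails) per year

def p_any_failure(eps: float, years: int) -> float:
    """P(at least one off-the-rails event over the horizon), independent years."""
    return 1 - (1 - eps) ** years

def p_uncontained_failure(eps: float, k: int, years: int) -> float:
    """The proposed split: k ~uncorrelated agents, where a single failure is
    assumed to be contained by the rest, so only simultaneous failures count."""
    return 1 - (1 - eps ** k) ** years

print(p_any_failure(epsilon, 500))             # ~0.99: the accumulation toward >99%
print(p_uncontained_failure(epsilon, 3, 500))  # ~0.0005: "some fraction of epsilon"
```

Whether that second calculation is actually available is what the problems below are about.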
Why can’t the agent split into multiple ~uncorrelated agents and have them each control some fraction of resources (maybe space) such that one off-the-rails agent can easily be fought and controlled by the others?
I tried to adopt this ~uncorrelated agents framing, and then argue from within that. But I ran up against some problems with this framing:
It assumes there are stable boundaries between “agents” that allow us to mark them as separate entities. This kinda works for us as physically bounded and communication-bottlenecked humans, but in practice it wouldn’t really work to define “agent” separations within a larger machine network maintaining its own existence in the environment. (Also, it is not clear to me how failures of those defined “agent” subsets would necessarily be sufficiently uncorrelated – for example, if the failure involves one subset hijacking the functioning of another subset, their failures become correlated.)
It assumes that if any (physical or functional) subset of this adaptive machinery happens to gain an edge in influencing the distributed flows of atoms and energy back toward its own growth, the other machinery subsets can robustly “control” for that.
It assumes a macroscale explanation of physical processes that build up from the microscale. Agreed that the concept of agents owning and directing the allocation of “resources” is a useful abstraction, but it also involves holding a leaky representation of what’s going on. Any argument for control that relies on that representation can turn out not to capture crucial aspects.
It raises the question of what “off-the-rails” means here. This gets us into the hashiness model: Consider the space of possible machinery output sequences over time. How large is the subset of output sequences that, in their propagation as (cascading) environmental effects, would end up lethally disrupting the bodily functioning of humans? How is the accumulative probability of human extinction distributed across the entire output possibility space (or, simplified: how mixed are the adjoining lethal and non-lethal possibility subspaces)? Can any necessarily less complex control system connected with/in this machinery actually keep tracking whether possible machinery outputs fall into the lethal subspace or the non-lethal subspace? (A rough sketch of this last question follows below.)
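As a rough companion sketch of that last question (again my own toy setup, not something from the hashiness write-up): a “monitor” that can observe only a few of the output bits tries to classify outputs as lethal vs. non-lethal, for a lethal subset that is either a coarse condition on a few bits or maximally mixed across all bits.

```python
# Rough companion sketch (my own toy setup): a "monitor" that observes only K
# of the N output bits tries to classify outputs as lethal / non-lethal.
import itertools
import random

N, K, SAMPLES = 16, 3, 2000

lethal_coarse = lambda x: (x & 0b111) == 0b111        # depends on 3 fixed bits
lethal_mixed = lambda x: bin(x).count("1") % 2 == 1   # parity of all N bits

def best_monitor_accuracy(lethal) -> float:
    """Best accuracy over all monitors that watch K fixed bit positions and
    answer with the majority label seen for each observed pattern
    (measured on the same samples, so slightly optimistic)."""
    rng = random.Random(0)
    xs = [rng.getrandbits(N) for _ in range(SAMPLES)]
    labels = [int(lethal(x)) for x in xs]
    best = 0.0
    for pos in itertools.combinations(range(N), K):
        counts = {}  # observed pattern -> [non-lethal count, lethal count]
        for x, label in zip(xs, labels):
            pattern = tuple((x >> p) & 1 for p in pos)
            counts.setdefault(pattern, [0, 0])[label] += 1
        correct = sum(max(c) for c in counts.values())
        best = max(best, correct / SAMPLES)
    return best

# The coarse lethal set gets tracked perfectly; the fully mixed one stays near
# chance, however the K observed bits are chosen.
print(round(best_monitor_accuracy(lethal_coarse), 2),
      round(best_monitor_accuracy(lethal_mixed), 2))
```

The point of the contrast: whether a bounded monitor can keep tracking the lethal subspace depends on how mixed that subspace is relative to what the monitor can observe.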
→ Do those problems make sense to you as stated? Do you notice anything missing there?
To sum it up, you and I are still talking about a control system [per point 4.]:
However you define the autonomous “agents”, they are still implemented as code running across connected hardware.
There are limits to the capacity of this aggregate machinery to sense, model, simulate, evaluate, and correct its own components’ effects as they propagate through a larger environment.
I’m gonna try and stay focused on a single point, specifically the argument that leads up to >99%
I’m also for now leaving aside substrate-needs convergence [point 5]:
That the entire population of nested/connected machine components would be pulled toward a human-lethal attractor state.
I appreciate that you tried. If words are failing us to this extent, I’m going to give up.