When you say failures will “build up toward lethality at some unknown rate”, why would failures build up toward lethality? We have lots of automated systems e.g. semiconductor factories, and failures do not accumulate until everyone at the factory dies, because humans and automated systems can notice errors and correct them.
Let’s take your example of semiconductor factories.
There are several ways to think about failures here. For one, we can talk about local failures in the production of the semiconductor chips. These, especially, will get noticed and corrected.
A less common way to talk about factory failures is when workers in the factories die or are physically incapacitated as a result of operations, e.g. because of chemical leaks or a robot striking them. Usually when this happens, the factories can keep operating and existing; the expendable workers are simply replaced with new workers.
Of course, if too many workers die, other workers will decide not to work at those factories. Running the factories has to not be too damaging to the health of the internal human workers, in any of the many (indirect) ways that operations could turn out to be damaging.
The same goes for humans contributing to the surrounding infrastructure needed to maintain the existence of these sophisticated factories – all the building construction, all the machine parts, all the raw materials, all the needed energy supplies, and so on. If you trace the relevant upstream and downstream transactions, it turns out that a non-tiny portion of the entire human economy is supporting the existence of these semiconductor factories one way or another. It took a modern industrial cross-continental economy to even make e.g. TSMC’s factories viable.
The human economy acts as a forcing function constraining what semiconductor factories can be. There are many, many ways to incapacitate complex multi-celled cooperative organisms like us. So the semiconductor factories that humans are maintaining today ended up being constrained to those that for the most part do not trigger those pathways downstream.
Some of that is because humans went through the effort of noticing errors explicitly and then correcting them, or designing automated systems to do likewise. But the invisible hand of the market considered broadly – as constituted of humans with skin in the game, making often intuitive choices – will in effect force semiconductor factories to be not too damaging to the surrounding humans maintaining the needed infrastructure.
With AGI, you lose that forcing function.
Let’s take AGI to be machinery that is autonomous enough to at least automate all the jobs needed to maintain its own existence. Then AGI is no longer dependent on an economy of working humans to maintain its own existence. AGI would be displacing the human economy. As a hypothetical example: AGI is what you’d get if those semiconductor factories producing microchips expanded into producing the servers and robots that use those microchips, and those servers and robots somehow learned to design themselves to operate the factories and all the factory-needed infrastructure autonomously.
Then there is one forcing function left: the machine operation of control mechanisms, i.e. mechanisms that detect, model, simulate, evaluate, and correct downstream effects in order to keep AGI safe.
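For concreteness, here is a minimal sketch of what such a control loop could look like in the abstract. The interface, names, and severity scale are my own illustrative assumptions, not a proposed design; it only names the moving parts (detect, model/simulate, evaluate, correct).

```python
# Minimal sketch of an abstract control loop over an AGI's downstream effects.
# The names, types, and severity scale here are illustrative assumptions only.

from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Effect:
    description: str
    severity: float  # estimated harm if left uncorrected, on some chosen scale


def control_loop(
    detect: Callable[[], Iterable[Effect]],      # sense downstream effects
    simulate: Callable[[Effect], List[Effect]],  # model how an effect propagates
    evaluate: Callable[[Effect], float],         # score an effect against a safety bound
    correct: Callable[[Effect], None],           # intervene on the originating output
    severity_bound: float,
) -> None:
    """One pass of detect -> model/simulate -> evaluate -> correct."""
    for effect in detect():
        for downstream in simulate(effect):
            if evaluate(downstream) > severity_bound:
                correct(effect)
                break


if __name__ == "__main__":
    # Tiny usage example with stand-in functions.
    control_loop(
        detect=lambda: [Effect("chemical leak near cooling unit", 0.2)],
        simulate=lambda e: [Effect(e.description + " -> groundwater", e.severity * 3)],
        evaluate=lambda e: e.severity,
        correct=lambda e: print(f"correcting output that caused: {e.description}"),
        severity_bound=0.5,
    )
```

The argument that follows is about whether any such loop, being necessarily less complex than the machinery whose effects it has to track, can actually keep up; the sketch only fixes the vocabulary.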
The question becomes: can we rely on control mechanisms alone to keep AGI safe? That question raises other questions.
E.g. as relevant to the hashiness model: “Consider the space of possible machinery output sequences over time. How large is the subset of output sequences that, in their propagation as (cascading) environmental effects, would end up lethally disrupting the bodily functioning of humans? How is the cumulative probability of human extinction distributed across the entire output possibility space (or, simplified: how mixed are the adjoining lethal and non-lethal possibility subspaces)? Can any necessarily less complex control system connected with/in this machinery actually keep tracking whether possible machinery outputs fall into the lethal subspace or the non-lethal subspace?”
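To make those questions more concrete, here is a minimal toy sketch of the intuition as I read it (the setup, parameters, and names are my own illustrative assumptions, not Forrest’s model). It treats machinery outputs as bitstrings, uses a cryptographic hash as a stand-in for a maximally mixed mapping from outputs to lethal or non-lethal downstream effects, and checks how well a strictly simpler controller (one that can only condition on a short prefix of each output) can track which subspace an output falls into.

```python
# Toy sketch of the "hashiness" question. Illustrative assumptions only:
# outputs are n-bit strings, and whether an output's cascading environmental
# effects end up "lethal" is modelled as a hash of the full string, so the
# lethal and non-lethal subspaces are maximally intermixed. A strictly less
# complex "controller" (it can only condition on the first k < n bits) tries
# to predict which subspace an output falls into.

import hashlib
import random

N_BITS = 64          # complexity of the machinery's output sequences
CONTROLLER_BITS = 8  # the simpler controller only sees this prefix
SAMPLES = 20_000


def lethal(output_bits: str) -> bool:
    """Stand-in for the environment: a hash-like, maximally mixed mapping
    from full output sequences to lethal / non-lethal downstream effects."""
    return hashlib.sha256(output_bits.encode()).digest()[0] % 2 == 0


def random_output() -> str:
    return "".join(random.choice("01") for _ in range(N_BITS))


random.seed(0)

# "Training": the controller memorises the majority outcome for each prefix it sees.
counts = {}
for _ in range(SAMPLES):
    out = random_output()
    prefix = out[:CONTROLLER_BITS]
    lethal_count, total = counts.get(prefix, (0, 0))
    counts[prefix] = (lethal_count + lethal(out), total + 1)
lookup = {p: lc * 2 > t for p, (lc, t) in counts.items()}

# Evaluation: how often does the simpler controller track the right subspace?
correct = 0
for _ in range(SAMPLES):
    out = random_output()
    prediction = lookup.get(out[:CONTROLLER_BITS], False)
    correct += prediction == lethal(out)

print(f"controller accuracy: {correct / SAMPLES:.3f}  (chance level is 0.5)")
```

In this deliberately extreme setup the controller’s accuracy stays at chance, no matter how much data it memorises per visible prefix, because the lethal and non-lethal subspaces are fully intermixed relative to what it can observe. The open question the quoted passage is pointing at is how close real AGI-to-environment mappings come to this hashed extreme, and how much better any feasible control system can do.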
This is pretty similar to Hendrycks’s natural selection argument, but with the additional piece that the goals of AIs will converge to optimizing the environment for the survival of silicon-based life.
There are some ways to expand Hendrycks’ argument to make it more comprehensive:
Consider evolutionary selection at the more fundamental level of physical component interactions, i.e. not just at the macro level of agents competing for resources, since that is a leaky abstraction that can easily fail to capture underlying vectors of change.
Consider not only the selection of local variations (i.e. mutations) that introduce new functionality, but also the selection of variants connecting up with surrounding units in ways that end up repurposing existing functionality.
Consider not only the concept of goals that are (able to be) explicitly tracked by the machinery itself, but also that of the implicit conditions needed by components, which end up being selected for in expressions across the environment (a toy sketch of this last point follows below).
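As a rough illustration of that last point, here is a toy selection model of my own construction (not from Hendrycks or Forrest; all names and numbers are assumptions). No variant has or tracks a goal; each variant merely needs some environmental condition to persist and nudges the shared environment as a side effect of operating. Over generations, the population ends up concentrated on variants whose implicit needed conditions match the environment that the machinery’s own aggregate side effects produce.

```python
# Toy model: selection on implicit needed conditions, with no explicit goals.
# All structure and parameters here are illustrative assumptions.

import random

random.seed(1)

POP_SIZE = 200
GENERATIONS = 100


class Variant:
    def __init__(self, needed_env: float, env_push: float):
        self.needed_env = needed_env  # implicit condition this variant needs to persist
        self.env_push = env_push      # side effect it exerts on the shared environment


def survival_prob(variant: Variant, env: float) -> float:
    # The closer the environment sits to the needed condition, the better the variant persists.
    return max(0.0, 1.0 - abs(env - variant.needed_env))


# Start with a diverse population and a mid-range environment.
population = [Variant(random.uniform(0.0, 1.0), random.uniform(-0.05, 0.05))
              for _ in range(POP_SIZE)]
env = 0.5

for _ in range(GENERATIONS):
    # The environment drifts according to the aggregate side effects of operation.
    env += sum(v.env_push for v in population) / len(population)
    env = min(1.0, max(0.0, env))

    # Selection: persistence depends only on the implicitly needed condition.
    survivors = [v for v in population if random.random() < survival_prob(v, env)]
    if not survivors:          # guard against a wipe-out in this toy setting
        survivors = population

    # Reproduction with small variation, keeping the population size constant.
    population = [Variant(p.needed_env + random.gauss(0, 0.01),
                          p.env_push + random.gauss(0, 0.005))
                  for p in random.choices(survivors, k=POP_SIZE)]

mean_need = sum(v.needed_env for v in population) / len(population)
print(f"environment: {env:.2f}, mean implicitly needed condition: {mean_need:.2f}")
```

Nothing in this toy decides to optimise anything, yet which implicit conditions persist is still determined by selection across the environment the components themselves shape. That is the kind of selection dynamic the expanded argument asks us to track, at the level of physical component interactions rather than agent-level goals.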
Evolutionary arguments are notoriously tricky, and respected scientists get them wrong all the time.
This is why we need to take extra care in modelling how evolution – as a kind of algorithm – would apply across the physical signalling pathways of AGI.
I might share a gears-level explanation that Forrest just gave in response to your comment.
The claims made will feel unfamiliar, and so will the reasoning paths. I suggest (again) taking the time to consider what is meant. If a conclusion looks intuitively wrong from some AI Safety perspective, it may be valuable to explicitly consider the argumentation and premises behind that conclusion.
Noting that there has been no response here since we addressed the superficial critiques and moved on to discussing the actual argument.
For those few interested in the questions raised above, Forrest wrote some responses: http://69.27.64.19/ai_alignment_1/d_241016_recap_gen.html