claiming to have a full mathematical proof that safe AI is impossible,
I have never claimed that there is a mathematical proof. I have claimed that the researcher I work with has done their own reasoning in formal analytical notation (just not maths). Also, that based on his argument – which I probed and have explained here as carefully as I can – AGI cannot be controlled enough to stay safe, and actually converges on extinction.
That researcher is now collaborating with Anders Sandberg to formalise an elegant model of AGI uncontainability in mathematical notation.
I’m kinda pointing out the obvious here, but if the researcher was a crank, why would Anders be working with them?
claiming the “proof” uses mathematical arguments from Godel’s theorem, Galois Theory,
Nope, I haven’t claimed either of those.
The claim is that the argument is based on showing a limited extent of control (where control means keeping effects consistently in line with reference values).
The form of the reasoning there shares some underlying correspondences with how Gödel’s incompleteness theorems (concluding there is a limit to deriving a logical result within a formal axiomatic system) and Galois Theory (concluding that there is a limited scope of application of an algebraic tool) are reasoned through.
^– This is a pedagogical device. It helps researchers already acquainted with Gödel’s theorems or Galois Theory to understand roughly what kind of reasoning we’re talking about.
inexplicably formatted as a poem
Do you mean the fact that the researcher splits his sentences’ constituent parts into separate lines so that claims are more carefully parsable?
That is a format for analysis, not a poem format.
While certainly unconventional, it is not a reason to dismiss the rigour of someone’s analysis.
If you look at that exchange, I and the researcher I was working with were writing specific and carefully explained responses.
Paul had zoned in on a statement of the conclusion, misinterpreted what was meant, and then moved on to dismissing the entire project. Doing this was not epistemically humble.
But I think that as rationalists, one of our core skills should be to condemn actual crankery and all of its influences
When accusing someone of crankery (which is a big deal) it is important not to fall into making vague hand-wavey statements yourself.
You are making vague hand-wavey (and also inaccurate) statements above. Insinuating that something is “science-babble” doesn’t do anything. Calling an essay formatted as shorter lines a “poem” doesn’t do anything.
superficially agree with the conclusions from actually good arguments.
Unlike Anders – who examined the insufficient controllability part of the argument – you are not in a position to judge whether this argument is a good argument or not.
Read the core argument please (eg. summarised in points 3-5 above) and tell me where you think premises are unsound or the logic does not follow from the premises.
It is not enough to say ‘as a rationalist’. You got to walk the talk.
I agree that with superficial observations, I can’t conclusively demonstrate that something is devoid of intellectual value. However, the nonstandard use of words like “proof” is a strong negative signal on someone’s work.
If someone wants to demonstrate a scientific fact, the burden of proof is on them to communicate this in some clear and standard way, because a basic strategy of anyone practicing pseudoscience is to spend lots of time writing something inscrutable that ends in some conclusion, then claim that no one can disprove it and anyone who thinks it’s invalid is misunderstanding something inscrutable.
This problem is exacerbated when someone bases their work on original philosophy. To understand Forrest Landry’s work to his satisfaction someone will have to understand his 517-page book An Immanent Metaphysics, which uses words like “proof”, “theorem”, “conjugate”, “axiom”, and “omniscient” in a nonstandard sense, and also probably requires someone to have a background in metaphysics. I scanned the 134-page version, couldn’t make any sense of it, and found several concrete statements that sound wrong. I read about 50 pages of various articles on the website and found them to be reasonably coherent but often oddly worded and misusing words like entropy, with the same content quality as a ~10 karma LW post but super overconfident.
That researcher is now collaborating with Anders Sandberg to formalise an elegant model of AGI uncontainability in mathematical notation.
Ok. To be clear I don’t expect any Landry and Sandberg paper that comes out of this collaboration to be crankery. Having read the research proposal my guess is that they will prove something roughly like the Good Regulator Theorem or Rice’s theorem which will be slightly relevant to AI but not super relevant because the premises are too strong, like the average item in Yampolskiy’s list of impossibility proofs (I can give examples if you want of why these are not conclusive).
I’m not saying we should discard all reasoning by someone that claims an informal argument is a proof, but rather stop taking their claims of “proofs” at face value without seeing more solid arguments.
claiming the “proof” uses mathematical arguments from Godel’s theorem, Galois Theory,
Nope, I haven’t claimed either of those.
Fair enough. I can’t verify this because Wayback Machine is having trouble displaying the relevant content though.
Paul had zoned in on a statement of the conclusion, misinterpreted what was meant, and then moved on to dismissing the entire project. Doing this was not epistemically humble.
Paul expressed appropriate uncertainty. What is he supposed to do, say “I see several red flags, but I don’t have time to read a 517-page metaphysics book, so I’m still radically uncertain whether this is a crank or the next Kurt Godel”?
Read the core argument please (eg. summarised in points 3-5 above) and tell me where you think premises are unsound or the logic does not follow from the premises.
When you say failures will “build up toward lethality at some unknown rate”, why would failures build up toward lethality? We have lots of automated systems e.g. semiconductor factories, and failures do not accumulate until everyone at the factory dies, because humans and automated systems can notice errors and correct them.
Variants get evolutionarily selected for how they function across the various contexts they encounter over time. [...] The artificial population therefore converges on fulfilling their own expanding needs.
This is pretty similar to Hendrycks’s natural selection argument, but with the additional piece that the goals of AIs will converge to optimizing the environment for the survival of silicon-based life. He claims that there are various ways to counter evolutionary pressures, like “carefully designing AI agents’ intrinsic motivations, introducing constraints on their actions, and institutions that encourage cooperation”. In the presence of ways to change incentives such that benign AI systems get higher fitness, I don’t think you can get to 99% confidence. Evolutionary arguments are notoriously tricky and respected scientists get them wrong all the time, from Malthus to evolutionary psychology to the group selectionists.
I agree that with superficial observations, I can’t conclusively demonstrate that something is devoid of intellectual value.
Thanks for recognising this, and for taking some time now to consider the argument.
However, the nonstandard use of words like “proof” is a strong negative signal on someone’s work.
Yes, this made us move away from using the term “proof”, and instead write “formal reasoning”.
Most proofs nowadays are done using mathematical notation. So it is understandable that when people read “proof”, they automatically think “mathematical proof”.
Having said that, there are plenty of examples of proofs done in formal analytic notation that is not mathematical notation. See eg. formal verification practices in the software and hardware industries, or various branches of analytical philosophy.
If someone wants to demonstrate a scientific fact, the burden of proof is on them to communicate this in some clear and standard way
Yes, much of the effort has been to translate argument parts in terms more standard for the alignment community.
What we cannot expect is that the formal reasoning is conceptually familiar and low-inferential distance. That would actually be surprising – why then has someone inside the community not already derived the result in the last 20 years?
The reasoning is going to be as complicated as it has to be to reason things through.
This problem is exacerbated when someone bases their work on original philosophy. To understand Forrest Landry’s work to his satisfaction someone will have to understand his 517-page book An Immanent Metaphysics
Cool that you took a look at his work. Forrest’s use of terms is meant to approximate everyday use of those terms, but the underlying philosophy is notoriously complicated.
Jim Rutt is an ex-chair of Santa Fe Institute who defaults to being skeptical of metaphysics proposals (funny quote he repeats: “when someone mentions metaphysics, I reach for my pistol”). But Jim ended up reading Forrest’s book and it passed his B.S. detector. So he invited Forrest over to his podcast for a three-part interview. Even if you listen to that though, I don’t expect you to immediately come away understanding the conceptual relations.
So here is a problem that you and I are both seeing:
There is this polymath who is clearly smart and recognised for some of his intellectual contributions (by interviewers like Rutt, or co-authors like Anders).
But what this polymath claims to be using as the most fundamental basis for his analysis would take too much time to work through.
So then if this polymath claims to have derived a proof by contradiction – concluding that long-term AGI safety is not possible – then it is intractable for alignment researchers to verify the reasoning using his formal notation and his conceptual framework. That would be asking for too much – if he had insisted on that, I agree that would have been a big red flag signalling crankery.
The obvious move then is for some people to work with the polymath to translate his reasoning to a basis of analysis that alignment researchers agree is a sound basis to reason from. And to translate to terms/concepts people are familiar with. Also, the chain of reasoning should not be so long that busy researchers never end up reading through, but also not so short that you either end up having to use abstractions readers are unfamiliar with, or open up unaddressed gaps in the reasoning. Etc.
The problem becomes finding people who are both willing and available to do that work. One person is probably not enough.
Having read the research proposal my guess is that they will prove something roughly like the Good Regulator Theorem or Rice’s theorem
Both are useful theorems, which have specific conclusions that demonstrate that there are at least some limits to control.
(ie. the Good Regulator Theorem demonstrates a limit to a system’s capacity to model – or internally functionally represent – the statespace of some more complex super-system. Rice’s theorem demonstrates a particular limit to having some general algorithm predict a behavioural property of other algorithms.)
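For readers who have not seen why Rice’s theorem gives that kind of limit, here is a minimal sketch of the standard reduction behind it. This is the textbook argument, not anything taken from the research proposal. The names `make_gadget` and `halts_via` are made up for illustration, and the property “returns 0 on every input” is just one example of a behavioural property.

```python
from typing import Callable


def make_gadget(program: Callable[[int], object],
                program_input: int) -> Callable[[int], int]:
    """Build a function that returns 0 on every input
    if and only if program(program_input) halts."""
    def gadget(x: int) -> int:
        program(program_input)  # never returns if the program loops forever
        return 0
    return gadget


def halts_via(decides_always_zero: Callable[[Callable[[int], int]], bool],
              program: Callable[[int], object],
              program_input: int) -> bool:
    """Turn a (hypothetical) decider for the behavioural property
    'returns 0 on every input' into a decider for halting."""
    return decides_always_zero(make_gadget(program, program_input))
```

If a general `decides_always_zero` existed that always terminated with the right answer, `halts_via` would solve the halting problem, which is known to be impossible. The same wrapping trick works for any non-trivial behavioural property, which is roughly what Rice’s theorem says. Note that it only rules out a fully general, always-correct predictor; it says nothing about predictors for restricted classes of programs.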
The hashiness model is a tool meant for demonstrating under conservative assumptions – eg. of how far from cryptographically hashy the algorithm run through ‘AGI’ is, and how targetable human-safe ecosystem conditions are – that AGI would be uncontainable. With “uncontainable”, I mean that no available control system connected with/in AGI could constrain the possibility space of AGI’s output sequences enough over time such that the (cascading) environmental effects do not lethally disrupt the bodily functioning of humans.
Paul expressed appropriate uncertainty. What is he supposed to...say...?
I can see Paul tried expressing uncertainty by adding “probably” to his claim of how the entire scientific community (not sure what this means) would interpret that one essay.
To me, it seemed his commentary was missing some meta-uncertainty. Something like “I just did some light reading. Based on how it’s stated in this essay, I feel confident it makes no sense for me to engage further with the argument. However, maybe other researchers would find it valuable to spend more time engaging with the argument after going through this essay or some other presentation of the argument.”
~
That covers your comments re: communicating the argument in a form that can be verified by the community.
Let me cook dinner, and then respond to your last two comments to dig into the argument itself. EDIT: am writing now, will respond tomorrow.
When you say failures will “build up toward lethality at some unknown rate”, why would failures build up toward lethality? We have lots of automated systems e.g. semiconductor factories, and failures do not accumulate until everyone at the factory dies, because humans and automated systems can notice errors and correct them.
Let’s take your example of semiconductor factories.
There are several ways to think about failures here. For one, we can talk about local failures in the production of the semiconductor chips. These especially will get corrected for.
A less common way to talk about factory failures is when workers working in the factories die or are physically incapacitated as a result, eg. because of chemical leaks or some robot hitting them. Usually when this happens, the factories can keep operating and existing. Just replace the expendable workers with new workers.
Of course, if too many workers die, other workers will decide to not work at those factories. Running the factories has to not be too damaging to the health of the internal human workers, in any of the many (indirect) ways that operations could turn out to be damaging.
The same goes for humans contributing to the surrounding infrastructure needed to maintain the existence of these sophisticated factories – all the building construction, all the machine parts, all the raw materials, all the needed energy supplies, and so on. If you try overseeing the relevant upstream and downstream transactions, it turns out that a non-tiny portion of the entire human economy is supporting the existence of these semiconductor factories one way or another. It took a modern industrial cross-continental economy to even make eg. TSMC’s factories viable.
The human economy acts as a forcing function constraining what semiconductor factories can be. There are many, many ways to incapacitate complex multi-celled cooperative organisms like us. So the semiconductor factories that humans are maintaining today ended up being constrained to those that for the most part do not trigger those pathways downstream.
Some of that is because humans went through the effort of noticing errors explicitly and then correcting them, or designing automated systems to do likewise. But the invisible hand of the market considered broadly – as consisting of humans with skin in the game, making often intuitive choices – will actually just force semiconductor factories to be not too damaging to surrounding humans maintaining the needed infrastructure.
With AGI, you lose that forcing function.
Let’s take AGI to be machinery that is autonomous enough to at least automate all the jobs needed to maintain its own existence. Then AGI is no longer dependent on an economy of working humans to maintain its own existence. AGI would be displacing the human economy – as a hypothetical example, AGI is what you’d get if those semiconductor factories producing microchips expanded to producing servers and robots using those microchips that in turn learn somehow to design themselves to operate the factories and all the factory-needed infrastructure autonomously.
Then there is one forcing function left: the machine operation of control mechanisms. Ie. mechanisms that detect, model, simulate, evaluate, and correct downstream effects in order to keep AGI safe.
The question becomes – Can we rely on only control mechanisms to keep AGI safe? That question raises other questions.
E.g. as relevant to the hashiness model: “Consider the space of possible machinery output sequences over time. How large is the subset of output sequences that in their propagation as (cascading) environmental effects would end up lethally disrupting the bodily functioning of humans? How is the accumulative probability of human extinction distributed across the entire output possibility space (or simplified: how mixed are the adjoining lethal and non-lethal possibility subspaces)? Can any necessarily less complex control system connected with/in this machinery actually keep tracking whether possible machinery outputs fall into the lethal sub-space or the non-lethal sub-space?”
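To make “how mixed are the adjoining lethal and non-lethal possibility subspaces” a bit more concrete, here is a toy illustration. To be clear, this is not the hashiness model itself. The 12-bit output space, both labelling functions, and the “flip rate” measure are all assumptions made up for this sketch. It only shows the difference between a label that splits the space into two contiguous halves and a label that behaves like one bit of a cryptographic hash.

```python
import hashlib
from typing import Callable

BITS = 12  # toy output space: all 12-bit integers (an assumption of this sketch)


def smooth_label(x: int) -> bool:
    """Toy 'lethal' region that is one contiguous half of the space."""
    return x >= 2 ** (BITS - 1)


def hashy_label(x: int) -> bool:
    """Toy 'lethal' membership decided by one bit of a SHA-256 digest."""
    digest = hashlib.sha256(x.to_bytes(2, "big")).digest()
    return bool(digest[0] & 1)


def neighbour_flip_rate(label: Callable[[int], bool]) -> float:
    """Fraction of single-bit changes to an output that flip its label."""
    labels = [label(x) for x in range(2 ** BITS)]
    flips = 0
    total = 0
    for x in range(2 ** BITS):
        for bit in range(BITS):
            neighbour = x ^ (1 << bit)  # output differing from x in one bit
            flips += labels[x] != labels[neighbour]
            total += 1
    return flips / total
```

With the contiguous label, only a change to the top bit moves an output across the boundary, so the flip rate is 1/12. With the hash-based label, roughly half of all one-bit changes cross the boundary, so knowing the label of a nearby output tells you almost nothing. Where real machinery outputs would sit between those two extremes is exactly the open question being asked above.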
This is pretty similar to Hendrycks’s natural selection argument, but with the additional piece that the goals of AIs will converge to optimizing the environment for the survival of silicon-based life.
There are some ways to expand Hendrycks’ argument to make it more comprehensive:
Consider evolutionary selection at the more fundamental level of physical component interactions. Ie. not just at the macro level of agents competing for resources, since this is a leaky abstraction that can easily fail to capture underlying vectors of change.
Consider not only selection of local variations (ie. mutations) that introduce new functionality, but also the selection of variants connecting up with surrounding units in ways that end up repurposing existing functionality.
Consider not only the concept of goals that are (able to be) explicitly tracked by the machinery itself, but also that of the implicit conditions needed by components which end up being selected for in expressions across the environment.
Evolutionary arguments are notoriously tricky and respected scientists get them wrong all the time
This is why we need to take extra care in modelling how evolution – as a kind of algorithm – would apply across the physical signalling pathways of AGI.
I might share a gears-level explanation that Forrest just gave in response to your comment.
Noticing no response here after we addressed superficial critiques and moved to discussing the actual argument.
For those few interested in questions raised above, Forrest wrote some responses: http://69.27.64.19/ai_alignment_1/d_241016_recap_gen.html
The claims made will feel unfamiliar and the reasoning paths too. I suggest (again) taking the time to consider what is meant. If a conclusion looks intuitively wrong from some AI Safety perspective, it may be valuable to explicitly consider the argumentation and premises behind that.