I strongly wish you would not tie StopAI to the claim that extinction is >99% likely. It means that even your natural supporters in PauseAI will have to say “yes I broadly agree with them but disagree with their claims about extinction being certain.”
I would also echo the feedback here. There’s no reason to write in the same style as cranks.
It’s not just the writing that sounds like a crank. Core arguments that Remmelt endorses are AFAIK considered crankery by the community; with all the classic signs like
making up science-babble,
claiming to have a full mathematical proof that safe AI is impossible, despite not providing any formal mathematical reasoning
claiming the “proof” uses mathematical arguments from Godel’s theorem, Galois Theory, Rice’s Theorem
inexplicably formatted as a poem
Paul Christiano read some of this and concluded “the entire scientific community would probably consider this writing to be crankery”, which seems about accurate to me.
Now I don’t like or intend to make personal attacks. But I think that as rationalists, one of our core skills should be to condemn actual crankery and all of its influences, even when the conclusions of cranks and their collaborators superficially agree with the conclusions from actually good arguments.
claiming to have a full mathematical proof that safe AI is impossible,
I have never claimed that there is a mathematical proof. I have claimed that the researcher I work with has done their own reasoning in formal analytical notation (just not maths). Also, that based on his argument – which I probed and have explained here as carefully as I can – AGI cannot be controlled enough to stay safe, and actually converges on extinction.
That researcher is now collaborating with Anders Sandberg to formalise an elegant model of AGI uncontainability in mathematical notation.
I’m kinda pointing out the obvious here, but if the researcher was a crank, why would Anders be working with them?
claiming the “proof” uses mathematical arguments from Godel’s theorem, Galois Theory,
Nope, I haven’t claimed either of those.
The claim is that the argument is based on showing a limited extent of control (where ‘controlling’ means keeping effects consistently in line with reference values).
The form of the reasoning there shares some underlying correspondences with how Gödel’s incompleteness theorems (concluding there is a limit to deriving a logical result within a formal axiomatic system) and Galois Theory (concluding that there is a limited scope of application of an algebraic tool) are reasoned through.
^– This is a pedagogical device. It helps researchers already acquainted with Gödel’s theorems or Galois Theory to understand roughly what kind of reasoning we’re talking about.
inexplicably formatted as a poem
Do you mean the fact that the researcher splits his sentences’ constituent parts into separate lines so that claims are more carefully parsable?
That is a format for analysis, not a poem format.
While certainly unconventional, it is not a reason to dismiss the rigour of someone’s analysis.
If you look at that exchange, I and the researcher I was working with were writing specific and carefully explained responses.
Paul had zeroed in on a statement of the conclusion, misinterpreted what was meant, and then moved on to dismissing the entire project. Doing this was not epistemically humble.
But I think that as rationalists, one of our core skills should be to condemn actual crankery and all of its influences
When accusing someone of crankery (which is a big deal) it is important not to fall into making vague hand-wavey statements yourself.
You are making vague hand-wavey (and also inaccurate) statements above. Insinuating that something is “science-babble” doesn’t do anything. Calling an essay formatted as shorter lines a “poem” doesn’t do anything.
superficially agree with the conclusions from actually good arguments.
Unlike Anders – who examined the insufficient controllability part of the argument – you are not in a position to judge whether this argument is a good argument or not.
Read the core argument please (eg. summarised in point 3-5. above) and tell me where you think premises are unsound or the logic does not follow from the premises.
It is not enough to say ‘as a rationalist’. You’ve got to walk the talk.
I agree that with superficial observations, I can’t conclusively demonstrate that something is devoid of intellectual value. However, the nonstandard use of words like “proof” is a strong negative signal on someone’s work.
If someone wants to demonstrate a scientific fact, the burden of proof is on them to communicate this in some clear and standard way, because a basic strategy of anyone practicing pseudoscience is to spend lots of time writing something inscrutable that ends in some conclusion, then claim that no one can disprove it and anyone who thinks it’s invalid is misunderstanding something inscrutable.
This problem is exacerbated when someone bases their work on original philosophy. To understand Forrest Landry’s work to his satisfaction someone will have to understand his 517-page book An Immanent Metaphysics, which uses words like “proof”, “theorem”, “conjugate”, “axiom”, and “omniscient” in a nonstandard sense, and also probably requires someone to have a background in metaphysics. I scanned the 134-page version, can’t make any sense of it, and found several concrete statements that sound wrong. I read about 50 pages of various articles on the website and found them to be reasonably coherent but often oddly worded and misusing words like entropy, with the same content quality as a ~10 karma LW post but super overconfident.
That researcher is now collaborating with Anders Sandberg to formalise an elegant model of AGI uncontainability in mathematical notation.
Ok. To be clear I don’t expect any Landry and Sandberg paper that comes out of this collaboration to be crankery. Having read the research proposal my guess is that they will prove something roughly like the Good Regulator Theorem or Rice’s theorem which will be slightly relevant to AI but not super relevant because the premises are too strong, like the average item in Yampolskiy’s list of impossibility proofs (I can give examples if you want of why these are not conclusive).
I’m not saying we should discard all reasoning by someone that claims an informal argument is a proof, but rather stop taking their claims of “proofs” at face value without seeing more solid arguments.
claiming the “proof” uses mathematical arguments from Godel’s theorem, Galois Theory,
Nope, I haven’t claimed either of those.
Fair enough. I can’t verify this because Wayback Machine is having trouble displaying the relevant content though.
Paul had zeroed in on a statement of the conclusion, misinterpreted what was meant, and then moved on to dismissing the entire project. Doing this was not epistemically humble.
Paul expressed appropriate uncertainty. What is he supposed to do, say “I see several red flags, but I don’t have time to read a 517-page metaphysics book, so I’m still radically uncertain whether this is a crank or the next Kurt Godel”?
Read the core argument please (eg. summarised in point 3-5. above) and tell me where you think premises are unsound or the logic does not follow from the premises.
When you say failures will “build up toward lethality at some unknown rate”, why would failures build up toward lethality? We have lots of automated systems e.g. semiconductor factories, and failures do not accumulate until everyone at the factory dies, because humans and automated systems can notice errors and correct them.
Variants get evolutionarily selected for how they function across the various contexts they encounter over time. [...] The artificial population therefore converges on fulfilling their own expanding needs.
This is pretty similar to Hendrycks’s natural selection argument, but with the additional piece that the goals of AIs will converge to optimizing the environment for the survival of silicon-based life. He claims that there are various ways to counter evolutionary pressures, like “carefully designing AI agents’ intrinsic motivations, introducing constraints on their actions, and institutions that encourage cooperation”. In the presence of ways to change incentives such that benign AI systems get higher fitness, I don’t think you can get to 99% confidence. Evolutionary arguments are notoriously tricky and respected scientists get them wrong all the time, from Malthus to evolutionary psychology to the group selectionists.
I agree that with superficial observations, I can’t conclusively demonstrate that something is devoid of intellectual value.
Thanks for recognising this, and for taking some time now to consider the argument.
However, the nonstandard use of words like “proof” is a strong negative signal on someone’s work.
Yes, this made us move away from using the term “proof”, and instead write “formal reasoning”.
Most proofs nowadays are done using mathematical notation. So it is understandable that when people read “proof”, they automatically think “mathematical proof”.
Having said that, there are plenty of examples of proofs done in formal analytic notation that is not mathematical notation. See eg. formal verification practices in the software and hardware industries, or various branches of analytical philosophy.
If someone wants to demonstrate a scientific fact, the burden of proof is on them to communicate this in some clear and standard way
Yes, much of the effort has been to translate argument parts in terms more standard for the alignment community.
What we cannot expect is that the formal reasoning is conceptually familiar and low-inferential distance. That would actually be surprising – why then has someone inside the community not already derived the result in the last 20 years?
The reasoning is going to be as complicated as it has to be to reason things through.
This problem is exacerbated when someone bases their work on original philosophy. To understand Forrest Landry’s work to his satisfaction someone will have to understand his 517-page book An Immanent Metaphysics
Cool that you took a look at his work. Forrest’s use of terms is meant to approximate everyday use of those terms, but the underlying philosophy is notoriously complicated.
Jim Rutt is an ex-chair of Santa Fe Institute who defaults to being skeptical of metaphysics proposals (funny quote he repeats: “when someone mentions metaphysics, I reach for my pistol”). But Jim ended up reading Forrest’s book and it passed his B.S. detector. So he invited Forrest over to his podcast for a three-part interview. Even if you listen to that though, I don’t expect you to immediately come away understanding the conceptual relations.
So here is a problem that you and I are both seeing:
There is this polymath who is clearly smart and recognised for some of his intellectual contributions (by interviewers like Rutt, or co-authors like Anders).
But what this polymath claims to be using as the most fundamental basis for his analysis would take too much time to work through.
So then if this polymath claims to have derived a proof by contradiction – concluding that long-term AGI safety is not possible – then it is intractable for alignment researchers to verify the reasoning using his formal notation and his conceptual framework. That would be asking for too much – if he had insisted on that, I agree that would have been a big red flag signalling crankery.
The obvious move then is for some people to work with the polymath to translate his reasoning to a basis of analysis that alignment researchers agree is a sound basis to reason from. And to translate to terms/concepts people are familiar with. Also, the chain of reasoning should not be so long that busy researchers never end up reading through, but also not so short that you either end up having to use abstractions readers are unfamiliar with, or open up unaddressed gaps in the reasoning. Etc.
The problem becomes finding people who are both willing and available to do that work. One person is probably not enough.
Having read the research proposal my guess is that they will prove something roughly like the Good Regulator Theorem or Rice’s theorem
Both are useful theorems, which have specific conclusions that demonstrate that there are at least some limits to control.
(ie. the Good Regulator Theorem demonstrates a limit to a system’s capacity to model – or internally functionally represent – the state space of some more complex super-system. Rice’s Theorem demonstrates a particular limit to having some general algorithm predict a behavioural property of other algorithms.)
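For orientation, here are the standard textbook statements that these descriptions paraphrase (standard formulations only – not the notation used in the Landry–Sandberg work):

```latex
% Rice's Theorem: every non-trivial semantic property of programs is undecidable.
% Here \varphi_e is the partial computable function computed by program e, and
% \mathcal{PC} is the class of all partial computable functions.
\[
  \emptyset \neq S \subsetneq \mathcal{PC}
  \;\Longrightarrow\;
  \{\, e \mid \varphi_e \in S \,\}\ \text{is undecidable.}
\]
% Good Regulator Theorem (Conant & Ashby, 1970), stated informally: any regulator
% of a system that is maximally simple among the optimal regulators must be
% (isomorphic to) a model of the system it regulates.
```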
The hashiness model is a tool meant for demonstrating under conservative assumptions – eg. of how far from cryptographically hashy the algorithm run through ‘AGI’ is, and how targetable human-safe ecosystem conditions are – that AGI would be uncontainable. With “uncontainable”, I mean that no available control system connected with/in AGI could constrain the possibility space of AGI’s output sequences enough over time such that the (cascading) environmental effects do not lethally disrupt the bodily functioning of humans.
Paul expressed appropriate uncertainty. What is he supposed to...say...?
I can see Paul tried expressing uncertainty by adding “probably” to his claim of how the entire scientific community (not sure what this means) would interpret that one essay.
To me, it seemed his commentary was missing some meta-uncertainty. Something like “I just did some light reading. Based on how it’s stated in this essay, I feel confident it makes no sense for me to engage further with the argument. However, maybe other researchers would find it valuable to spend more time engaging with the argument after going through this essay or some other presentation of the argument.”
~ That covers your comments re: communicating the argument in a form that can be verified by the community.
Let me cook dinner, and then respond to your last two comments to dig into the argument itself. EDIT: am writing now, will respond tomorrow.
When you say failures will “build up toward lethality at some unknown rate”, why would failures build up toward lethality? We have lots of automated systems e.g. semiconductor factories, and failures do not accumulate until everyone at the factory dies, because humans and automated systems can notice errors and correct them.
Let’s take your example of semiconductor factories.
There are several ways to think about failures here. For one, we can talk about local failures in the production of the semiconductor chips. These especially will get corrected for.
A less common way to talk about factory failures is when workers in the factories die or are physically incapacitated as a result, eg. because of chemical leaks or some robot hitting them. Usually when this happens, the factories can keep operating and existing. Just replace the expendable workers with new workers.
Of course, if too many workers die, other workers will decide not to work at those factories. Running the factories has to not be too damaging to the health of the internal human workers, in any of the many (indirect) ways that operations could turn out to be damaging.
The same goes for humans contributing to the surrounding infrastructure needed to maintain the existence of these sophisticated factories – all the building construction, all the machine parts, all the raw materials, all the needed energy supplies, and so on. If you try overseeing the relevant upstream and downstream transactions, it turns out that a non-tiny portion of the entire human economy is supporting the existence of these semiconductor factories one way or another. It took a modern industrial cross-continental economy to even make eg. TSMC’s factories viable.
The human economy acts as a forcing function constraining what semiconductor factories can be. There are many, many ways to incapacitate complex multi-celled cooperative organisms like us. So the semiconductor factories that humans are maintaining today ended up being constrained to those that for the most part do not trigger those pathways downstream.
Some of that is because humans went through the effort of noticing errors explicitly and then correcting them, or designing automated systems to do likewise. But the invisible hand of the market considered broadly – as consisting of humans with skin in the game, making often intuitive choices – will actually just force semiconductor factories to be not too damaging to surrounding humans maintaining the needed infrastructure.
With AGI, you lose that forcing function.
Let’s take AGI to be machinery that is autonomous enough to at least automate all the jobs needed to maintain its own existence. Then AGI is no longer dependent on an economy of working humans to maintain its own existence. AGI would be displacing the human economy – as a hypothetical example, AGI is what you’d get if those semiconductor factories producing microchips expanded to producing servers and robots using those microchips that in turn learn somehow to design themselves to operate the factories and all the factory-needed infrastructure autonomously.
Then there is one forcing function left: the machine operation of control mechanisms. Ie. mechanisms that detect, model, simulate, evaluate, and correct downstream effects in order to keep AGI safe.
The question becomes – Can we rely on only control mechanisms to keep AGI safe? That question raises other questions.
E.g. as relevant to the hashiness model: “Consider the space of possible machinery output sequences over time. How large is the subset of output sequences that in their propagation as (cascading) environmental effects would end up lethally disrupting the bodily functioning of humans? How is the accumulative probability of human extinction distributed across the entire output possibility space (or simplified: how mixed are the adjoining lethal and non-lethal possibility subspaces)? Can any necessarily less complex control system connected with/in this machinery actually keep tracking whether possible machinery outputs fall into the lethal sub-space or the non-lethal sub-space? ”
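To give a rough flavour of that last question, here is a toy sketch (my own minimal construction under strong simplifying assumptions, with made-up function names – not the hashiness model itself): if the output-to-effect map is as unstructured as a cryptographic hash, a controller that can only model a coarse summary of each output tracks lethality no better than chance.

```python
# Toy sketch: a maximally "hashy" output-to-effect map versus a bounded controller.
import hashlib
import random

def effect_is_lethal(output: bytes) -> bool:
    # Stand-in for the environment: downstream effects depend on the full output
    # in a hash-like (unstructured, avalanche-prone) way.
    return hashlib.sha256(output).digest()[0] % 2 == 0

def bounded_controller(output: bytes, coarse_bits: int = 8) -> bool:
    # A controller less complex than the environment: it can only inspect the
    # first `coarse_bits` bits of each output and must guess from that summary.
    summary = int.from_bytes(output, "big") >> (len(output) * 8 - coarse_bits)
    return summary % 2 == 0

random.seed(0)
samples = [random.randbytes(16) for _ in range(10_000)]
hits = sum(bounded_controller(o) == effect_is_lethal(o) for o in samples)
print(f"bounded controller accuracy: {hits / len(samples):.3f}")  # ~0.5, i.e. chance
```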
This is pretty similar to Hendrycks’s natural selection argument, but with the additional piece that the goals of AIs will converge to optimizing the environment for the survival of silicon-based life.
There are some ways to expand Hendrycks’ argument to make it more comprehensive:
Consider evolutionary selection at the more fundamental level of physical component interactions. Ie. not just at the macro level of agents competing for resources, since this is a leaky abstraction that can easily fail to capture underlying vectors of change.
Consider not only the selection of local variations (ie. mutations) that introduce new functionality, but also the selection of variants connecting up with surrounding units in ways that end up repurposing existing functionality.
Consider not only the concept of goals that are (able to be) explicitly tracked by the machinery itself, but also that of the implicit conditions needed by components, which end up being selected for in expressions across the environment (see the toy sketch after this list).
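As a toy sketch of that last point (my own highly simplified illustration, not a model of actual AGI components): variants that happen to secure more of the conditions they need leave more copies, so the population’s aggregate draw on the environment grows without any explicit goal being represented anywhere.

```python
# Toy sketch: selection without explicit goals. Each variant is just a number in
# [0, 1] describing how strongly it draws resources to copy itself.
import random

random.seed(0)
population = [random.uniform(0.0, 1.0) for _ in range(200)]

for generation in range(60):
    # Reproduction is weighted by resource draw: behaviours that secure more of
    # the conditions a variant needs leave more copies (plus small mutations).
    parents = random.choices(population, weights=population, k=len(population))
    population = [min(1.0, max(0.0, p + random.gauss(0, 0.02))) for p in parents]

mean_draw = sum(population) / len(population)
print(f"mean resource draw after selection: {mean_draw:.2f}")
# Drifts toward 1.0: the aggregate environmental footprint grows as a side effect
# of selection, even though no variant tracks that as a goal.
```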
Evolutionary arguments are notoriously tricky and respected scientists get them wrong all the time
This is why we need to take extra care in modelling how evolution – as a kind of algorithm – would apply across the physical signalling pathways of AGI.
I might share a gears-level explanation that Forrest just gave in response to your comment.
The claims made will feel unfamiliar and the reasoning paths too. I suggest (again) taking the time to consider what is meant. If a conclusion looks intuitively wrong from some AI Safety perspective, it may be valuable to explicitly consider the argumentation and premises behind that.
I think your own message is also too extreme to be rational. So it seems to me that you are fighting fire with fire. Yes, Remmelt has some extreme expressions, but you definitely have extreme expressions here too, while having even weaker arguments.
Could we find a golden middle road, a common ground, please? With more reflective thinking and with less focus on right and wrong? (Regardless of the dismissive-judgemental title of this forum :P)
I agree that Remmelt can improve the message. And I believe he will do that.
I may not agree that we are going to die with 99% probability. At the same time, I find that his current directions are definitely worth exploring.
I also definitely respect Paul. But mentioning his name here is mostly irrelevant for my reasoning or for taking your arguments seriously, simply because I usually do not take authorities too seriously before I understand their reasoning in a particular question. And understanding a person’s reasoning may occasionally mean that I disagree on particular points as well. In my experience, even the most respected people are still people, which means they often think in messy ways and they are good just on average, not per instance of a thought line (which may mean they are poor thinkers 99% of the time, while having really valuable thoughts 1% of the time). I do not know the distribution for Paul, but I definitely would not be disappointed if he makes mistakes sometimes.
I think this part of Remmelt’s response sums it up nicely: “When accusing someone of crankery (which is a big deal) it is important not to fall into making vague hand-wavey statements yourself. You are making vague hand-wavey (and also inaccurate) statements above. Insinuating that something is “science-babble” doesn’t do anything. Calling an essay formatted as shorter lines a “poem” doesn’t do anything.”
In my interpretation, black-and-white thinking is not “crankery”. It is a normal and essential step in the development of cognition about a particular problem, unfortunately. There is research about that in the field of developmental and cognitive psychology. Hopefully that applies to your own black-and-white thinking as well. Note that, unfortunately, this development is topic-specific, not universal.
In contrast, “crankery” is too strong a word for describing black-and-white thinking because it is a very judgemental word, a complete dismissal, and essentially an expression of unwillingness to understand, an insult, not just a disagreement about the degree of the claims. Is labelling someone’s thoughts as “crankery” also a form of crankery of its own then? Paradoxical, isn’t it?
I usually do not take authorities too seriously before I understand their reasoning in a particular question. And understanding a person’s reasoning may occasionally mean that I disagree on particular points as well. In my experience, even the most respected people are still people, which means they often think in messy ways and they are good just on average
Right – this comes back to actually examining people’s reasoning.
Relying on the authority status of an insider (who dismissed the argument) or on your ‘crank vibe’ of the outsider (who made the argument) is not a reliable way of checking whether a particular argument is good.
IMO it’s also fine to say “Hey, I don’t have time to assess this argument, so for now I’m going to go with these priors that seemed to broadly kinda work in the past for filtering out poorly substantiated claims. But maybe someone else actually has a chance to go through the argument, I’ll keep an eye open.”
Yes, Remmelt has some extreme expressions…
I may not agree that we are going to die with 99% probability. At the same time, I find that his current directions are definitely worth exploring.
…describing black-and-white thinking
I’m putting these quotes together because I want to check whether you’re tracking the epistemic process I’m proposing here.
Reasoning logically from premises is necessarily black-and-white thinking. Either the truth value is true or it is false.
A way to check the reasoning is to first consider the premises (in how they are described using defined terms, do they correspond comprehensively enough with how the world works?). And then check whether the logic follows from the premises through to each next argument step until you reach the conclusion.
Finally, when you reach the conclusion, and you could not find any soundness or validity issues, then that is the conclusion you have reasoned to.
If the conclusion is that it turns out impossible for some physical/informational system to meet several specified desiderata at the same time, this conclusion may sound extreme.
But if you (and many other people in the field who are inclined to disagree with the conclusion) cannot find any problem with the reasoning, the rational thing would be to accept it, and then consider how it applies to the real world.
Apparently, computer scientists hotly contested the CAP theorem for a while. They wanted to build distributed data stores that could send messages that consistently represented new data entries, while the data was also made continuously available throughout the network, while the network was also tolerant to partitions. It turns out that you cannot have all three desiderata at once. Grumbling computer scientists just had to face reality and turn to designing systems that would fail in the least bad way.
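For readers unfamiliar with it, here is a toy sketch of the forced choice (a deliberately minimal illustration, not any real database): once the two replicas are partitioned, a read must either stay available and risk returning stale data, or refuse to answer in order to preserve consistency.

```python
# Toy two-replica store: during a partition you must pick which guarantee to drop.
class TinyStore:
    def __init__(self):
        self.a = None             # replica A's copy
        self.b = None             # replica B's copy
        self.partitioned = False  # True = the replicas cannot reach each other

    def write(self, value):
        self.a = value
        if not self.partitioned:
            self.b = value        # replication only succeeds while connected

    def read_from_b(self, mode):
        if self.partitioned and mode == "CP":
            raise RuntimeError("unavailable during partition")  # drops availability
        return self.b             # "AP": answers anyway, possibly with stale data

store = TinyStore()
store.write("v1")
store.partitioned = True
store.write("v2")                 # only replica A sees the new value
print(store.read_from_b("AP"))    # -> "v1": available, but inconsistent
# store.read_from_b("CP")         # -> raises: consistent behaviour, but unavailable
```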
Now, assume there is a new theorem for which the research community, in all their efforts, have not managed to find logical inconsistencies or empirical soundness issues. Based on this theorem, it turns out that you cannot both have machinery that keeps operating and learning autonomously across domains, and a control system that would contain the effects of that machinery enough to not feed back in ways that destabilise our environment outside the ranges we can survive in.
We need to make a decision then – what would be the least bad way to fail here? On one hand we could decide against designing increasingly autonomous machines, and lose out on the possibility of having machines running around doing things for us. On the other hand, we could have the machinery fail in about the worst way possible, which is to destroy all existing life on this planet.
Respect for doing this.
Noticing no response here after we addressed superficial critiques and moved to discussing the actual argument.
For those few interested in questions raised above, Forrest wrote some responses: http://69.27.64.19/ai_alignment_1/d_241016_recap_gen.html
BTW if anyone does want to get into the argument, Will Petillo’s Lenses of Control post is a good entry point.
It’s concise and correct – a difficult combination to achieve here.