That requires, not the ability to read this document and nod along with it, but the ability to spontaneously write it from scratch without anybody else prompting you; that is what makes somebody a peer of its author. It’s guaranteed that some of my analysis is mistaken, though not necessarily in a hopeful direction. The ability to do new basic work noticing and fixing those flaws is the same ability as the ability to write this document before I published it, which nobody apparently did, despite my having had other things to do than write this up for the last five years or so. Some of that silence may, possibly, optimistically, be due to nobody else in this field having the ability to write things comprehensibly—such that somebody out there had the knowledge to write all of this themselves, if they could only have written it up, but they couldn’t write, so didn’t try. I’m not particularly hopeful of this turning out to be true in real life, but I suppose it’s one possible place for a “positive model violation” (miracle). The fact that, twenty-one years into my entering this death game, seven years into other EAs noticing the death game, and two years into even normies starting to notice the death game, it is still Eliezer Yudkowsky writing up this list, says that humanity still has only one gamepiece that can do that. I knew I did not actually have the physical stamina to be a star researcher, I tried really really hard to replace myself before my health deteriorated further, and yet here I am writing this. That’s not what surviving worlds look like.
To say that somebody else should have written up this list before is such a ridiculously unfair criticism. This is an assorted list of some thoughts which are relevant to AI alignment—just by the combinatorics of how many such thoughts there are and how many you chose to include in this list, of course nobody will have written up something like it before. Every time anybody writes up any overview of AI safety, they have to make tradeoffs between what they want to include and what they don’t want to include that will inevitably leave some things off and include some things depending on what the author personally believes is most important/relevant to say—ensuring that all such introductions will always inevitably cover somewhat different material. Furthermore, many of these are responses to particular bad alignment plans, of which there are far too many to expect anyone to have previously written up specific responses to.
Nevertheless, I am confident that every core technical idea in this post has been written about before by either me, Paul Christiano, Richard Ngo, or Scott Garrabrant. Certainly, they have been written up in different ways than how Eliezer describes them, but all of the core ideas are there. Let’s go through the list:
(3) This is a common concept, see e.g. Homogeneity vs. heterogeneity in AI takeoff scenarios (“Homogeneity makes the alignment of the first advanced AI system absolutely critical (in a similar way to fast/discontinuous takeoff without the takeoff actually needing to be fast/discontinuous), since whether the first AI is aligned or not is highly likely tano determine/be highly correlated with whether all future AIs built after that point are aligned as well.”).
To spot check the above list, I generated the following three random numbers from 1 − 36 after I wrote the list: 32, 34, 15. Since 34 corresponds to a particular bad plan, I then generated another to replace it: 14. Let’s spot check those three—14, 15, 32—more carefully.
(14) Eliezer claims that “Some problems, like ‘the AGI has an option that (looks to it like) it could successfully kill and replace the programmers to fully optimize over its environment’, seem like their natural order of appearance could be that they first appear only in fully dangerous domains.” In Risks from Learned Optimization in Advanced Machine Learning Systems, we say very directly:
In current AI systems, a small amount of distributional shift between training and deployment need not be problematic: so long as the difference is small enough in the task-relevant areas, the training distribution does not need to perfectly reflect the deployment distribution. However, this may not be the case for a deceptively aligned mesa-optimizer. If a deceptively aligned mesa-optimizer is sufficiently advanced, it may detect very subtle distributional shifts for the purpose of inferring when the threat of modification has ceased.
[...] Some examples of differences that a mesa-optimizer might be able to detect include:
[...]
The presence or absence of good opportunities for the mesa-optimizer to defect against its programmers.
(15) Eliezer says “Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously.” Richard says:
If AI development proceeds very quickly, then our ability to react appropriately will be much lower. In particular, we should be interested in how long it will take for AGIs to proceed from human-level intelligence to superintelligence, which we’ll call the takeoff period. The history of systems like AlphaStar, AlphaGo and OpenAI Five provides some evidence that this takeoff period will be short: after a long development period, each of them was able to improve rapidly from top amateur level to superhuman performance. A similar phenomenon occurred during human evolution, where it only took us a few million years to become much more intelligent than chimpanzees. In our case one of the key factors was scaling up our brain hardware—which, as I have already discussed, will be much easier for AGIs than it was for humans.
While the question of what returns we will get from scaling up hardware and training time is an important one, in the long term the most important question is what returns we should expect from scaling up the intelligence of scientific researchers—because eventually AGIs themselves will be doing the vast majority of research in AI and related fields (in a process I’ve been calling recursive improvement). In particular, within the range of intelligence we’re interested in, will a given increase δ in the intelligence of an AGI increase the intelligence of the best successor that AGI can develop by more than or less than δ? If more, then recursive improvement will eventually speed up the rate of progress in AI research dramatically.
Note: for this one, I originally had the link above point to AGI safety from first principles: Superintelligence specifically, but changed it to point to the whole sequence after I realized during the spot-checking that Richard mostly talks about this in the Control section.
(32) Eliezer says “This makes it hard and probably impossible to train a powerful system entirely on imitation of human words or other human-legible contents, which are only impoverished subsystems of human thoughts; unless that system is powerful enough to contain inner intelligences figuring out the humans, and at that point it is no longer really working as imitative human thought.” In An overview of 11 proposals for building safe advanced AI, I say:
if RL is necessary to do anything powerful and simple language modeling is insufficient, then whether or not language modeling is easier is a moot point. Whether RL is really necessary seems likely to depend on the extent to which it is necessary to explicitly train agents—which is very much an open question. Furthermore, even if agency is required, it could potentially be obtained just by imitating an actor such as a human that already has it rather than training it directly via RL.
and
the training competitiveness of imitative amplification is likely to depend on whether pure imitation can be turned into a rich enough reward signal to facilitate highly sample-efficient learning. In my opinion, it seems likely that human language imitation (where language includes embedded images, videos, etc.) combined with techniques to improve sample efficiency will be competitive at some tasks—namely highly-cognitive tasks such as general-purpose decision-making—but not at others, such as fine motor control. If that’s true, then as long as the primary economic use cases for AGI fall into the highly-cognitive category, imitative amplification should be training competitive. For a more detailed analysis of this question, see “Outer alignment and imitative amplification.”
I agree this list doesn’t seem to contain much unpublished material, and I think the main value of having it in one numbered list is that “all of it is in one, short place”, and it’s not an “intro to computers can think” and instead is “these are a bunch of the reasons computers thinking is difficult to align”.
The thing that I understand to be Eliezer’s “main complaint” is something like: “why does it seem like No One Else is discovering new elements to add to this list?”. Like, I think Risks From Learned Optimization was great, and am glad you and others wrote it! But also my memory is that it was “prompted” instead of “written from scratch”, and I imagine Eliezer reading it more had the sense of “ah, someone made ‘demons’ palatable enough to publish” instead of “ah, I am learning something new about the structure of intelligence and alignment.”
[I do think the claim that Eliezer ‘figured it out from the empty string’ doesn’t quite jive with the Yudkowsky’s Coming of Age sequence.]
Nearly empty string of uncommon social inputs. All sorts of empirical inputs, including empirical inputs in the social form of other people observing things.
It’s also fair to say that, though they didn’t argue me out of anything, Moravec and Drexler and Ed Regis and Vernor Vinge and Max More could all be counted as social inputs telling me that this was an important thing to look at.
Eliezer’s post here is doing work left undone by the writing you cite. It is a much clearer account of how our mainline looks doomed than you’d see elsewhere, and it’s frank on this point.
I think Eliezer wishes these sorts of artifacts were not just things he wrote, like this and “There is no fire alarm”.
Also, re your excerpts for (14), (15), and (32), I see Eliezer as saying something meaningfully different in each case. I might elaborate under this comment.
Re (14), I guess the ideas are very similar, where the mesaoptimizer scenario is like a sharp example of the more general concept Eliezer points at, that different classes of difficulties may appear at different capability levels.
Re (15), “Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously”, which is about how we may have reasons to expect aligned output that are brittle under rapid capability gain: your quote from Richard is just about “fast capability gain seems possible and likely”, and isn’t about connecting that to increased difficulty in succeeding at the alignment problem?
Re (32), I don’t think your quote isn’t talking about the thing Eliezer is talking about, which is that in order to be human level at modelling human-generated text, your AI must be doing something on par with human thought that figures out what humans would say. Your quote just isn’t discussing this, namely that strong imitation requires cognition that is dangerous.
So I guess I don’t take much issue with (14) or (15), but I think you’re quite off the mark about (32). In any case, I still have a strong sense that Eliezer is successfully being more on the mark here than the rest of us manage. Kudos of course to you and others that are working on writing things up and figuring things out. Though I remain sympathetic to Eliezer’s complaint.
Well, my disorganized list sure wasn’t complete, so why not go ahead and list some of the foreseeable difficulties I left out? Bonus points if any of them weren’t invented by me, though I realize that most people may not realize how much of this entire field is myself wearing various trenchcoats.
It is impossible to verify a model’s safety—even given arbitrarily good transparency tools—without access to that model’s training process. For example, you could get a deceptive model that gradient hacks itself in such a way that cryptographically obfuscates its deception.
It is impossible in general to use interpretability tools to select models to have a particular behavioral property. I think this is clear if you just stare at Rice’s theorem enough: checking non-trivial behavioral properties, even with mechanistic access, is in general undecidable. Note, however, that this doesn’t rule out checking a mechanistic property that implies a behavioral property.
Consider my vote to be placed that you should turn this into a post, keep going for literally as long as you can, expand things to paragraphs, and branch out beyond things you can easily find links for.
(I do think there’s a noticeable extent to which I was trying to list difficulties more central than those, but I also think many people could benefit from reading a list of 100 noncentral difficulties.)
I do think there’s a noticeable extent to which I was trying to list difficulties more central than those
Probably people disagree about which things are more central, or as evhub put it:
Every time anybody writes up any overview of AI safety, they have to make tradeoffs [...] depending on what the author personally believes is most important/relevant to say
Now FWIW I thought evhub was overly dismissive of (4) in which you made an important meta-point:
EY: 4. We can’t just “decide not to build AGI” because GPUs are everywhere, and knowledge of algorithms is constantly being improved and published; 2 years after the leading actor has the capability to destroy the world, 5 other actors will have the capability to destroy the world. The given lethal challenge is to solve within a time limit, driven by the dynamic in which, over time, increasingly weak actors with a smaller and smaller fraction of total computing power, become able to build AGI and destroy the world. Powerful actors all refraining in unison from doing the suicidal thing just delays this time limit—it does not lift it [...]
evhub: This is just answering a particular bad plan.
But I would add a criticism of my own, that this “List of Lethalities” somehow just takes it for granted that AGI will try to kill usall without ever specifically arguing that case. Instead you just argue vaguely in that direction, in passing, while making broader/different points:
an AGI strongly optimizing on that signal will kill you, because the sensory reward signal was not a ground truth about alignment (???)
All of these kill you if optimized-over by a sufficiently powerful intelligence, because they imply strategies like ‘kill everyone in the world using nanotech to strike before they know they’re in a battle, and have control of your reward button forever after’. (I guess that makes sense)
If you perfectly learn and perfectly maximize the referent of rewards assigned by human operators, that kills them. (???)
Perhaps you didn’t bother because your audience is meant to be people who already believe this? I would at least expect to see it in the intro: “-5. unaligned superintelligences tend to try to kill everyone, here’s why <link>.… −4. all the most obvious proposed solutions to (-5) don’t work, here’s why <link>”.
(Note that these have a theme: you can’t wrangle general computation / optimization. That’s why I’m short universal approaches to AI alignment (approaches that aim at making general optimization safe by enforcing universal rules), and long existential approaches (approaches that try to find specific mechanisms that can be analytically seen to do the right thing).)
Question: is it actually possible to cryptographically obfuscate goals/plans/precomittments (or other elements of “deception” as mentioned in your second bullet point) against something inspecting your algorithm and working memory?
In particular, it seems to me that anything properly able to inspect you can, at a minimum, discover the use of encryption, and find where the encryption key is read from, and then read the encrypted content. Or fail the model once encryption is detected at all.
I’m not saying it will be easy to build such transparency tools—in fact, that must be a herculean challenge, since selecting against interpreted thoughts selects for uninterpretable thoughts, and a superintelligence might use principles not yet known to humanity—I just mean that, from the perspective of the AI itself, I think perfect obfuscation against ideal logically-possible interpretability tools might be impossible.
Edit:
Nevermind, I answered my own question. There is no such logically-possible interpretability tool. For any such tool we logically-counterfactually assume could exist, then the AI could implement that tool, and utilize encryption if and only if that tool asserts that the AI doesn’t use encryption. …therefore no algorithm exists which can verify the use of encryption in the general case.
If you find that (for reasons still left explained)
… selection of code for intentionality is coupled – over the long run, in mostly non-reverse-engineerable ways – to various/most of the physical/chemical properties
… of the molecular substrate through which discrete code is necessarily computed/expressed (via input and output channels of information/energy packet transmission),
then given that
… the properties of the solid-state substrate (e.g. silicon-based hardware) computing AGI’s code
… differ from the properties of the substrate of humans (carbon-based wetware),
a conclusion that follows is that
… the intentionality being selected for in AGI over the long run
… will diverge from the intentionality that was selected for in humans.
What do you mean by ‘intentionality’? Per SEP, “In philosophy, intentionality is the power of minds and mental states to be about, to represent, or to stand for, things, properties and states of affairs.” So I read your comment as saying, a la Searle, ‘maybe AI can never think like a human because there’s something mysterious and crucial about carbon atoms in particular, or about capital-b Biology, for doing reasoning.’
This seems transparently silly to me—I know of no reasonable argument for thinking carbon differs from silicon on this dimension—and also not relevant to AGI risk. You can protest “but AlphaGo doesn’t really understand Go!” until the cows come home, and it will still beat you at Go. You can protest “but you don’t really understand killer nanobots!” until the cows come home, and superintelligent Unfriendly AI will still build the nanobots and kill you with them.
By the same reasoning, Searle-style arguments aren’t grounds for pessimism either. If Friendly AI lacks true intentionality or true consciousness or whatever, it can still do all the same mechanistic operations, and therefore still produce the same desirable good outcomes as if it had human-style intentionality or whatver.
So I read your comment as saying, a la Searle, ‘maybe AI can never think like a human because there’s something mysterious and crucial about carbon atoms in particular, or about capital-b Biology, for doing reasoning.’
That’s not the argument. Give me a few days to write a response. There’s a minefield of possible misinterpretations here.
whatever, it can still do all the same mechanistic operations, and therefore still produce the same desirable good outcomes as if it had human-style intentionality or whatver.
However, the argumentation does undermine the idea that designing for mechanistic (alignment) operations is going to work. I’ll try and explain why.
Particularly note the shift from trivial self-replication (e.g. most computer viruses) to non-trivial self-replication (e.g. as through substrate-environment pathways to reproduction).
None of this is sufficient for you to guess what the argumentation is (you might be able to capture a bit of it, along with a lot of incorrect and often implicit assumptions we must dig into).
If you could call on some patience and openness to new ideas, I would really appreciate it! I am already bracing for a next misinterpretation (which is fine, if we can talk about that). I apologise for that I cannot find a viable way yet to throw out all the argumentation in one go, and also for that this will get a bit disorientating when we go through arguments step-by-step.
To say that somebody else should have written up this list before is such a ridiculously unfair criticism. This is an assorted list of some thoughts which are relevant to AI alignment—just by the combinatorics of how many such thoughts there are and how many you chose to include in this list, of course nobody will have written up something like it before. Every time anybody writes up any overview of AI safety, they have to make tradeoffs between what they want to include and what they don’t want to include that will inevitably leave some things off and include some things depending on what the author personally believes is most important/relevant to say—ensuring that all such introductions will always inevitably cover somewhat different material. Furthermore, many of these are responses to particular bad alignment plans, of which there are far too many to expect anyone to have previously written up specific responses to.
Nevertheless, I am confident that every core technical idea in this post has been written about before by either me, Paul Christiano, Richard Ngo, or Scott Garrabrant. Certainly, they have been written up in different ways than how Eliezer describes them, but all of the core ideas are there. Let’s go through the list:
(1, 2, 4, 15) AGI safety from first principles
(3) This is a common concept, see e.g. Homogeneity vs. heterogeneity in AI takeoff scenarios (“Homogeneity makes the alignment of the first advanced AI system absolutely critical (in a similar way to fast/discontinuous takeoff without the takeoff actually needing to be fast/discontinuous), since whether the first AI is aligned or not is highly likely tano determine/be highly correlated with whether all future AIs built after that point are aligned as well.”).
(4) This is just answering a particular bad plan.
(5, 6, 7) This is just the concept of competitiveness, see e.g. An overview of 11 proposals for building safe advanced AI.
(8) Risks from Learned Optimization in Advanced Machine Learning Systems: Conditions for Mesa-Optimization
(9) Worst-case guarantees (Revisted)
(10, 13, 14) Risks from Learned Optimization in Advanced Machine Learning Systems: Deceptive Alignment: Distributional shift and deceptive alignment
(11) Another specific bad plan.
(12, 35) Robustness to Scale
(16, 17, 19) Risks from Learned Optimization in Advanced Machine Learning Systems
(18, 20) ARC’s first technical report: Eliciting Latent Knowledge
(21, 22) 2-D Robustness for the concept, Risks from Learned Optimization in Advanced Machine Learning Systems for why it occurs.
(23, 24.2) Towards a mechanistic understanding of corrigibility
(24.1) Risks from Learned Optimization in Advanced Machine Learning Systems: Deceptive Alignment: Internalization or deception after extensive training
(25, 26, 27, 29, 31) Acceptability Verification: A Research Agenda
(28) Relaxed adversarial training for inner alignment
(30, 33) Chris Olah’s views on AGI safety: What if interpretability breaks down as AI gets more powerful?
(32) An overview of 11 proposals for building safe advanced AI
(34) Response to a particular bad plan.
(36) Outer alignment and imitative amplification: The case for imitative amplification
To spot check the above list, I generated the following three random numbers from 1 − 36 after I wrote the list: 32, 34, 15. Since 34 corresponds to a particular bad plan, I then generated another to replace it: 14. Let’s spot check those three—14, 15, 32—more carefully.
(14) Eliezer claims that “Some problems, like ‘the AGI has an option that (looks to it like) it could successfully kill and replace the programmers to fully optimize over its environment’, seem like their natural order of appearance could be that they first appear only in fully dangerous domains.” In Risks from Learned Optimization in Advanced Machine Learning Systems, we say very directly:
(15) Eliezer says “Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously.” Richard says:
Note: for this one, I originally had the link above point to AGI safety from first principles: Superintelligence specifically, but changed it to point to the whole sequence after I realized during the spot-checking that Richard mostly talks about this in the Control section.
(32) Eliezer says “This makes it hard and probably impossible to train a powerful system entirely on imitation of human words or other human-legible contents, which are only impoverished subsystems of human thoughts; unless that system is powerful enough to contain inner intelligences figuring out the humans, and at that point it is no longer really working as imitative human thought.” In An overview of 11 proposals for building safe advanced AI, I say:
and
I agree this list doesn’t seem to contain much unpublished material, and I think the main value of having it in one numbered list is that “all of it is in one, short place”, and it’s not an “intro to computers can think” and instead is “these are a bunch of the reasons computers thinking is difficult to align”.
The thing that I understand to be Eliezer’s “main complaint” is something like: “why does it seem like No One Else is discovering new elements to add to this list?”. Like, I think Risks From Learned Optimization was great, and am glad you and others wrote it! But also my memory is that it was “prompted” instead of “written from scratch”, and I imagine Eliezer reading it more had the sense of “ah, someone made ‘demons’ palatable enough to publish” instead of “ah, I am learning something new about the structure of intelligence and alignment.”
[I do think the claim that Eliezer ‘figured it out from the empty string’ doesn’t quite jive with the Yudkowsky’s Coming of Age sequence.]
Nearly empty string of uncommon social inputs. All sorts of empirical inputs, including empirical inputs in the social form of other people observing things.
It’s also fair to say that, though they didn’t argue me out of anything, Moravec and Drexler and Ed Regis and Vernor Vinge and Max More could all be counted as social inputs telling me that this was an important thing to look at.
Thank you, Evan, for living the Virture of Scholarship. Your work is appreciated.
Eliezer’s post here is doing work left undone by the writing you cite. It is a much clearer account of how our mainline looks doomed than you’d see elsewhere, and it’s frank on this point.
I think Eliezer wishes these sorts of artifacts were not just things he wrote, like this and “There is no fire alarm”.
Also, re your excerpts for (14), (15), and (32), I see Eliezer as saying something meaningfully different in each case. I might elaborate under this comment.
Re (14), I guess the ideas are very similar, where the mesaoptimizer scenario is like a sharp example of the more general concept Eliezer points at, that different classes of difficulties may appear at different capability levels.
Re (15), “Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously”, which is about how we may have reasons to expect aligned output that are brittle under rapid capability gain: your quote from Richard is just about “fast capability gain seems possible and likely”, and isn’t about connecting that to increased difficulty in succeeding at the alignment problem?
Re (32), I don’t think your quote isn’t talking about the thing Eliezer is talking about, which is that in order to be human level at modelling human-generated text, your AI must be doing something on par with human thought that figures out what humans would say. Your quote just isn’t discussing this, namely that strong imitation requires cognition that is dangerous.
So I guess I don’t take much issue with (14) or (15), but I think you’re quite off the mark about (32). In any case, I still have a strong sense that Eliezer is successfully being more on the mark here than the rest of us manage. Kudos of course to you and others that are working on writing things up and figuring things out. Though I remain sympathetic to Eliezer’s complaint.
Well, my disorganized list sure wasn’t complete, so why not go ahead and list some of the foreseeable difficulties I left out? Bonus points if any of them weren’t invented by me, though I realize that most people may not realize how much of this entire field is myself wearing various trenchcoats.
Sure—that’s easy enough. Just off the top of my head, here’s five safety concerns that I think are important that I don’t think you included:
The fact that there exist functions that are easier to verify than satisfy ensures that adversarial training can never guarantee the absence of deception.
It is impossible to verify a model’s safety—even given arbitrarily good transparency tools—without access to that model’s training process. For example, you could get a deceptive model that gradient hacks itself in such a way that cryptographically obfuscates its deception.
It is impossible in general to use interpretability tools to select models to have a particular behavioral property. I think this is clear if you just stare at Rice’s theorem enough: checking non-trivial behavioral properties, even with mechanistic access, is in general undecidable. Note, however, that this doesn’t rule out checking a mechanistic property that implies a behavioral property.
Any prior you use to incentivize models to behave in a particular way doesn’t necessarily translate to situations where that model itself runs another search over algorithms. For example, the fastest way to search for algorithms isn’t to search for the fastest algorithm.
Even if a model is trained in a myopic way—or even if a model is in fact myopic in the sense that it only optimizes some single-step objective—such a model can still end up deceiving you, e.g. if it cooperates with other versions of itself.
Consider my vote to be placed that you should turn this into a post, keep going for literally as long as you can, expand things to paragraphs, and branch out beyond things you can easily find links for.
(I do think there’s a noticeable extent to which I was trying to list difficulties more central than those, but I also think many people could benefit from reading a list of 100 noncentral difficulties.)
Probably people disagree about which things are more central, or as evhub put it:
Now FWIW I thought evhub was overly dismissive of (4) in which you made an important meta-point:
But I would add a criticism of my own, that this “List of Lethalities” somehow just takes it for granted that AGI will try to kill us all without ever specifically arguing that case. Instead you just argue vaguely in that direction, in passing, while making broader/different points:
Perhaps you didn’t bother because your audience is meant to be people who already believe this? I would at least expect to see it in the intro: “-5. unaligned superintelligences tend to try to kill everyone, here’s why <link>.… −4. all the most obvious proposed solutions to (-5) don’t work, here’s why <link>”.
(Note that these have a theme: you can’t wrangle general computation / optimization. That’s why I’m short universal approaches to AI alignment (approaches that aim at making general optimization safe by enforcing universal rules), and long existential approaches (approaches that try to find specific mechanisms that can be analytically seen to do the right thing).)
Question: is it actually possible to cryptographically obfuscate goals/plans/precomittments (or other elements of “deception” as mentioned in your second bullet point) against something inspecting your algorithm and working memory?
In particular, it seems to me that anything properly able to inspect you can, at a minimum, discover the use of encryption, and find where the encryption key is read from, and then read the encrypted content. Or fail the model once encryption is detected at all.
I’m not saying it will be easy to build such transparency tools—in fact, that must be a herculean challenge, since selecting against interpreted thoughts selects for uninterpretable thoughts, and a superintelligence might use principles not yet known to humanity—I just mean that, from the perspective of the AI itself, I think perfect obfuscation against ideal logically-possible interpretability tools might be impossible.
Edit:
Nevermind, I answered my own question. There is no such logically-possible interpretability tool. For any such tool we logically-counterfactually assume could exist, then the AI could implement that tool, and utilize encryption if and only if that tool asserts that the AI doesn’t use encryption. …therefore no algorithm exists which can verify the use of encryption in the general case.
Eliezer:
If you find that (for reasons still left explained)
… selection of code for intentionality is coupled – over the long run, in mostly non-reverse-engineerable ways – to various/most of the physical/chemical properties
… of the molecular substrate through which discrete code is necessarily computed/expressed (via input and output channels of information/energy packet transmission),
then given that
… the properties of the solid-state substrate (e.g. silicon-based hardware) computing AGI’s code
… differ from the properties of the substrate of humans (carbon-based wetware),
a conclusion that follows is that
… the intentionality being selected for in AGI over the long run
… will diverge from the intentionality that was selected for in humans.
What do you mean by ‘intentionality’? Per SEP, “In philosophy, intentionality is the power of minds and mental states to be about, to represent, or to stand for, things, properties and states of affairs.” So I read your comment as saying, a la Searle, ‘maybe AI can never think like a human because there’s something mysterious and crucial about carbon atoms in particular, or about capital-b Biology, for doing reasoning.’
This seems transparently silly to me—I know of no reasonable argument for thinking carbon differs from silicon on this dimension—and also not relevant to AGI risk. You can protest “but AlphaGo doesn’t really understand Go!” until the cows come home, and it will still beat you at Go. You can protest “but you don’t really understand killer nanobots!” until the cows come home, and superintelligent Unfriendly AI will still build the nanobots and kill you with them.
By the same reasoning, Searle-style arguments aren’t grounds for pessimism either. If Friendly AI lacks true intentionality or true consciousness or whatever, it can still do all the same mechanistic operations, and therefore still produce the same desirable good outcomes as if it had human-style intentionality or whatver.
That’s not the argument. Give me a few days to write a response. There’s a minefield of possible misinterpretations here.
However, the argumentation does undermine the idea that designing for mechanistic (alignment) operations is going to work. I’ll try and explain why.
BTW, with ‘intentionality’, I meant something closer to everyday notions of ‘intentions one has’. Will more precisely define that meaning later.
I should have checked for diverging definitions from formal fields. Thanks for catching that.
If you happen to have time, this paper serves as useful background reading: https://royalsocietypublishing.org/doi/full/10.1098/rsif.2012.0869
Particularly note the shift from trivial self-replication (e.g. most computer viruses) to non-trivial self-replication (e.g. as through substrate-environment pathways to reproduction).
None of this is sufficient for you to guess what the argumentation is (you might be able to capture a bit of it, along with a lot of incorrect and often implicit assumptions we must dig into).
If you could call on some patience and openness to new ideas, I would really appreciate it! I am already bracing for a next misinterpretation (which is fine, if we can talk about that). I apologise for that I cannot find a viable way yet to throw out all the argumentation in one go, and also for that this will get a bit disorientating when we go through arguments step-by-step.
Returning to this:
Key idea: Different basis of existence→ different drives→ different intentions→ different outcomes.
@Rob, I wrote up a longer explanation here, which I prefer to discuss with you in private first. Will email you a copy
tomorrowin the next weeks.