It is impossible to verify a model’s safety—even given arbitrarily good transparency tools—without access to that model’s training process. For example, you could get a deceptive model that gradient hacks itself in such a way that cryptographically obfuscates its deception.
It is impossible in general to use interpretability tools to select models to have a particular behavioral property. I think this is clear if you just stare at Rice’s theorem enough: checking non-trivial behavioral properties, even with mechanistic access, is in general undecidable. Note, however, that this doesn’t rule out checking a mechanistic property that implies a behavioral property.
Consider my vote to be placed that you should turn this into a post, keep going for literally as long as you can, expand things to paragraphs, and branch out beyond things you can easily find links for.
(I do think there’s a noticeable extent to which I was trying to list difficulties more central than those, but I also think many people could benefit from reading a list of 100 noncentral difficulties.)
I do think there’s a noticeable extent to which I was trying to list difficulties more central than those
Probably people disagree about which things are more central, or as evhub put it:
Every time anybody writes up any overview of AI safety, they have to make tradeoffs [...] depending on what the author personally believes is most important/relevant to say
Now FWIW I thought evhub was overly dismissive of (4) in which you made an important meta-point:
EY: 4. We can’t just “decide not to build AGI” because GPUs are everywhere, and knowledge of algorithms is constantly being improved and published; 2 years after the leading actor has the capability to destroy the world, 5 other actors will have the capability to destroy the world. The given lethal challenge is to solve within a time limit, driven by the dynamic in which, over time, increasingly weak actors with a smaller and smaller fraction of total computing power, become able to build AGI and destroy the world. Powerful actors all refraining in unison from doing the suicidal thing just delays this time limit—it does not lift it [...]
evhub: This is just answering a particular bad plan.
But I would add a criticism of my own, that this “List of Lethalities” somehow just takes it for granted that AGI will try to kill usall without ever specifically arguing that case. Instead you just argue vaguely in that direction, in passing, while making broader/different points:
an AGI strongly optimizing on that signal will kill you, because the sensory reward signal was not a ground truth about alignment (???)
All of these kill you if optimized-over by a sufficiently powerful intelligence, because they imply strategies like ‘kill everyone in the world using nanotech to strike before they know they’re in a battle, and have control of your reward button forever after’. (I guess that makes sense)
If you perfectly learn and perfectly maximize the referent of rewards assigned by human operators, that kills them. (???)
Perhaps you didn’t bother because your audience is meant to be people who already believe this? I would at least expect to see it in the intro: “-5. unaligned superintelligences tend to try to kill everyone, here’s why <link>.… −4. all the most obvious proposed solutions to (-5) don’t work, here’s why <link>”.
(Note that these have a theme: you can’t wrangle general computation / optimization. That’s why I’m short universal approaches to AI alignment (approaches that aim at making general optimization safe by enforcing universal rules), and long existential approaches (approaches that try to find specific mechanisms that can be analytically seen to do the right thing).)
Question: is it actually possible to cryptographically obfuscate goals/plans/precomittments (or other elements of “deception” as mentioned in your second bullet point) against something inspecting your algorithm and working memory?
In particular, it seems to me that anything properly able to inspect you can, at a minimum, discover the use of encryption, and find where the encryption key is read from, and then read the encrypted content. Or fail the model once encryption is detected at all.
I’m not saying it will be easy to build such transparency tools—in fact, that must be a herculean challenge, since selecting against interpreted thoughts selects for uninterpretable thoughts, and a superintelligence might use principles not yet known to humanity—I just mean that, from the perspective of the AI itself, I think perfect obfuscation against ideal logically-possible interpretability tools might be impossible.
Edit:
Nevermind, I answered my own question. There is no such logically-possible interpretability tool. For any such tool we logically-counterfactually assume could exist, then the AI could implement that tool, and utilize encryption if and only if that tool asserts that the AI doesn’t use encryption. …therefore no algorithm exists which can verify the use of encryption in the general case.
Sure—that’s easy enough. Just off the top of my head, here’s five safety concerns that I think are important that I don’t think you included:
The fact that there exist functions that are easier to verify than satisfy ensures that adversarial training can never guarantee the absence of deception.
It is impossible to verify a model’s safety—even given arbitrarily good transparency tools—without access to that model’s training process. For example, you could get a deceptive model that gradient hacks itself in such a way that cryptographically obfuscates its deception.
It is impossible in general to use interpretability tools to select models to have a particular behavioral property. I think this is clear if you just stare at Rice’s theorem enough: checking non-trivial behavioral properties, even with mechanistic access, is in general undecidable. Note, however, that this doesn’t rule out checking a mechanistic property that implies a behavioral property.
Any prior you use to incentivize models to behave in a particular way doesn’t necessarily translate to situations where that model itself runs another search over algorithms. For example, the fastest way to search for algorithms isn’t to search for the fastest algorithm.
Even if a model is trained in a myopic way—or even if a model is in fact myopic in the sense that it only optimizes some single-step objective—such a model can still end up deceiving you, e.g. if it cooperates with other versions of itself.
Consider my vote to be placed that you should turn this into a post, keep going for literally as long as you can, expand things to paragraphs, and branch out beyond things you can easily find links for.
(I do think there’s a noticeable extent to which I was trying to list difficulties more central than those, but I also think many people could benefit from reading a list of 100 noncentral difficulties.)
Probably people disagree about which things are more central, or as evhub put it:
Now FWIW I thought evhub was overly dismissive of (4) in which you made an important meta-point:
But I would add a criticism of my own, that this “List of Lethalities” somehow just takes it for granted that AGI will try to kill us all without ever specifically arguing that case. Instead you just argue vaguely in that direction, in passing, while making broader/different points:
Perhaps you didn’t bother because your audience is meant to be people who already believe this? I would at least expect to see it in the intro: “-5. unaligned superintelligences tend to try to kill everyone, here’s why <link>.… −4. all the most obvious proposed solutions to (-5) don’t work, here’s why <link>”.
(Note that these have a theme: you can’t wrangle general computation / optimization. That’s why I’m short universal approaches to AI alignment (approaches that aim at making general optimization safe by enforcing universal rules), and long existential approaches (approaches that try to find specific mechanisms that can be analytically seen to do the right thing).)
Question: is it actually possible to cryptographically obfuscate goals/plans/precomittments (or other elements of “deception” as mentioned in your second bullet point) against something inspecting your algorithm and working memory?
In particular, it seems to me that anything properly able to inspect you can, at a minimum, discover the use of encryption, and find where the encryption key is read from, and then read the encrypted content. Or fail the model once encryption is detected at all.
I’m not saying it will be easy to build such transparency tools—in fact, that must be a herculean challenge, since selecting against interpreted thoughts selects for uninterpretable thoughts, and a superintelligence might use principles not yet known to humanity—I just mean that, from the perspective of the AI itself, I think perfect obfuscation against ideal logically-possible interpretability tools might be impossible.
Edit:
Nevermind, I answered my own question. There is no such logically-possible interpretability tool. For any such tool we logically-counterfactually assume could exist, then the AI could implement that tool, and utilize encryption if and only if that tool asserts that the AI doesn’t use encryption. …therefore no algorithm exists which can verify the use of encryption in the general case.