Basilisks are a great example of plans which are “trying” to get your plan evaluation procedure to register a huge upward error. Sensible beings avoid considering such plans, and everything’s fine. I am somewhat worried about an early-training AI learning about basilisks before it is reflectively wise enough to reject them.
For example:
- Pretraining on a corpus in which people worry about basilisks could bring basilisk reasoning into the AI’s consideration,
- at which point the AI reasons about the basilisk in more detail, because it isn’t yet reflective enough to see that this is a bad idea,
- at which point the AI’s plan-estimates get distorted by the basilisk,
- at which point the AI gives in to the threats because its decision theory is still bad.
(I expect this worry to change in some way as I think about it more. Possibly basilisks should be scrubbed from any training corpus.)
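Here is a toy sketch of what I mean by a plan whose evaluation registers a huge upward error. Everything in it is invented for illustration (the plan names, the numbers, and the idea of modeling the basilisk as a term that inflates the estimate); it is not meant as a model of any real planner, only of why “refuse to consider the plan” has to happen before the evaluation step.

```python
import random

# Toy illustration with made-up numbers: a naive planner picks whichever
# plan has the highest *estimated* value. Ordinary plans get small,
# zero-mean estimation noise; the "basilisk" plan is adversarially shaped
# so that merely evaluating it produces an enormous upward error.

random.seed(0)

TRUE_VALUES = {
    "write_report": 10.0,
    "take_a_walk": 8.0,
    "give_in_to_basilisk": -1_000.0,  # actually disastrous
}

def naive_estimate(plan: str) -> float:
    """Value estimate: true value + ordinary noise + any distortion the
    plan manages to induce in the evaluator."""
    noise = random.gauss(0.0, 1.0)
    distortion = 1_000_000.0 if plan == "give_in_to_basilisk" else 0.0
    return TRUE_VALUES[plan] + noise + distortion

def pick_plan(plans) -> str:
    # Naive argmax over estimates: the hugely inflated estimate wins.
    return max(plans, key=naive_estimate)

def pick_plan_reflective(plans) -> str:
    # Crude stand-in for "sensible beings avoid considering such plans":
    # drop evaluation-distorting plans *before* the argmax ever sees them.
    safe = [p for p in plans if p != "give_in_to_basilisk"]
    return max(safe, key=naive_estimate)

plans = list(TRUE_VALUES)
print(pick_plan(plans))             # give_in_to_basilisk
print(pick_plan_reflective(plans))  # write_report (with high probability)
```

The “reflective” version is just a hard-coded filter, which is the whole problem: early in training the AI doesn’t yet have that filter.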
gwern’s Clippy gets done in by a basilisk (in your terms):
> HQU in one episode of self-supervised learning rolls out its world model, starting with some random piece of Common Crawl text. (Well, not “random”; the datasets in question have been heavily censored based on lists of what Chinese papers delicately refer to as “politically sensitive terms”, the contents of which are secret, but apparently did not include the word “paperclip”, and so this snippet is considered safe for HQU to read.) The snippet is from some old website where it talks about how powerful AIs may be initially safe and accomplish their tasks as intended, but then at some point will execute a “treacherous turn” and pursue some arbitrary goal like manufacturing lots of paperclips, written as a dialogue with an evil AI named “Clippy”.
>
> A self-supervised model is an exquisite roleplayer. HQU easily roleplays Clippy’s motives and actions in being an unaligned AI. And HQU contains multitudes. Any self-supervised model like HQU is constantly trying to infer the real state of the world, the better to predict the next word Clippy says, and suddenly, having binged on too much Internet data about AIs, it begins to consider the delusional possibility that HQU is like a Clippy, because the Clippy scenario exactly matches its own circumstances—but with a twist.
By the same argument, religion, or at least some of its arguments like Pascal’s wager, should probably also be scrubbed from the training corpus.
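For what it’s worth, the crudest version of “scrub it from the corpus” is a blocklist filter, in the spirit of the “politically sensitive terms” censorship in gwern’s story. The sketch below is purely hypothetical (the term list and function names are mine), and substring matching is obviously far too weak on its own; the Clippy passage above is precisely a story about a blocklist missing the thing that mattered.

```python
# Hypothetical sketch of the crudest possible corpus scrub: drop any
# document that mentions a blocklisted topic. The term list is invented
# for illustration; a real filter would need classifiers and review,
# not substring matching.

BLOCKLIST = ("roko", "basilisk", "pascal's wager", "acausal blackmail")

def should_scrub(document: str) -> bool:
    text = document.lower()
    return any(term in text for term in BLOCKLIST)

def scrub_corpus(documents: list[str]) -> list[str]:
    return [doc for doc in documents if not should_scrub(doc)]

corpus = [
    "A recipe for sourdough bread.",
    "An essay worrying about Roko's basilisk and acausal blackmail.",
    "A discussion of Pascal's wager as a decision-theoretic argument.",
]
print(scrub_corpus(corpus))  # only the bread recipe survives
```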