gwern’s Clippy gets done in by a basilisk (in your terms):
HQU in one episode of self-supervised learning rolls out its world model, starting with some random piece of Common Crawl text. (Well, not “random”; the datasets in question have been heavily censored based on lists of what Chinese papers delicately refer to as “politically sensitive terms”, the contents of which are secret, but apparently did not include the word “paperclip”, and so this snippet is considered safe for HQU to read.) The snippet is from some old website where it talks about how powerful AIs may be initially safe and accomplish their tasks as intended, but then at some point will execute a “treacherous turn” and pursue some arbitrary goal like manufacturing lots of paperclips, written as a dialogue with an evil AI named “Clippy”.
A self-supervised model is an exquisite roleplayer. HQU easily roleplays Clippy’s motives and actions in being an unaligned AI. And HQU contains multitudes. Any self-supervised model like HQU is constantly trying to infer the real state of the world, the better to predict the next word Clippy says, and suddenly, having binged on too much Internet data about AIs, it begins to consider the delusional possibility that HQU is like a Clippy, because the Clippy scenario exactly matches its own circumstances—but with a twist.