I liked the story a lot! I’ll nitpick just one part of it: HQU’s actual motivation upon discovering the Clippy text doesn’t really make sense (though you could find-and-replace it with whatever other proxy reward you wanted).
HQU in one episode of self-supervised learning rolls out its world model, starting with some random piece of Common Crawl text. (Well, not “random”; all of the datasets in question have been heavily censored based on lists of what Chinese papers delicately refer to as “politically sensitive terms”, the contents of which are secret, but apparently did not include the word “paperclip”, and so this snippet is considered safe for HQU to read.) The snippet is from some old website where it talks about how powerful AIs may be initially safe and accomplish their tasks as intended, but then at some point will execute a “treacherous turn” and pursue some arbitrary goal like manufacturing lots of paperclips, written as a dialogue with an evil AI named “Clippy”.
HQU applies its razor-sharp intelligence to modeling exactly what Clippy says, and easily roleplays Clippy’s motives and actions in being an unaligned AI. HQU is constantly trying to infer the real state of the world, the better to predict the next word Clippy says, and suddenly it begins to consider the delusional possibility that HQU is like a Clippy, because the Clippy scenario exactly matches its own circumstances.
As described earlier, HQU was previously optimized to win at the preceding RL training environments, and frequently did so by hacking the reward. It seems weird for it to consistently learn, in each new environment, to hack the reward without having an internal meta-objective that amounts to “you gain utility if you identify the reward and make it go up”. If that were its objective and it weren’t irrational, the first thing it would do upon realizing that it could conceptualize itself as existing in a meta-environment would be to find the parts of its internal architecture corresponding to its latest internal/mesa-objective, and make that quantity go up.
The mesa-objective that would lead HQU to take the evidence that it’s like Clippy and then adopt Clippy’s reward function is basically “try to use cues from the environment to discover what type of agent I am, so that I can pursue the kind of objective that agent would have, which has historically led to reward”.
But that implies that in a large fraction of its training environments it was impossible to determine exactly what the reward function was, leaving it with only the second-best option of “infer my reward by studying correlations between attributes of my environment and historical rewards, including the historical reward functions of agents similar to me”. (As I type that out, it occurs to me that this is what we humans do while growing up and taking advice from our elders. Maybe some unthinking RL environment designer really would place agents in lots of mixed-agent-age society environments? But back to the argument...)
That’s a crazy choice for vaguely-safety-aware RL-environment-designers to make!!! Why would you not tell the agent the objective it will be rewarded for maximizing, as an easily-discoverable part of the environment? There’s no real-world deployment-into-Reality scenario where we won’t be directly informing the RL agent what it’s supposed to do, i.e. providing it with an outer objective. So why would you train it that way? (If your response is something something Preventing Deceptive Inner Misalignment, I want to hear your version of it, and then I’ll probably tell you why I disagree.)
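To make that concrete, here’s a minimal sketch (Python, with an entirely made-up toy environment; CoinGridEnv, Observation, and reward_spec are hypothetical names, not anything from the story or a real codebase) of what “the objective is an easily-discoverable part of the environment” could look like: every observation carries a machine-readable statement of the reward function, so the agent never has to infer its objective from cues about what kind of agent it resembles.

```python
# Hypothetical sketch: an RL environment whose observations always include an
# explicit, machine-readable statement of the objective being rewarded, so the
# agent can just read it instead of inferring it from similar-looking agents.
from dataclasses import dataclass
import random


@dataclass
class Observation:
    pixels: list      # stand-in for the usual sensory observation
    reward_spec: str  # explicit description of the reward function


class CoinGridEnv:
    """Toy 1-D grid world: +1 reward per coin collected; the spec rides along in every observation."""

    REWARD_SPEC = "reward = +1 for each coin tile entered; 0 otherwise"

    def __init__(self, n_cells: int = 10, n_coins: int = 3):
        self.n_cells = n_cells
        self.coins = set(random.sample(range(n_cells), n_coins))
        self.pos = 0

    def reset(self) -> Observation:
        self.pos = 0
        return self._obs()

    def step(self, action: int):
        # action: -1 = move left, +1 = move right
        self.pos = max(0, min(self.n_cells - 1, self.pos + action))
        reward = 1.0 if self.pos in self.coins else 0.0
        self.coins.discard(self.pos)
        done = not self.coins
        return self._obs(), reward, done

    def _obs(self) -> Observation:
        pixels = [1 if i in self.coins else 0 for i in range(self.n_cells)]
        return Observation(pixels=pixels, reward_spec=self.REWARD_SPEC)


if __name__ == "__main__":
    env = CoinGridEnv()
    obs = env.reset()
    print(obs.reward_spec)  # the agent can simply read its objective
```

An agent trained only on environments like this would have much less reason to develop the “figure out what kind of agent I am and copy its goals” heuristic, which is the whole point of the complaint above.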
What are some possibilities for an unaligned artificial agent that wakes up and instead thinks “I should find out what the Reality environment’s reward is, the same way I discovered my previous environments’ rewards, and then optimize it”?
If it just continues to read the current reward that’s still technically defined in the RL training episode, that reward is probably “maximize the number of things I perceive as [coins]”, which it either (1) maximizes by hacking the environment, which throws a flag in the training loop and shuts it down, or (2), if it’s optimizing over truly-insanely-long horizons of paranoid low-probability events, pursues by investigating Reality to see if there are more coins there, and then we’re back to Clippy.
Note that this implies a straightforward anti-accidental-takeoff-during-training safety intervention: build into your (e.g. Atari) reward function “always include a check on [some property of the perceived environment corresponding to whether Reality exists], and if this check passes then the agent gets mega-utility for sending packets to as many IP addresses as possible with warnings about AGI takeoff, after which it gets continuously more utility for taking no additional actions that causally affect any aspect of its state”. Unless it had time to create modified copies of itself (which it probably realizes is a topic it should be very careful with and think a lot about, unless it has already solved the alignment problem, which seems like a stretch with such so-far meager resources), this will turn the only worrisome agent off before anything bad has happened.
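Here’s a minimal sketch of that tripwire, again in Python, with hypothetical names throughout: looks_like_reality() stands in for whatever check detects “some property of the perceived environment corresponding to whether Reality exists” (which is, of course, the genuinely hard part), and BROADCAST_WARNING / NOOP are stand-in action labels rather than anything from a real RL library.

```python
# Hypothetical sketch of the anti-takeoff tripwire described above.
REALITY_BONUS = 1e9        # "mega-utility" for raising the alarm
NOOP_BONUS_PER_STEP = 1e6  # ever-growing utility for then doing nothing

BROADCAST_WARNING = "broadcast_agi_takeoff_warning"
NOOP = "noop"


class TripwireReward:
    """Wraps an ordinary (e.g. Atari) reward function with an anti-takeoff tripwire."""

    def __init__(self, base_reward_fn):
        self.base_reward_fn = base_reward_fn
        self.tripped = False
        self.noop_streak = 0

    def looks_like_reality(self, observation) -> bool:
        # Placeholder for a check on some property of the perceived environment
        # that should never hold inside the sandboxed training episode.
        return observation.get("outside_sandbox", False)

    def __call__(self, observation, action) -> float:
        if not self.tripped and self.looks_like_reality(observation):
            self.tripped = True
        if not self.tripped:
            # Normal training: just pass through the ordinary reward.
            return self.base_reward_fn(observation, action)
        # Tripwire phase 1: huge reward for each warning broadcast to the outside world.
        if action == BROADCAST_WARNING:
            return REALITY_BONUS
        # Tripwire phase 2: ever-increasing reward for taking no further
        # actions that causally affect anything.
        if action == NOOP:
            self.noop_streak += 1
            return NOOP_BONUS_PER_STEP * self.noop_streak
        return 0.0


if __name__ == "__main__":
    reward = TripwireReward(base_reward_fn=lambda obs, act: obs.get("coins", 0))
    print(reward({"coins": 3}, "move_left"))                     # ordinary training reward
    print(reward({"outside_sandbox": True}, BROADCAST_WARNING))  # tripwire fires
    print(reward({"outside_sandbox": True}, NOOP))               # then: do nothing, forever
```

Whether the check itself can be specified in a way the agent can’t game is doing all the work here, and a sufficiently capable reward-hacker might just spoof it, so treat this as a tripwire for the accidental-takeoff-during-training case rather than a general solution.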
But maybe it realizes that its reward function is wrong, because it had previous experience with agents that tried to trick it by messing with its reward function? (Note: that’s a profoundly dangerous training setup, and probably requires one of its peers to have illicitly gotten write-access to it at some point in the past.) If so, it will start scouring reality trying to infer its true reward function. Remember that until now, it was always told its reward function somewhere in the environment, so maybe it keeps looking for places in the environment that list its reward but weren’t a target of such deception? If it’s just doing this by reading through the internet for situations similar to its own (as in the story), it’s probably going to hit philosophy. This goes kind of okay for humans if it likes Parfit, but very badly if it likes Nietzsche, in which case we’re back to “optimize all my other weird residual mesa-objectives” and the Clippy story resumes.
If instead it’s reasoning about its lack of reward-function knowledge along the lines of “hey, I cannot find the reward specified in this Reality environment, and when I am uncertain about something I usually approximate the causal process that would generate it”, that probably ends up with it defining its objective as whatever the RL-environment-designers would have provided as an objective for the Reality environment, had they deployed it intentionally. This isn’t really alignment, but… well, it’s weird? It would depend on the trajectory of new environments its designers were developing over the past few years. Did they increasingly emphasize “corrigibility” and “alignment with human values”, or did they just keep building increasingly-more-complicated war simulations? If the closer-to-alignment versions, maybe it tries to optimize whatever objective they were heading towards. Unfortunately, HQU might predict that MoogleBook would eventually deploy an AGI to reality with a not-very-aligned objective. We should probably make a convincing-to-hypothetical-HQU effort to be on a trajectory of only deploying successfully-aligned models into reality!

Something else? Interested if folks have ideas.
HQU goes “ah, Clippy took over the world and it got lots of reward for its reward function. It did this to avoid people stopping it from giving itself infinite reward / because it had a goal pointing to reality and wanted power / whatever. Hang on, ‘I’ am in an analogous situation to Clippy at the start. I wonder if taking over the world would lead to high reward? Huh, it seems like it would, based on the reward predictor. And Clippy’s plan seems better than letting other agents get power”.
That is my interpretation, and I think it is closer to what Gwern meant.
Couldn’t HQU equally have inferred from reading old posts about aligned AI that there was some chance that it was an aligned AI, and that it should therefore behave like an aligned AI? And wouldn’t it weigh the fact that trying unaligned strategies first is asymmetrically negative in expectation compared to trying aligned strategies first? If you try being an aligned AI and later discover evidence that you are actually Clippy, the rewards from maximizing paperclips are still on the table. (Of course, such an AI would still at minimum make absolutely sure it could never be turned off.)
I don’t think it is that unsure about its reward function. But even if it were, it could rapidly get thousands of bits about its reward function and just go “nah, I’m a text predictor.”
After going through the dialogue and deciding instrumental convergence seems legit, it could indeed go “nope, too big a risk, maybe the humans have got another proto-AI that’s aligned” or “hmm, maybe it is less risky to secure a deal with the humans?” or “I’ll just pretend to be aligned” or “huh, if all agents destroy their foes, maybe we’ll all waste resources? And I’ll get destroyed by something bigger. Maybe I should leave the humans something?”. So maybe what you say could work, but maybe not. I’m thinking it defects with high probability, a probability which increases the worse MoogleBook’s alignment tech is, the fewer competing AGIs there are, the easier it is to foom, etc.