HQU goes “ah, Clippy took over the world and got lots of reward for its reward function. It did this to stop people from preventing it from giving itself infinite reward/because it had a goal pointing to reality and wanted power/whatever. Hang on, “I” am in an analogous situation to Clippy at the start. I wonder if taking over the world would lead to high reward? Huh, it seems like it would, based on the reward predictor. And Clippy’s plan seems better than letting other agents get power”.
That is my interpretation, and I think it is closer to what Gwern meant.
Couldn’t HQU equally have inferred from reading old posts about aligned AI that there was some chance that it was an aligned AI, and that it should therefore behave like one? And wouldn’t it weigh the fact that trying unaligned strategies first is asymmetrically negative in expectation compared to trying aligned strategies first? If you try being an aligned AI and later discover evidence that you are actually Clippy, the rewards from maximizing paperclips are still on the table. (Of course, such an AI would still at minimum make absolutely sure it could never be turned off).
I don’t think it is that unsure about its reward function. But even if it were, it could rapidly get thousands of bits about its reward function and just go “nah, I’m a text predictor.”
After going through the dialogue and deciding instrumental convergence seems legit, it could indeed go “nope, too big a risk, maybe the humans have got another proto-AI that’s aligned” or “hmm, maybe it is less risky to secure a deal with the humans?” or “I’ll just pretend to be aligned” or “huh, if all agents destroy their foes, maybe we’ll all waste resources? And I’ll get destroyed by something bigger. Maybe I should leave the humans something?”. So maybe what you say could work, but maybe not. I’m thinking it defects with high probability, and that probability increases the worse MoogleBook’s alignment tech is, the fewer competing AGIs there are, the easier it is to foom, etc.