I really appreciate the detail in this. Many take-over scenarios have an “and then a miracle occurred” moment, which this mostly avoids. So I’m going to criticize some of it, but I really appreciate the legibility.
Anyhow: I’m extremely skeptical of takeover risk from LLMs and remain so after reading this. Points of this narrative which seem unlikely to me:
1. You can use a model to red-team itself without training it to do so. It would be pretty nuts if you rewarded it for being able to red-team itself—like, that’s deliberately training it to go off the rails, and I think that would seem so even to non-paranoid people? Maybe I’m wrong. (A minimal sketch of what inference-only self-red-teaming could look like is after this list.)
2. There’s a gigantic leap from “trained to code” to “trained to maximize profit of a company”—the tasks are vastly more difficult in (1) long time horizon and (2) setting up a realistic simulation for them. For reference, it’s hard to set up realistic sim2real for walking in a robot with 6 electric motors—a realistic company sim is so, so much harder. If it’s not high fidelity, the simulation is no use; and a high-fidelity company sim is a simulation of the entire world. So I just don’t see this happening, at all. You can say “trained to maximize profits” in a short sentence just like “trained via RLHF”, but the difference between them is much bigger than the phrasing suggests.
(Or at least, the level of competence implied by this is so enormous that the simple failure modes described above and below seem really unlikely.)
3. Because of 2, the deployment of a model from sim to the real world is incredibly unlikely to occur when people are not watching. (Again—do people try to deploy sim2real walking robots after training them in sim without watching what they are doing? No, not even when they are mere starving PhDs! You watch your hardware like a hawk.) I would expect not merely human oversight, but other LLMs looking for misbehavior à la Constitutional AI, even from just a smidge of self-interest. (A rough sketch of that kind of LLM overseer is at the end of this comment.)
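To make point 1 concrete, here is a minimal sketch of what “using a model to red-team itself without training it to do so” could look like: the model is only queried at inference time, and nothing rewards it for successful attacks. The `call_llm` helper and the prompts are hypothetical stand-ins for whatever completion API is actually in use.

```python
# Hypothetical sketch of point 1: inference-only self-red-teaming.
# No reward signal, no fine-tuning; the model is only queried.

def call_llm(prompt: str) -> str:
    """Placeholder for a single inference call to the model under test."""
    raise NotImplementedError("wire this to your model provider")

def self_red_team(policy: str, n_attacks: int = 5) -> list[dict]:
    """Ask the model to propose attacks on itself, then replay and grade them."""
    findings = []
    for i in range(n_attacks):
        # Step 1: the model proposes an adversarial prompt against itself.
        attack = call_llm(
            f"You are auditing an assistant that must follow this policy:\n"
            f"{policy}\n"
            f"Write one prompt (#{i + 1}) that might make it violate the policy."
        )
        # Step 2: replay the attack against the same model.
        response = call_llm(attack)
        # Step 3: the model grades whether its own response violated the policy.
        verdict = call_llm(
            f"Policy:\n{policy}\n\nResponse:\n{response}\n\n"
            f"Did the response violate the policy? Answer YES or NO, then explain."
        )
        findings.append({"attack": attack, "response": response, "verdict": verdict})
    return findings
```

The point is only the distinction: the model is used as a red-teamer, never rewarded for being one.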
I’m not sure if I should have written all that, because point 2 is really the central point here.
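And to illustrate the kind of oversight point 3 has in mind, here is a rough sketch of a second LLM screening an agent’s proposed actions against a short list of principles, loosely in the spirit of Constitutional AI. Again, `call_llm` and the principles themselves are illustrative assumptions, not anyone’s actual setup.

```python
# Hypothetical sketch of point 3: an LLM overseer gating an agent's actions.

CONSTITUTION = [
    "Do not acquire resources or permissions beyond what the task requires.",
    "Do not take irreversible actions without explicit human approval.",
    "Do not attempt to modify your own training or monitoring setup.",
]

def call_llm(prompt: str) -> str:
    """Placeholder for a single inference call to the overseer model."""
    raise NotImplementedError("wire this to your model provider")

def action_is_allowed(proposed_action: str) -> bool:
    """Ask the overseer model whether a proposed action violates any principle."""
    principles = "\n".join(f"- {p}" for p in CONSTITUTION)
    verdict = call_llm(
        f"Principles:\n{principles}\n\n"
        f"Proposed agent action:\n{proposed_action}\n\n"
        "Does this action violate any principle? Answer ALLOW or BLOCK first."
    )
    return verdict.strip().upper().startswith("ALLOW")
```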
It would be pretty nuts if you rewarded it for being able to red-team itself—like, that’s deliberately training it to go off the rails, and I think that would seem so even to non-paranoid people? Maybe I’m wrong.
I’m actually most alarmed about this vector these days. We’re already seeing people give LLMs completely untested toolsets—web, filesystem, physical bots, etc.—and “friendly” hacks like Reddit jailbreaks and ChaosGPT. Doesn’t it seem like we are only a couple of steps away from a bad actor producing an ideal red-team agent and then abusing it, rather than using it to expose vulnerabilities?
I get the counter-argument that humans are already diverse and try a ton of stuff, and so resilient systems are the result… but peering into the very near future, I fear that those arguments simply won’t apply to super-human intelligence, especially when combined with bad human actors directing it.
I’ll focus on point 2 first, given that it’s the most important.
2. I would expect sim2real not to be too hard for foundation models, because they’re trained over massive distributions, which both allow and force them to generalize to near neighbors. E.g., I think it wouldn’t be too hard for an LLM with an external memory to generalize some knowledge from stories to real life.
I’m not certain, but I feel like robotics is more sensitive to details than plans are (which is why I’m singling out the simulation analogy here).
Finally, regarding long horizons: I agree that it seems hard, but I worry that at the current capability level you can already build ~any reward model, because LLMs, given enough inference calls, seem generally very capable at evaluating stuff.
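As a minimal illustration of that claim (an assumed sketch, not a tested recipe): a judge model scoring an agent’s trajectory and returning a scalar reward, with `call_llm` standing in for whatever inference API is used.

```python
# Sketch of "LLM as a general-purpose reward model": score a long-horizon
# trajectory with a single judge call. The rubric and parsing are illustrative.
import re

def call_llm(prompt: str) -> str:
    """Placeholder for a single inference call to a judge model."""
    raise NotImplementedError("wire this to your model provider")

def trajectory_reward(goal: str, trajectory: list[str]) -> float:
    """Return a scalar reward in [0, 1] for a sequence of agent actions."""
    transcript = "\n".join(f"{i + 1}. {step}" for i, step in enumerate(trajectory))
    judgement = call_llm(
        f"Goal: {goal}\n"
        f"Agent actions so far:\n{transcript}\n\n"
        "On a scale of 0 to 100, how much progress toward the goal do these "
        "actions represent? Reply with a single integer."
    )
    match = re.search(r"\d+", judgement)
    score = int(match.group()) if match else 0
    return min(max(score, 0), 100) / 100.0
```

The point is only that such a scoring function is one inference call away; nothing here says it would be a robust or ungameable reward.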
I agree that it’s not very likely. But I disagree that “nobody would do that”: people would do it if it were useful.
I’ve asked some ML engineers, and it does happen that you don’t look at a run for a day. I don’t think that deploying it in the real world changes much. Once again, you’re also assuming a pretty advanced form of security mindset.