jbash comments on Jailbreaking GPT-4′s code interpreter

jbash 14 Jul 2023 0:51 UTC
23 points
11
I’d interpret all of that as OpenAI
1. recognizing that the user is going to get total control over the VM, and
2. lying to the LLM in a token effort to discourage most users from using too many resources.
(1) is pretty much what I’d advise them to do anyway. You can’t let somebody run arbitrary Python code and expect to constrain them very much. At the MOST you might hope to restrict them with a less-than-VM-strength container, and even that’s fraught with potential for error and they would still have access to the ENTIRE container. You can’t expect something like an LLM to meaningfully control what code gets run; even humans have huge problems doing that. Better to just bite the bullet and assume the user owns that VM.

(2) is the sort of thing that achieves its purpose even if it fails from time to time, and even total failure can probably be tolerated.

The hardware specs aren’t exactly earth-shattering secrets; giving that away is a cost of offering the service. You can pretty easily guess an awful lot about how they’d set up both the hardware and the software, and it’s essentially impossible to keep people from verifying stuff like that. Even then, you don’t know that the VM actually has the hardware resources it claims to have. I suspect that if every VM on the physical host actually tried to use “its” 54GB, there’d be a lot of swapping going on behind the scenes.
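To make the "verifying stuff like that" point concrete, here is a minimal sketch of what anyone with code execution inside a Linux guest could run. The function name `reported_resources` is hypothetical, and the sketch assumes `/proc/meminfo` is present. It only reads what the guest OS claims to have, so it says nothing about whether the physical host can actually back those numbers.

```python
import os
import platform


def reported_resources(meminfo_path="/proc/meminfo"):
    """Return what the guest OS *claims* to have.

    These are self-reported figures; they do not prove the host
    has that much physical CPU or RAM available for this VM.
    """
    info = {
        "cpus": os.cpu_count(),
        "machine": platform.machine(),
        "mem_total_kib": None,
    }
    try:
        with open(meminfo_path) as f:
            for line in f:
                # Typical line: "MemTotal:  56623104 kB"
                if line.startswith("MemTotal:"):
                    info["mem_total_kib"] = int(line.split()[1])
                    break
    except OSError:
        # Not Linux, or /proc is not mounted: leave the value as None.
        pass
    return info


if __name__ == "__main__":
    print(reported_resources())
```

Checking whether the memory is really there would mean allocating and touching it, which is exactly the kind of resource use the provider is trying to discourage.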

I assume that the VM really can’t talk to much if anything on the network, and that that is enforced from OUTSIDE.

I don’t know, but I would guess that the whole VM has some kind of externally imposed maximum lifetime independent of the 120 second limit on the Python processes. It would if I were setting it up.
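As an illustration of why a per-process limit and an external lifetime are different things, here is a minimal sketch of a per-process time limit enforced from inside the same machine. The helper name `run_with_limit` is hypothetical, and nothing here is known about how OpenAI actually enforces its 120 second limit. The point is that a limit like this lives in the same environment as the code it constrains, so a user who owns the VM can in principle sidestep it, whereas a lifetime imposed by the hypervisor cannot be touched from inside.

```python
import subprocess
import sys


def run_with_limit(code, seconds):
    """Run a Python snippet in a child process with a wall-clock limit.

    Returns (returncode, stdout), or (None, "") if the limit was hit.
    This is an in-guest control: it only constrains code that is
    launched through this function.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=seconds,  # subprocess.run kills the child on timeout
        )
        return proc.returncode, proc.stdout
    except subprocess.TimeoutExpired:
        return None, ""


if __name__ == "__main__":
    print(run_with_limit("print('hi')", 10))
    print(run_with_limit("import time; time.sleep(30)", 0.5))
```

Anything the child manages to detach or launch outside this wrapper is not covered by the limit, which is why an outer, externally enforced lifetime is the part that matters.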

The bit about retaining state between sessions is interesting, though. Hopefully it only applies to sessions of the same user, but even there it violates an assumption that things outside of the VM might be relying on.
- Herb Ingram 14 Jul 2023 7:24 UTC
  12 points
  2
  Parent
  I agree. To me, the most interesting aspects of this (quite interesting and well-executed) exercise are getting a glimpse into OpenAI’s approach to cybersecurity, as well as the potentially worrying fact that GPT3 made meaningful contributions to finding the “exploits”.
  
  Given what was found out here, OpenAI’s security approach seems to be “not terrible” but also not significantly better than what you’d expect from an average software company, which isn’t necessarily encouraging because those get hacked all the time. It’s definitely not what people here call “security mindset”, which casts doubt on OpenAI’s claim to be “taking the dangers very seriously”. I’d expect to hear about something illegal being done with one of these VMs before too long, assuming they continue and expand the service, which I expect they will.
  
  I’m sure there are also security experts (both at OpenAI and elsewhere) looking into this. Given OpenAI’s PR strategy, they might be able to shut down such services “due to emerging security concerns” without much reputational damage. (Many companies are economically compelled to keep services running that they know are compromised or that have known vulnerabilities, and instead pretend not to know about them or at least avoid informing customers for as long as possible.) Not sure how much e.g. Microsoft would push back on that. All in all, security experts finding something might be taken seriously.
  
  I’m increasingly worried (while ascribing a decent chance, mind you, that “AI might well go about as badly for us as most of history but not worse”) about what happens when GPT-X has hacking skills that are, say, on par with the median hacker. Being able to hack easy-ish targets at scale might not be something the internet can handle, potentially resulting in, e.g., an evolutionary competition between AIs to build a super-botnet.
  - Kaj_Sotala 14 Jul 2023 18:35 UTC
    8 points
    5
    Parent
    It’s definitely not what people here call “security mindset”, which casts doubt on OpenAI’s claim to be “taking the dangers very seriously”.
    How is it an indication of not having a security mindset? Setting things up with the expectation that the interpreter would be jailbroken seems to me like slightly more evidence in favor of having a security mindset than of not having it.
    - Herb Ingram 14 Jul 2023 22:15 UTC
      3 points
      0
      Parent
      I guess it depends on whether this post found anything at all that can be called questionable security practice. Maybe it didn’t but the author was also no cybersecurity expert. Upon reflection, my earlier judgement was premature and the phrasing overconfident.
      
      In general, I assume that OpenAI would view a serious hack as quite catastrophic, as it might e.g. leak their model (not an issue in this case), severely damage their reputation and undermine their ongoing attempt at regulatory capture. However, such situations didn’t prevent shoddy security practices in countless cybersecurity disasters.
      
      I guess for this feature even the most serious vulnerabilities “just” lead to some Azure VMs being hacked, which has no relevance for AI safety. It might still be indicative of OpenAI’s approach to security, which usually isn’t so nuanced within organizations as to differ wildly between applications where stakes are different. So it’s interesting how secure the system really is, which we won’t know until someone hacks it or some whistleblower emerges.
      
      Some of my original reasoning was this:
      
      You might argue that the “inner sandbox” is only used to limit resource use (for users who do not bother jailbreaking it) and to examine how users will act, as well as how badly exactly the LLM itself will fare against jailbreaking. In this case studying how people jailbreak it may be an integral part of the whole feature.
      
      However, even if that is the case, to count as “security mindset”, the “outer sandbox” has to be extremely good and OpenAI needs to be very sure that it is. To my (very limited) knowledge in cybersecurity, it’s an unusual idea that you can reconcile very strong security requirements with purposely not using every opportunity to make it more secure. Maybe the idea that comes closest would be a “honeypot”, which this definitely isn’t.
      
      So that suggests they purposely took a calculated security risk for some mixture of research and commercial reasons, which they weren’t compelled to do. Depending on how dangerous they really think such AI models are or may soon become, how much what they learn from the experiment benefits future security, and how confident they are in the outer sandbox, the calculated risk might make sense. But assuming by default that the outer sandbox is “normal industry standard”, it’s incompatible with the level of worry they claim when pursuing regulatory capture.