iceman comments on Arguments for optimism on AI Alignment (I don’t endorse this version, will reupload a new version soon.)

iceman 15 Oct 2023 19:11 UTC
18 points
4
On the topic of security mindset, the thing that the LW community calls “security mindset” isn’t even an accurate rendition of what computer security people would call security mindset. As noted by lc, actual computer security mindset is POC || GTFO, or trying to translate that into lesswrongesse, you do not have warrant to believe in something until you have an example of the thing you’re maybe worried about being a real problem because you are almost certain to be privileging the hypothesis.
What links here?
- Zach Furman 16 Oct 2023 22:43 UTC
  18 points
  13
  Parent
  In the cybersecurity analogy, it seems like there are two distinct scenarios being conflated here:
  1) Person A says to Person B, “I think your software has X vulnerability in it.” Person B says, “This is a highly specific scenario, and I suspect you don’t have enough evidence to come to that conclusion. In a world where X vulnerability exists, you should be able to come up with a proof-of-concept, so do that and come back to me.”
  2) Person B says to Person A, “Given XYZ reasoning, my software almost certainly has no critical vulnerabilities of any kind. I’m so confident, I give it a 99.99999%+ chance.” Person A says, “I can’t specify the exact vulnerability your software might have without it in front of me, but I’m fairly sure this confidence is unwarranted. In general it’s easy to underestimate how your security story can fail under adversarial pressure. If you want, I could name X hypothetical vulnerability, but this isn’t because I think X will actually be the vulnerability, I’m just trying to be illustrative.”
  Story 1 seems to be the case where “POC or GTFO” is justified. Story 2 seems to be the case where “security mindset” is justified.
  It’s very different to suppose a particular vulnerability exists (not just as an example, but as the scenario that will happen), than it is to suppose that some vulnerability exists. Of course in practice someone simply saying “your code probably has vulnerabilities,” while true, isn’t very helpful, so you may still want to say “POC or GTFO”—but this isn’t because you think they’re wrong, it’s because they haven’t given you any new information.
  Curious what others have to say, but it seems to me like this post is more analogous to story 2 than story 1.
  - bigjeff5 17 Oct 2023 20:27 UTC
    7 points
    0
    Parent
    The reason Person A in scenario 2 has the intuition that Person B is very wrong is because there are dozens, if not hundreds of examples where people claimed no vulnerabilities and were proven wrong. Usually spectacularly so, and often nearly immediately. Consider the fact that the most robust software developed by the most wealthy and highly motivated companies in the world, who employ vast teams of talented software engineers, have monthly patch schedules to fix their constant stream vulnerabilities, and I think it’s pretty easy to immediately discount anybody’s claim of software perfection without requiring any further evidence.
    All the evidence Person A needs is the complete and utter lack of anybody having achieved such a thing in the history of software to discount Person B’s claims.
    I’ve never heard of an equivalent example for AI. It just seems to me like Scenario 2 doesn’t apply, or at least it cannot apply at this point in time. Maybe in 50 years we’ll have the vast swath of utter failures to point to, and thus a valid intuition against someone’s 9-9′s confidence of success, but we don’t have that now. Otherwise people would be pointing out examples in these arguments instead of vague unease regarding problem spaces.
    - Daniel Kokotajlo 18 Oct 2023 14:36 UTC
      4 points
      1
      Parent
      Well, no one has built an AGI yet, and if your plan is to wait until we have years of experience with unaligned AGIs before it’s OK to start worrying about the problem, that’s a bad plan.
      
      Also, there are things which are not AGI but which are similar in various ways (software, deep neural nets, rocket navigation mechanisms, prisons, childrearing strategies, tiger-training-strategies) which provide ample examples of unseen errors.
      
      Also, like I said, there ARE plenty of POCs for AGI risk.
- Steven Byrnes 16 Oct 2023 15:47 UTC
  16 points
  9
  Parent
  At the very least I think it would be more accurate to say “one aspect of actual computer security mindset is POC || GTFO”. Right? Are you really arguing that there’s nothing more to it than that?? That seems insane to me.
  Even leaving that aside, here’s a random bug thread:
  Mozilla developers identified and fixed several stability bugs in the browser engine used in Firefox and other Mozilla-based products. Some of these crashes showed evidence of memory corruption under certain circumstances and we presume that with enough effort at least some of these could be exploited to run arbitrary code. [emphasis added]
  IIUC they treated these crashes as a security vulnerability, not a mere usability problem, and thus did things like not publicly disclosing the details until they had a fix ready to go, categorizing the fix as a high-priority security update, etc.
  If your belief is that “actual computer security mindset is POC||GTFO”, then I think you’d have to say that these Mozilla developers do not have computer security mindset, and instead were being silly and overly paranoid. Is that what you think?
  - lc 17 Oct 2023 6:10 UTC
    17 points
    5
    Parent
    You’re right that this is definitely not “security mindset”. Iceman is distorting the point of the original post. But also, the reason Mozilla’s developers can do that and get public credit for it is partially because the infosec community has developed tens of thousands of catastrophic RCE’s from very similar exploit primitives, and so there is loads of historical evidence that those particular kinds of crashes lead to exploitable bugs. Alignment researchers lack the same shared understanding—they’re mostly philosopher-mathematicians with no consensus even among themselves about what the real issues are, and so if one tries to claim credit for averting catastrophe in a similar situation it’s impossible to tell if they’re right.
    - bigjeff5 17 Oct 2023 20:32 UTC
      3 points
      2
      Parent
      This is exactly right. To put it more succinctly: Memory corruption is a known vector for exploitation, therefore any bug that potentially leads to memory corruption also has the potential to be a security vulnerability. Thus memory corruption should be treated with similar care as a security vulnerability.
- lc 17 Oct 2023 5:42 UTC
  14 points
  2
  Parent
  POC || GTFO is not “security mindset”, it’s a norm. It’s like science in that it’s a social technology for making legible intellectual progress on engineering issues, and allows the field to parse who is claiming to notice security issues to signal how smart they are vs. who is identifying actual bugs. But a lack of “POC || GTFO” culture doesn’t tell you that nothing is wrong, and demanding POCs for everything obviously doesn’t mean you understand what is and isn’t secure. Or to translate that into lesswrongese, reversed stupidity is not intelligence.
  - iceman 18 Oct 2023 17:01 UTC
    6 points
    0
    Parent
    But POC||GTFO is really important to constraining your expectations. We do not really worry about Rowhammer since the few POCs are hard, slow and impractical. We worry about Meltdown and other speculative execution attacks because Meltdown shipped with a POC that read passwords from a password manager in a different process, was exploitable from within Chrome’s sandbox, and my understanding is that POCs like that were the only reason Intel was made to take it seriously.
    
    Meanwhile, Rowhammer is maybe a real issue but is so hard to pull off consistently and stealthily that nobody worries about it. My recollection was when it was first discovered, people didn’t panic that much because there wasn’t warrant to panic. OK, so there was a problem with the DRAM. OK, what are the constraints on exploitation? Oh, the POCs are super tricky to pull off and will often make the machine hard to use during exploitation?
    
    A POC provides warrant to believe in something.
    - niplav 3 Mar 2024 3:26 UTC
      2 points
      0
      Parent
      I’m confused about how POC||GTFO fits together with cryptographers starting to worry about post-quantum cryptography already in 2006, when the proof of concept was we have factored 15 into 3×5 using Shor’s algorithm? (They were running a whole conference on it!)
- Daniel Kokotajlo 16 Oct 2023 13:57 UTC
  10 points
  4
  Parent
  Citation needed? The one computer security person I know who read Yudkowsky’s post said it was a good description of security mindset. POC||GTFO sounds useful and important too but I doubt it’s the core of the concept.
  
  Also, if the toy models, baby-AGI-setups like AutoGPT, and historical examples we’ve provided so far don’t meet your standards for “example of the thing you’re maybe worried about” with respect to AGI risk, (and you think that we should GTFO until we have an example that meets your standards) then your standards are way too high.
  
  If instead POC||GTFO applied to AGI risk means “we should try really hard to get concrete, use formal toy models when possible, create model organisms to study, etc.” then we are already doing that and have been.
  - Noosphere89 16 Oct 2023 14:30 UTC
    11 points
    3
    Parent
    On POCs for misalignment, specifically for goal misgeneralization, there are pretty fundamental differences between what was shown and what was predicted so far, and one of them is that the train and test behavior in different environments are similar or the same, while in goal misgeneralization speculations, the train and test behavior are wildly different:
    
    Rohin Shah has a comment on why most POCs aren’t that great here:
    
    https://www.lesswrong.com/posts/xsB3dDg5ubqnT7nsn/poc-or-or-gtfo-culture-as-partial-antidote-to-alignment#P3phaBxvzX7KTyhf5
    - Daniel Kokotajlo 16 Oct 2023 16:42 UTC
      2 points
      0
      Parent
      Nevertheless, if you think that this isn’t good enough and that people worried about AGI risk should GTFO until they have something better, you are the one who is wrong.
      - Noosphere89 16 Oct 2023 17:04 UTC
        3 points
        −1
        Parent
        I don’t think people worried about AGI risk should GTFO.
        
        I do think we should stop giving them as much credit as we do, because of the fact that you are likely to privilege the hypothesis, and it does mean that we shouldn’t count the POCs as vindicating the people worried about AI safety, since their evidence doesn’t really work to support the claim of goal misgeneralization.
        Daniel Kokotajlo 18 Oct 2023 14:46 UTC
        4 points
        0
        Parent
        I think that’s a vague enough claim that it’s basically a setup for motte-and-bailey. “Stop giving them as much credit as we do.” Well I think that if ‘we’ = society in general, then we should start giving them way more credit, in fact. If ‘we’ = various LWers who don’t think for themselves and just repeat what Yudkowsky says, then yes I agree. If ‘we’ = me, then no thank you I believe I am allocating credit appropriately, I take the point about privileging the hypothesis but I was well aware of it already.
        Noosphere89 18 Oct 2023 17:23 UTC
        2 points
        0
        Parent
        What this would look like in practice would be the following (Taken from the proposed break of optimization daemon/inner misalignment):
        
        Someone proposes a break of AI that threatens alignment like optimization daemons.
        
        We test the claim on toy AIs, and either it doesn’t work or it does work on them, then we move to the next step.
        
        We test the alignment break on a more realistic setting, and it turns out that the perceived break was going away.
        
        Now, the key point is if a proposed break goes away or is made harder in more realistic settings, and especially if it keeps happening, we need to avoid giving credit to them for predicting the failure.
        
        More generally, one issue I have is that I perceive an asymmetry between AI is dangerous and AI is safe people, in that if people were wrong about a danger, they’ll forget or not reference the fact that they’re wrong, but if they’re right about a danger, even if it’s much milder and some of their other predictions were wrong, people will treat you as an oracle.
        
        A quote from lc’s post on POC or GTFO culture as counter to alignment wordcelism explains my thoughts on the issue better than I can:
        
        The computer security industry happens to know this dynamic very well. No one notices the Fortune 500 company that doesn’t suffer the ransomware attack. Outside the industry, this active vs. negative bias is so prevalent that information security standards are constantly derided as “horrific” without articulating the sense in which they fail, and despite the fact that online banking works pretty well virtually all of the time. Inside the industry, vague and unverified predictions that Companies Will Have Security Incidents, or that New Tools Will Have Security Flaws, are treated much more favorably in retrospect than vague and unverified predictions that companies will mostly do fine. Even if you’re right that an attack vector is unimportant and probably won’t lead to any real world consequences, in retrospect your position will be considered obvious. On the other hand, if you say that an attack vector is important, and you’re wrong, people will also forget about that in three years. So better list everything that could possibly go wrong[1], even if certain mishaps are much more likely than others, and collect oracle points when half of your failure scenarios are proven correct.
        
        Vladimir_Nesov 18 Oct 2023 20:39 UTC
        4 points
        0
        Parent
        Scott Alexander writes about the asymmetry in From Nostradamus To Fukuyama. Reversing biases of public perception isn’t much use for sorting out correctness of arguments.
        Noosphere89 18 Oct 2023 23:10 UTC
        0 points
        −6
        Parent
        I do have other issues with the security mindset, but that is an important issue I had.
        
        Turning to this part though, I think I might see where I disagree:
        
        Reversing biases of public perception isn’t much use for sorting out correctness of arguments.
        
        It’s not just public perception, but also the very researchers are biased to believe that danger is or will happen. Critically, since this is asymmetrical, it means that this has more implications for doomy people than for optimistic people.
        
        It’s why I’m a priori a bit skeptical of AI doom, and it’s also why it’s consistent to believe that the real probability of doom is very low, almost arbitrarily low, while people think the probability of doom is quite high: You don’t pay attention to the not doom or the things that went right, only the things that went wrong.
        Vladimir_Nesov 19 Oct 2023 20:54 UTC
        6 points
        0
        Parent
        
        Reversing biases of public perception isn’t much use for sorting out correctness of arguments.
        
        It’s not just public perception, but also the very researchers are biased to believe that danger is or will happen.
        
        The researchers are not the arguments. You are discussing correctness of researchers.
        Noosphere89 19 Oct 2023 21:41 UTC
        0 points
        0
        Parent
        
        The researchers are not the arguments.
        
        Yes, that’s true, but I have more evidence than that, and in particular I have evidence that directly argues against the proposition of AI doom, and that a lot of common arguments for AI doom.
        
        The researchers aren’t the arguments, but the properties of the researchers looking into the arguments, especially the way they’re biased, does provide some evidence for certain proposition.