In the cybersecurity analogy, it seems like there are two distinct scenarios being conflated here:
1) Person A says to Person B, “I think your software has X vulnerability in it.” Person B says, “This is a highly specific scenario, and I suspect you don’t have enough evidence to come to that conclusion. In a world where X vulnerability exists, you should be able to come up with a proof-of-concept, so do that and come back to me.”
2) Person B says to Person A, “Given XYZ reasoning, my software almost certainly has no critical vulnerabilities of any kind. I’m so confident, I give it a 99.99999%+ chance.” Person A says, “I can’t specify the exact vulnerability your software might have without it in front of me, but I’m fairly sure this confidence is unwarranted. In general it’s easy to underestimate how your security story can fail under adversarial pressure. If you want, I could name X hypothetical vulnerability, but this isn’t because I think X will actually be the vulnerability, I’m just trying to be illustrative.”
Story 1 seems to be the case where “POC or GTFO” is justified. Story 2 seems to be the case where “security mindset” is justified.
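To make the "POC or GTFO" standard in story 1 concrete, here is a minimal sketch of the kind of specific, demonstrable claim Person A would be expected to back up. Everything in it (the function, the buffer size, the input) is invented for illustration; it's just the classic unchecked-copy bug, not anything taken from the post.

```c
#include <string.h>

/* Hypothetical vulnerable code: Person A claims "copy_name overflows its
 * 16-byte buffer on long input" -- a specific vulnerability X.            */
void copy_name(const char *input) {
    char name[16];
    strcpy(name, input);   /* no length check: writes past `name` if input is >= 16 bytes */
}

/* The proof-of-concept Person B asks for: a concrete input that triggers the bug.
 * Running this under a memory checker (e.g. AddressSanitizer) demonstrates the
 * out-of-bounds write -- far stronger evidence than "I suspect there's a bug". */
int main(void) {
    copy_name("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA");  /* 32 'A's, larger than the 16-byte buffer */
    return 0;
}
```

The point of the sketch is that in story 1 the claim is narrow enough that a demonstration like this is a fair thing to demand; in story 2 it isn't, because the claim being disputed is Person B's sweeping confidence, not any one specific bug.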
It’s very different to suppose a particular vulnerability exists (not just as an example, but as the scenario that will happen), than it is to suppose that some vulnerability exists. Of course in practice someone simply saying “your code probably has vulnerabilities,” while true, isn’t very helpful, so you may still want to say “POC or GTFO”—but this isn’t because you think they’re wrong, it’s because they haven’t given you any new information.
Curious what others have to say, but it seems to me like this post is more analogous to story 2 than story 1.
The reason Person A in scenario 2 has the intuition that Person B is very wrong is that there are dozens, if not hundreds, of examples where people claimed to have no vulnerabilities and were proven wrong, usually spectacularly, and often almost immediately. Consider that even the most robust software, built by the wealthiest and most highly motivated companies in the world, employing vast teams of talented engineers, still needs monthly patch schedules to fix a constant stream of vulnerabilities. Given that, I think it's pretty easy to immediately discount anybody's claim of software perfection without requiring any further evidence.
All the evidence Person A needs to discount Person B's claims is the complete and utter absence of anybody ever having achieved such a thing in the history of software.
I’ve never heard of an equivalent example for AI. It just seems to me like Scenario 2 doesn’t apply, or at least that it can’t apply at this point in time. Maybe in 50 years we’ll have a vast swath of utter failures to point to, and thus a valid intuition against someone’s nine-nines confidence of success, but we don’t have that now. Otherwise, people would be pointing to examples in these arguments instead of expressing vague unease about problem spaces.
Well, no one has built an AGI yet, and if your plan is to wait until we have years of experience with unaligned AGIs before it’s OK to start worrying about the problem, that’s a bad plan.
Also, there are things which are not AGI but which are similar in various ways (software, deep neural nets, rocket navigation mechanisms, prisons, childrearing strategies, tiger-training-strategies) which provide ample examples of unseen errors.
Also, like I said, there ARE plenty of POCs for AGI risk.