Yudkowsky once framed this a lot better than any of the nanopunk framings:
If you have an untrustworthy general superintelligence generating English strings meant to be “reasoning/arguments/proofs/explanations” about eg a nanosystem design, then I would not only expect the superintelligence to be able to fool humans in the sense of arguing for things that were not true in a way that fooled the humans, I’d expect the superintelligence to be able to covertly directly hack the humans in ways that I wouldn’t understand even after having been told what happened. So you must have some prior belief about the superintelligence being aligned before you dared to look at the arguments.
In the contest you mentioned (which I tried to make sure you didn’t miss, but I wasn’t in time to catch you and several others), I made a highly optimized version of this for policymakers:
If you have an untrustworthy general superintelligence generating [sentences] meant to [prove something], then I would not only expect the superintelligence to be [smart enough] to fool humans in the sense of arguing for things that were [actually lies]… I’d expect the superintelligence to be able to covertly hack the human [mind] in ways that I wouldn’t understand, even after having been told what happened[, because a superintelligence is, by definition, at least as smart to humans as humans are to chimpanzees]. So you must have some belief about the superintelligence being aligned before you dared to look at [any sentences it generates].
What you’re talking about, bypassing talk of superintelligence or recursive self-improvement, is something that I agree would be pure gold but only if it’s possible and reasonable to skip that part. Hyperintelligent AI is sorta the bread and butter of the whole thing, but talking to policymakers means putting yourself in their shoes and you’ve done a fantastic job of that here.
The problem is that this body of knowledge is very, very cursed. There are massive vested interests, a ton of money and national security, built on a foundation of what is referred to as “bots” in this post. Talking about it scares me enough, but claiming that you have solutions for it is a very risky thing when there are many people deep in the military who live and breathe this stuff every day (including using AI for killing people such as the AI mounted on nuclear stealth missiles). At this time, I don’t know how much sense it makes to risk posing as someone you’re not (or, at least, accidentally making a disinterested policymaker incorrectly think that’s what you’re doing).
But I have essentially zero experience working in IT or cybersecurity, so I can’t say for sure. I super-upvoted this post because I strongly think it’s worth consideration.
Also, for the record, your phrasing of this is superior:
...it’s likely that when the bad-actor bots get smart enough, they’ll be able to make themselves smarter, and we don’t know where the limit is on that… It’s already happening—the bots are getting smart enough to put us on the cusp of a new generation of info-warfare, and they’re not even very smart yet.
Generally, your approach is highly worthy of the contest and I wish you were there at the time (even now, I’m still thinking of one-liners that I should have thought of before it ended). There is allegedly another one coming up with another $20k but it is for fully fleshed out reports, not individual paragraphs.
Also, this might be a really good fit for the Red-Teaming Contest, which has 5 times the total payout. I think it’s too apparent and sensible to count as red-teaming, but it’s possible that a very large number of alignment people disagree with me on that and find it very unreasonable.
IMO, saying that “people should stfu about nanopunk scenarios” seems worthy enough for the red-teaming contest on its own. AI is taken very seriously by the national security establishment, and deranged-appearing cults are not.
Thanks for your replies! I’m really glad my thoughts were valuable. I did see your post promoting the contest before it was over, but my thoughts on this hadn’t coalesced yet.
At this time, I don’t know how much sense it makes to risk posing as someone you’re not (or, at least, accidentally making a disinterested policymaker incorrectly think that’s what you’re doing).
Thanks especially for this comment. I noticed I was uncomfortable while writing that part of my post, and I should have paid more attention to that signal. I think I didn’t want to water down the ending because the post was already getting long. I should have put a disclaimer that I didn’t really know how to conclude, and that section is mostly a placeholder for what people who understand this better than me would pitch. To be clearer here: I do not intend to express any opinion on what to tell policymakers about solutions to these problems. I know hardly anything about practical alignment, just the general theory of why it is important. (I’m going to edit my post to point at this comment to make sure that’s clear.)
What you’re talking about, bypassing talk of superintelligence or recursive self-improvement, is something that I agree would be pure gold but only if it’s possible and reasonable to skip that part. Hyperintelligent AI is sorta the bread and butter of the whole thing [...]
Yup, I agree completely. I should have said in the post that I only weakly endorse my proposed approach. It would need to be workshopped to explore its value—especially, which signals from the listener suggest going deeper into the rabbit hole versus popping back out into impacts on present-day issues. My experience talking to people outside my field is that at the first signal someone doesn’t take your niche issue seriously, you had better immediately connect it back to something they already care about, or you’ve lost them. I wrote with the intention of providing the lowest-common-denominator set of arguments to get someone to take anything in the problem space seriously, so they at least have a hope of being worked slowly towards the idea of the real problem. I also wrote it at an ELI5 level for politicians who think the internet still runs on telephones. So like a “worst case scenario” conversation. But if this approach got someone worrying about the wrong aspect of the issue or misunderstanding critical pieces, it could backfire.
If I were going to update my pitch to better emphasize superintelligence, my intuition would be to lean into the video-spoofing angle. It doesn’t require any technical background to imagine a fake person socially engineering you on a Zoom call. GPT-3 examples are already sufficient to drive home the Turing Test “this is really already happening” point. So the missing pieces are just seamless audio/video generation, and the ability of the bot to improvise its text generation towards a goal as it converses. It’s then a simple further step to envision the bad-actor bot’s improvisation getting better and better until it doesn’t make mistakes, is smarter than a person, and can manipulate us into doing horrible things—especially because it can be everywhere at once. This argument scales from there to however much “AI-pill” the listener can swallow. I think the core strength of this framing is that the AI is embodied. Even if it takes the form of multiple people, you can see it and speak to it. You could experience it getting smarter, if that happened slowly enough. This should help someone naive get a handle on what it would feel like to be up against such an adversary.
The problem is that this body of knowledge is very, very cursed. There are massive vested interests, a ton of money and national security, built on a foundation of what is referred to as “bots” in this post.
Yeah, absolutely...I was definitely tiptoeing around this in my approach rather than addressing it head on. That’s because I don’t have good ideas about that and suspect there might not be any general solutions. Approaching a person with those interests might just require a lot more specific knowledge and arguments about those interests to be effective. There is that old saying “You cannot wake someone who is pretending to sleep.” Maybe you can, but you have to enter their dream to do it.
There is that old saying “You cannot wake someone who is pretending to sleep.” Maybe you can, but you have to enter their dream to do it.
I understand that vagueness is really appropriate under some circumstances. But you flipped a lot of switches in my brain when you wrote that, regarding things you might potentially have been referencing. Was that a reference to things like sensor fusion or sleep tracking? Was it about policymakers who choose to be vague, or about skeptical policymakers being turned off by off-putting phrases like “doom soon” or “cosmic endowment”? Or was it something else that I didn’t understand? Whatever you’re comfortable with divulging is fine with me.
Whoops, apologies, none of the above. I meant to use the adage “you can’t wake someone who is pretending to sleep” similarly to the old “It is difficult to make a man understand a thing when his salary depends on not understanding it.” A person with vested interests is like a person pretending to sleep. They are predisposed not to acknowledge arguments misaligned with their vested interests, even if they do in reality understand and agree with the logic of those arguments. The most classic form of bias.
I was trying to express that in order to make any impression on such a person you would have to enter the conversation on a vector at least partially aligned with their vested interests, or risk being ignored at best and creating an enemy at worst. Metaphorically, this is like entering into the false “dream” of the person pretending to sleep.
Make sure to click the bell for the active bounties tag.