> The case where we need extreme paranoia is where both (1) an adversary is plausibly likely to pay attention, and (2) our research might allow for immediate and direct and very large capability gains, without any significant theory-practice gap.
The problem with this is...
In my model, most useful research is incremental and builds upon itself. As you point out, it’s difficult to foresee how useful what you’re currently working on will be, but if it is useful, your or others’ later work will probably use it as an input.
The successful, fully filled-out alignment research tech tree will necessarily contain crucial capabilities insights. That is, (2) will necessarily be true if we succeed.
At the very end, once we have the alignment solution, we’ll need to ensure that it’s implemented, which means influencing AI Labs and/or government policy, which means becoming visible and impossible-to-ignore by these entities. So (1) will be true as well. Potentially because we’ll have to publish a lot of flashy demos.
In theory, this can be done covertly, by e.g. privately contacting key people and quietly convincing them or something. I wouldn’t rely on us having this kind of subtle skill and coordination.
So, operating under Nobody Cares, you do incremental research. The usefulness of any given piece you’re working on is very dubious 95% of the time, so you hit “submit” 95% of the time. You keep publishing until you strike gold: until you combine some large body of published research with a novel insight and realize that the result clearly advances capabilities. At last, you can clearly see that (2) is true, so you don’t publish and switch to high security. Except… by this point, 95% of the insights necessary for that capabilities leap have already been published. You’re in the unstable situation where the last 5% may be contributed by a random other alignment/capabilities researcher looking at the published 95% from the right angle and posting the last bit without thinking. And whether it’s already time to influence the AI industry/policy or that time comes within a few years, (1) will become true shortly, so there’ll be lots of outsiders poring over your work.
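To put rough numbers on how unstable that last step is, here’s a minimal sketch; the outsider count and per-year probability are purely hypothetical, picked only to illustrate how quickly “someone else posts the last bit” becomes the default outcome once the 95% is public and (1) starts drawing eyes to it.

```python
# Minimal sketch of the "unstable 95%" situation; all numbers are hypothetical.

def p_last_piece_published(n_outsiders: int, p_per_year: float, years: float) -> float:
    """Probability that at least one outside researcher independently finds and
    publishes the last 5%, if each has an independent chance p_per_year of doing
    so in any given year once the other 95% is public."""
    return 1 - (1 - p_per_year) ** (n_outsiders * years)

# Hypothetical: 200 outsiders start poring over the published work, each with a
# 1% chance per year of spotting the final step.
print(round(p_last_piece_published(200, 0.01, 1), 2))  # ~0.87 within a year
print(round(p_last_piece_published(200, 0.01, 3), 2))  # ~1.0 within three years
```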
Basically, I’m concerned that following “Nobody Cares” predictably sets us up to fail at the very end. (1) and (2) are not true very often, but we can expect that they will be true if our work proves useful at all.
Not that I have any idea what to do about that.
One part I disagree with: I do not expect that implementing an alignment solution will involve influencing government/labs, conditional on having an alignment solution at all. Reason: alignment requires understanding basically-all the core pieces of intelligence at a sufficiently-detailed level that any team capable of doing it will be very easily capable of building AGI. It is wildly unlikely that a team not capable of building AGI is even remotely capable of solving alignment.
Another part I disagree with: I claim that, if I publish 95% of the insights needed for X, then the average time before somebody besides me or my immediate friends/coworkers implements X goes down by, like, maybe 10%. Even if I publish 100% of the insights, the average time before somebody besides me or my immediate friends/coworkers implements X only goes down by maybe 20%, if I don’t publish any flashy demos.
A concrete example to drive that intuition: imagine a software library which will do something very useful once complete. If the library is 95% complete, nobody uses it, and it’s pretty likely that someone looking to implement the functionality will just start from scratch. Even if the library is 100% complete, without a flashy demo few people will ever find it.
All that said, there is a core to your argument which I do buy. The worlds where our work is useful at all for alignment are also the worlds where our work is most likely to be capabilities relevant. So, I’m most likely to end up regretting publishing something in exactly those worlds where the thing is useful for alignment; I’m making my life harder in exactly those worlds where I might otherwise have succeeded.
> I do not expect that implementing an alignment solution will involve influencing government/labs, conditional on having an alignment solution at all
Mmm, right, in this case the fact that the rest of the AI industry is being carefree about openly publishing WMD design schematics is actually beneficial to us — our hypothetical AGI group won’t be missing many insights that other industry leaders have.
The two bottlenecks here that I still see are money and manpower. The theory for solving alignment and the theory for designing AGI are closely related, but the practical implementations of these two projects may be sufficiently disjoint that the optimal setup is, e.g., one team working full-time on developing universal interpretability tools while another works full-time on AGI architecture design. If we could hand off the latter part to skilled AI architects (and not expect them to screw it up), that may be a nontrivial speed boost.
Separately, there’s the question of training sets/compute, i.e. money. Do we have enough of it? Suppose that in a decade or two, one of the leading AI labs successfully pushes for a Manhattan Project equivalent, such that they’d be able to blow billions of dollars on training runs. Sure, insights into agency will probably make our AGI less compute-hungry. But will it be cheaper by a large enough margin for us to match that?
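As a toy way of framing that question (every number below is invented purely for illustration): matching their effective training compute requires our efficiency gain to at least cover the budget ratio.

```python
# Toy framing of the compute question above; all numbers are hypothetical.
rival_budget = 10e9     # a Manhattan-Project-scale lab spending ~$10B on training runs
our_budget = 0.5e9      # what we might be able to spend
efficiency_gain = 5     # factor by which insights into agency reduce our compute needs

# We match their effective training compute only if the efficiency gain
# covers the budget gap:
can_match = efficiency_gain >= rival_budget / our_budget
print(can_match)  # False here: a 5x gain doesn't cover a 20x budget gap
```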
> Even if the library is 100% complete, without a flashy demo few people will ever find it.
But what if we have to release a flashy demo to attract attention, so there are now people swarming the already-published research looking for ideas?
We do in fact have access to rather a lot of money; billions of dollars would not be out of the question in a few years, hundreds of millions are already probably available if we have something worthwhile to do with it, and alignment orgs are spending tens of millions already. Though by the time it becomes relevant, I don’t particularly expect today’s dollars → compute → performance curves to apply very well anyway.
> But what if we have to release a flashy demo to attract attention, so there are now people swarming the already-published research looking for ideas?
Also money is a great substitute for attracting attention.
Okay, I’ve thought about it more, and I think my concerns are mainly outlined by this. Less by the post’s actual contents, and more by the post’s existence.
People dislike villains. Whether the concerns Andrew outlines are valid or not, people on the outside will tend to think that such concerns are valid. The hypothetical unilateral-aligned-AGI organization will be, at all times, on the verge of being a target of the entire world. The public would rally against it if the organization’s intentions became public knowledge, other AI labs would be eager to get rid of the competition-slash-threat it presents, and governments would be eager either to seize AI research (if they take AI seriously by that point) or to score political points by squishing something the public and megacorps want squished.
As such, the unilateral path requires a lot of subtle secrecy too. It should not be known that we expect our AI to engage in, uh, full-scale world… optimization. In theory, that connection can be left obscured — most of the people involved can just be allowed to fail to think about what the aligned superintelligence will do once it’s deployed, so there aren’t leaks from low-commitment people joining and quitting the org. But the people in charge will probably have the full picture, and… Well, at this point it sounds like the stupid kind of supervillain doomsday scheme, no?
More practically, I think the ship has already sailed on keeping the sort of secrecy this plan would need to work. I don’t understand why all this talk of pivotal acts has been allowed to enter public discourse by Eliezer et al., but it’ll doubtless be connected to any hypothetical future friendly-AGI org. Probably not by the public/other AI labs directly, but by fellow AI Safety researchers who do not agree with unilateral pivotal acts. And once the concerns have been signal-boosted like that, they may be picked up by the media/politicians/Eliezer’s sneer club/whoever. Once we’re spending billions on training runs and it’s clear that there’s something actually going on beyond a bunch of doom-cult wackos, they will take these concerns seriously and act on them.
A further contributing factor may be increased public awareness of AI Risk in the future, encouraged by general AI capabilities growth, possible (non-omnicidal) AI disasters, and poorly-considered efforts of our own community. (It would be very darkly ironic if AI Safety’s efforts to ban dangerous AI research resulted in governments banning AI Safety’s own AGI research and no one else’s, so that’s probably an attractor in possibility-space, because we live in Hell.)
The bottom line is… This idea seems thermonuclear, in the sense that trying it and getting noticed probably completely dooms us on the spot, and it’d be really hard not to get noticed.
(Though I don’t really buy the whole “pivotal processes” thing either. We can probably increase the timeline this way, but actually making the world’s default systems produce an aligned AI… Nah.)
Fair. I have no more concrete counter-arguments to offer at this time.
I still have a vague sense that acting with the expectation that we’d be able to unilaterally build an AGI is optimistic in a way that dooms us in a nontrivial number of timelines that would’ve been salvageable if we hadn’t assumed that. But maybe that impression is wrong.