Principles of Privacy for Alignment Research
The hard/useful parts of alignment research are largely about understanding agency/intelligence/etc. That sort of understanding naturally yields capabilities-relevant insights. So, alignment researchers naturally run into decisions about how private to keep their work.
This post is a bunch of models which I use to think about that decision.
I am not very confident that my thinking on the matter is very good; in general I do not much trust my own judgment on security matters. I’d be more-than-usually interested to hear others’ thoughts/critiques.
The “Nobody Cares” Model
By default, nobody cares. Memetic reproduction rate is less than 1. Median number of citations (not counting self-citations) is approximately zero, and most citations are from people who didn’t actually read the whole paper but just noticed that it’s vaguely related to their own work. Median number of people who will actually go to any effort whatsoever to use your thing is zero. Getting other people to notice your work at all takes significant effort and is hard even when the work is pretty good. “Nobody cares” is a very strong default, the very large majority of the time.
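To make the sub-unit reproduction rate concrete, here is a minimal branching-process sketch (a toy illustration in Python; the reproduction rate and reader counts are made-up parameters, not measurements of anything): each reader passes the idea on to a small random number of new readers, and because the mean is below 1, total readership stalls at roughly the initial readership divided by (1 − rate).

```python
import math
import random

def poisson(lam):
    """Draw from a Poisson distribution via Knuth's method (fine for small lam)."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        k += 1
        p *= random.random()
        if p <= threshold:
            return k - 1

def total_readers(r_mean=0.6, initial_readers=5, max_steps=1000):
    """Memetic spread as a branching process: each current reader passes the
    idea on to a Poisson(r_mean) number of new readers. With r_mean < 1 the
    cascade dies out, and expected total readership is
    initial_readers / (1 - r_mean)."""
    current = total = initial_readers
    for _ in range(max_steps):
        if current == 0:
            break
        new = sum(poisson(r_mean) for _ in range(current))
        total += new
        current = new
    return total

# With 5 initial readers and a reproduction rate of 0.6, the idea reaches
# about 5 / (1 - 0.6) = 12.5 people on average -- ever.
runs = [total_readers() for _ in range(10_000)]
print(sum(runs) / len(runs))
```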
Privacy, under this model, is very easy. You need to make a very large active effort in order for your research to not end up de-facto limited to you and maybe a few friends/coworkers.
This is the de-facto mechanism by which most theoretical work on alignment avoids advancing capabilities most of the time, and I think it is the main mechanism by which most theoretical work on alignment should aim to avoid advancing capabilities most of the time. It should be the default. But obviously there will be exceptions; when does the “nobody cares” model fail?
Theory-Practice Gap and Flashy Demos
Why, as a general rule, does nobody care? In particular, why does nobody working on AI capabilities care about most work on alignment theory most of the time, given that a lot of it is capabilities-relevant?
Well, even ignoring the (large) chunk of theoretical research which turns out to be useless, the theory-practice gap is a thing. Most theoretical ideas don’t really do much when you translate them to practice. This includes most ideas which sound good to intelligent people. Even those theoretical ideas which do turn out to be useful are typically quite hard to translate into practice. It takes months of work (at least), often additional significant insights, and often additional enabling pieces which aren’t already mainstream or even extant. Practitioners correctly expect this, and therefore mostly don’t pay attention to most ideas until after there’s evidence that they work in practice. (This is especially true of the sort of people who work on ML systems.)
In ML/AI, smart-sounding ideas which don’t really work easily are especially abundant, so ML practitioners are (correctly) even more than usually likely to ignore theoretical work.
The flip side of this model is that people will pay lots of attention once there is clear evidence that some idea works in practice—i.e. evidence that the idea has crossed the theory-practice gap. What does that look like? Flashy demos. Flashy demos are the main signal that the theory-practice gap has already been crossed, which people correctly take to mean that the thing can be useful now.
The theory-practice gap is therefore a defense which both (a) slows down someone actively trying to apply an idea, and (b) makes most ideas very-low-memetic-fitness until they have a flashy demo. To a large extent, one can write freely in public without any flashy demos, and it won’t spread very far memetically (or will spread very slowly if it does).
Reputation
Aside from flashy demos, the other main factor I know of which can draw peoples’ attention is reputation. If someone has a track record of interesting work, high status, or previous flashy demos, then people are more likely to pay attention to their theoretical ideas even before the theory-practice gap is crossed.
Of course this is not relevant to the large majority of people the large majority of time, especially insofar as it involves reputation outside of the alignment research community. That said, if you’re relying on lack-of-reputation for privacy, then you need to avoid gaining too broad a following in the future, which may be an important constraint—more on that in the next section.
Takeaways & Gotchas
Main takeaway of the “nobody cares” model: if you’re not already a person of broad interest outside alignment, and you don’t make any flashy demos, then probably approximately nobody working on ML systems outside of alignment will pay any attention to your work.
… but there are some gotchas.
First, there’s a commitment/time-consistency problem: to the extent that we rely on this model of privacy, we need to precommit to remain uninteresting in the future, at least until we’re confident that our earlier work won’t dangerously accelerate capabilities. If you’re hoping to gain lots of status outside the alignment research community, that won’t play well with a “nobody cares” privacy model. If you’re hoping to show future flashy demos, that won’t play well with a “nobody cares” privacy model. If your future work is very visibly interesting, you may be stuck keeping it secret.
(Though note that, in the vast majority of cases, it will turn out that your earlier theory work was never particularly important for capabilities in the first place, and hopefully you figure that out later. So relying on “nobody caring” now will reduce your later options mainly in worlds where your current work turns out to be unusually important/interesting in its own right.)
Second, relying on “nobody caring” obviously does not yield much defense-in-depth. It’s probably not something we want to rely on for stuff that immediately or directly advances capabilities by a lot.
But for most theoretical alignment work most of the time, where there are some capabilities implications but they’re not very direct or immediately dangerous on their own, I think “nobody cares” is the right privacy model under which to operate. Mostly, theoretical researchers should just not worry much about privacy, as long as (1) they don’t publish flashy demos, (2) they don’t have much name recognition outside alignment, and (3) the things they’re working on won’t immediately or directly advance capabilities by a lot.
Beyond “Nobody Cares”: Danger, Secrecy and Adversaries
Broadly speaking, I see two main categories of reasons for theoretical researchers to go beyond the “nobody cares” model and start to actually think about privacy:
Research which might directly or immediately advance capabilities significantly
Current or anticipated future work which is unusually likely to draw a lot of attention, especially outside the alignment field
These are different failure modes of the “nobody cares” model, and they call for different responses.
The “Keep It To Yourself” Model for Immediately Capabilities-Relevant Research
Under the “nobody cares” model, a small number of people might occasionally pay attention to your research and try to use it, but your research is not memetically fit enough to spread much. For research which might directly or immediately advance capabilities significantly, even a handful of people trying it out is potentially problematic. Those handful might realize there’s a big capability gain and run off to produce a flashy demo.
For research which is directly or immediately capabilities-relevant, we want zero people to publicly try it. The “nobody cares” model is not good enough to robustly achieve that. In these cases, my general policy would be to not publish the research, and possibly not share it with anyone else at all (depending on just how immediately and directly capabilities-relevant it looks).
On the other hand, we don’t necessarily need to be super paranoid about it. In this model, we’re still mostly worried about the research contributing marginally to capabilities; we don’t expect it to immediately produce a full-blown strong AGI. We want to avoid the work spreading publicly, but it’s still not that big a problem if e.g. some government surveillance sees my google docs. Spy agencies, after all, would presumably not publicly share my secrets after stealing them.
The “Active Adversary” Model
… which brings us to the really paranoid end of the spectrum. Under this model, we want to be secure even against active adversaries trying to gain access to our research—e.g. government spy agencies.
I’m not going to give advice about how to achieve this level of security, because I don’t think I’m very good at this kind of paranoia. The main question I’ll focus on is: when do we need highly paranoid levels of security, and when can we get away with less?
As with the other models, someone has to pay attention in order for security to be necessary at all. Even if a government spy agency had a world-class ML research lab (which I doubt is currently the case), they’d presumably ignore most research for the same reasons other ML researchers do. Also, spying is presumably expensive; random theorists/scientists are presumably not worth the cost of having a human examine their work. The sorts of things which I’d expect to draw attention are the same as earlier:
enough of a track record that someone might actually go to the trouble of spying on our work
public demonstration of impressive capabilities, or use of impressive capabilities in a way which will likely be noticed (e.g. stock trading)
Even if we are worried about attention from spies, that still doesn’t mean that most of our work needs high levels of paranoia. The sort of groups who are likely to steal information not meant to be public are not themselves very likely to make that information public. (Well, assuming our dry technical research doesn’t draw the attention of the dreaded Investigative Journalists.) So unless we’re worried that our research will accelerate capabilities to such a dramatic extent that it would enable some government agency to develop dangerous AGI themselves, we probably don’t need to worry about the spies.
The case where we need extreme paranoia is where both (1) an adversary is plausibly likely to pay attention, and (2) our research might allow for immediate and direct and very large capability gains, without any significant theory-practice gap.
This degree of secrecy should hopefully not be needed very often.
Other Considerations
Unilateralist’s Curse
Many people may have the same idea, and it only takes one of them to share it. If all their estimates of the riskiness of the idea have some noise in them, and their risk tolerances have some noise, then presumably it will be the person with an unusually low risk estimate and an unusually high risk tolerance who determines whether the idea is shared.
In general, this sort of thing creates a bias toward unilateral actions being taken even when most people want them to not be taken.
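As a toy illustration of that bias (a hedged sketch; every number below is a hypothetical parameter, not an estimate of real researchers): each of N people forms a noisy estimate of the idea’s riskiness and shares it only if that estimate falls below their own risk tolerance. Even when the average person correctly declines, the chance that at least one person’s estimate dips low enough grows quickly with N.

```python
import random

def prob_someone_shares(n_people, true_risk=0.7, estimate_noise=0.15,
                        tolerance_mean=0.5, tolerance_noise=0.1,
                        n_trials=20_000):
    """Each person shares iff their noisy risk estimate falls below their
    own (noisy) risk tolerance. Returns the estimated probability that at
    least one of n_people shares the idea."""
    shared = 0
    for _ in range(n_trials):
        shared += any(
            random.gauss(true_risk, estimate_noise)
            < random.gauss(tolerance_mean, tolerance_noise)
            for _ in range(n_people)
        )
    return shared / n_trials

for n in (1, 5, 20):
    print(n, round(prob_someone_shares(n), 2))
# Roughly: ~0.13 for one person, ~0.5 for five, ~0.95 for twenty --
# the minimum of many noisy estimates decides, not the average.
```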
On the other hand, unilateralist’s curse is only relevant to an idea which many people have. And if many people have the idea already, then it’s probably not something which can realistically stay secret for very long anyway.
Existing Ideas
In general, if an idea has been talked-about in public at some previous point, then it’s probably fine to talk about again. Your marginal impact on memetic fitness is unlikely to be very large, and if the idea hasn’t already taken off then that’s strong evidence that it isn’t too memetically fit. (Though this does not apply if you are a person with a very large following.)
Alignment Researchers as the Threat
Just because someone’s part of the ingroup does not mean that they won’t push the run button. We don’t have a way to distinguish safe from dangerous programs; our ingroup is not meaningfully more able to do so than the outgroup, and few people in the ingroup are very careful about running python scripts on a day-to-day basis. (I’m certainly not!)
Point is: don’t just assume that it’s fine to share ideas with everyone in the ingroup.
On the other hand, if all we want is for an idea to not spread publicly, then in-group trust is less risky, because group members would burn their reputation by sharing private things.
Differential Alignment/Capabilities Advantage
In the large majority of cases, research is obviously much more relevant to one or the other, and desired privacy levels should be chosen based on that.
I don’t think it’s very productive, in practice, to play the “but it could be relevant to [alignment/capabilities] via [XYZ]” game for things which seem obviously more relevant to capabilities/alignment.
Most Secrecy Is Hopefully Temporary
Most ideas will not dramatically impact capabilities. Usually, we should expect secrecy to be temporary, long enough to check whether a potentially-capabilities-relevant idea is actually short-term relevant (i.e. test it on some limited use-case).
Feedback Please!
Part of the reason I’m posting this is because I have not seen discussion of the topic which feels adequate. I don’t think my own thoughts are clearly correct. So, please argue about it!
I think my threat model is a bit different. I don’t particularly care about the zillions of mediocre ML practitioners who follow things that are hot and/or immediately useful. I do care about the pioneers, who are way ahead of the curve, working to develop the next big idea in AI long before it arrives. These people are not only very insightful themselves, but also can recognize an important insight when they see it, and they’re out hunting for those insights, and they’re not looking in the same places as most people, and in particular they’re not looking at whatever is trending on Twitter or immediately useful.
Let’s try this analogy, maybe: “most impressive AI” ↔ “fastest man-made object”. Let’s say that the current record-holder for fastest man-made object is a train. And right now a competitor is building a better train, that uses new train-track technology. It’s all very exciting, and lots of people are following it in the newspapers. Meanwhile, a pioneer has the idea of building the first-ever rocket ship, but the pioneer is stuck because they need better heat-resistant tiles in order for the rocket-ship prototype to actually work. This pioneer is probably not going to be following the fastest-train news; instead, they’re going to be poring over the obscure literature on heat-resistant tiles. (Sorry for lack of historical or engineering accuracy in the above.) This isn’t a perfect analogy for many reasons, ignore it if you like.
So my ideal model is (1) figure out the whole R&D path(s) to building AGI, (2) don’t tell anyone (or even write it down!), (3) now you know exactly what not to publish, i.e. everything on that path, and it doesn’t matter whether those things would be immediately useful or not, because the pioneers who are already setting out down that path will seek out and find what you’re publishing, even if it’s obscure, because they already have a pretty good idea of what they’re looking for. Of course, that’s easier said than done, especially step (1) :-P
Thinking out loud here...
I do basically buy the “ignore the legions of mediocre ML practitioners, pay attention to the pioneers” model. That does match my modal expectations for how AGI gets built. But:
(1) How do those pioneers find my work, if my work isn’t very memetically fit?
(2) If they encounter my work directly from me (i.e. by reading it on LW), then at that point they’re selected pretty heavily for also finding lots of stuff about alignment.
(3) There’s still the theory-practice gap; our hypothetical pioneer needs to somehow recognize the theory they need without flashy demos to prove its effectiveness.
Thinking about it, these factors are not enough to make me confident that someone won’t use my work to produce an unaligned AGI. On (1), thinking about my personal work, there’s just very little technical work at all on abstraction, so someone who knows to look for technical work on abstraction could very plausibly encounter mine. And that is indeed the sort of thing I’d expect an AGI pioneer to be looking for. On (2), they’d be a lot more likely to encounter my work if they’re already paying attention to alignment, and encountering my work would probably make them more likely to pay attention to alignment even if they weren’t before, but neither of those really rules out unaligned researchers. On (3), I do expect a pioneer to be able to recognize the theory they need without flashy demos to prove it if they spend a few days’ attention on it.
… ok, so I’m basically convinced that I should be thinking about this particular scenario, and the “nobody cares” defense is weaker against the hypothetical pioneers than against most people. I think the “fundamental difference” is that the hypothetical pioneers know what to look for; they’re not just relying on memetic fitness to bring the key ideas to them.
… well fuck, now I need to go propagate this update.
An annoying thing is, just as I sometimes read Yann LeCun or Steven Pinker or Jeff Hawkins, and I extract some bits of insight from them while ignoring all the stupid things they say about the alignment problem, by the same token I imagine other people might read my posts, and extract some bits of insight from me while ignoring all the wise things I say about the alignment problem. :-P
That said, I do definitely put some nonzero weight on those kinds of considerations. :)
More thinking out loud...
On this model, I still basically don’t worry about spy agencies.
What do I really want here? Like, what would be the thing which I most want those pioneers to find? “Nothing” doesn’t actually seem like the right answer; someone who already knows what to look for is going to figure out the key ideas with or without me. What I really want to do is to advertise to those pioneers, in the obscure work which most people will never pay attention to. I want to recruit them.
It feels like the prototypical person I’m defending against here is younger me. What would work on younger me?
Coming back to this 2 years later, and I’m curious about how you’ve changed your mind.
[For the record, here’s previous relevant discussion]
My problem with the “nobody cares” model is that it seems self-defeating. First, if nobody cares about my work, then how would my work help with alignment? I don’t put a lot of stock into building aligned AGI in the basement on my own. (And not only because I don’t have a basement.) Therefore, any impact I will have flows through my work becoming sufficiently known that somebody who builds AGI ends up using it. Even if I optimistically assume that I will personally be part of that project, my work needs to be sufficiently well-known to attract money and talent to make such a project possible.
Second, I also don’t put a lot of stock into solving alignment all by myself. Therefore, other people need to build on my work. In theory, this only requires it to be well-known in the alignment community. But, to improve our chances of solving the problem we need to make the alignment community bigger. We want to attract more talent, much of which is found in the broader computer science community. This is in direct opposition to preserving the conditions for “nobody cares”.
Third, a lot of people are motivated by fame and status (myself included). Therefore, bringing talent into alignment requires the fame and status to be achievable inside the field. This is obviously also in contradiction with “nobody cares”.
My own thinking about this is: yes, progress in the problems I’m working on can contribute to capability research, but the overall chance of success on the pathway “capability advances driven by theoretical insights” is higher than on the pathway “capability advances driven by trial and error”, even if the first leads to AGI sooner, especially if these theoretical insights are also useful for alignment. I certainly don’t want to encourage the use of my work to advance capability, and I try to discourage anyone who would listen, but I accept the inevitable risk of that happening in exchange for the benefits.
Then again, I’m by no means confident that I’m thinking about all of this in the right way.
Our work doesn’t necessarily need wide memetic spread to be found by the people who know what to look for. E.g. people playing through the alignment game tree are a lot more likely to realize that ontology identification, grain-of-truth, value drift, etc, are key questions to ask, whereas ML researchers just pushing toward AGI are a lot less likely to ask those questions.
I do agree that a growing alignment community will add memetic fitness to alignment work in general, which is at least somewhat problematic for the “nobody cares” model. And I do expect there to be at least some steps which need a fairly large alignment community doing “normal” (i.e. paradigmatic) incremental research. For instance, on some paths we need lots of people doing incremental interpretability/ontology research to link up lots of concepts to their representations in a trained system. On the other hand, not all of the foundations need to be very widespread—e.g. in the case of incremental interpretability/ontology research, it’s mostly the interpretability tools which need memetic reach, not e.g. theory around grain-of-truth or value drift.
That’s a valid argument, but I can also imagine groups that (i) in a world where alignment research is obscure proceed to create unaligned AGI (ii) in a world where alignment research is famous, use this research when building their AGI. Maybe any such group would be operationally inadequate anyway, but I’m not sure. More generally, it’s possible that in a world where alignment research is a well-known respectable field of study, more people take AI risk seriously.
I think I have a somewhat different model of the alignment knowledge tree. From my perspective, the research I’m doing is already paradigmatic. I have a solid-enough paradigm, inside which there are many open problems, and what we need is a bunch of people chipping away at these open problems. Admittedly, the size of this “bunch” is still closer to 10 people than to 1000 people, but (i) it’s possible that the open problems will keep multiplying hydra-style, as often happens in math, and (ii) memetic fitness would help get the very best 10 people to do it.
It’s also likely that there will be a “phase II” where the nature of the necessary research becomes very different (e.g. it might involve combining the new theory with neuroscience, or experimental ML research, or hardware engineering), and successful transition to this phase might require getting a lot of new people on board which would also be a lot easier given memetic fitness.
My usual take here is the “Nobody Cares” model, though I think there is one scenario that I tend to be worried about a bit here that you didn’t address, which is how to think about whether or not you want things ending up in the training data for a future AI system. That’s a scenario where the “Nobody Cares” model really doesn’t apply, since the AI actually does have time to look at everything you write.
That being said, I think that, most of the time, alignment work ending up in training data is good, since it can help our AI systems be differentially better at AI alignment research (e.g. relative to how good they are at AI capabilities research), which is something that I think is pretty important. However, it can also help AI systems do things like better understand how to be deceptive, so this sort of thing can be a bit tricky.
Worrying about which alignment writing ends up in the training data feels like a very small lever for affecting alignment; my general heuristic is that we should try to focus on much bigger levers.
Is that because you think it would be hard to get the relevant researchers to exclude any given class of texts from their training datasets [EDIT: or prevent web crawlers from downloading the texts etc.]? Or even if that part was easy, you would still feel that that lever is very small?
Even if that part was easy, it still seems like a very small lever. A system capable of taking over the world will be able to generate those ideas for itself, and a system with strong motivations to take over the world won’t have them changed by small amounts of training text.
It may have more impact at intermediate stages of AI training, where the AI is smart enough to do value reflection and contemplate self-modification/hacking the training process, but not smart enough to immediately figure out all the ways it can go wrong.
E. g., it can come to wrong conclusions about its own values, like humans can, then lock these misunderstandings in by maneuvering the training process this way, or by designing a successor agent with the wrong values. Or it may design sub-agents without understanding the various pitfalls of multi-agent systems, and get taken over by some Goodharter.
I agree that “what goes into the training set” is a minor concern, though, even with regards to influencing the aforementioned dynamic.
Maybe the question here is whether including certain texts in relevant training datasets can cause [language models that pose an x-risk] to be created X months sooner than otherwise.
The relevant texts I’m thinking about here are:
Descriptions of certain tricks to evade our safety measures.
Texts that might cause the ML model to (better) model AIS researchers or potential AIS interventions, or other potential AI systems that the model might cooperate with (or that might “hijack” the model’s logic).
That consideration seems relevant only for language models that will be doing/supporting alignment work.
The problem with this is...
In my model, most useful research is incremental and builds upon itself. As you point out, it’s difficult to foresee how useful what you’re currently working on will be, but if it is useful, your or others’ later work will probably use it as an input.
The successful, fully filled-out alignment research tech tree will necessarily contain crucial capabilities insights. That is, (2) will necessarily be true if we succeed.
At the very end, once we have the alignment solution, we’ll need to ensure that it’s implemented, which means influencing AI Labs and/or government policy, which means becoming visible and impossible-to-ignore by these entities. So (1) will be true as well. Potentially because we’ll have to publish a lot of flashy demos.
In theory, this can be done covertly, by e. g. privately contacting key people and quietly convincing them or something. I wouldn’t rely on us having this kind of subtle skill and coordination.
So, operating under Nobody Cares, you do incremental research. The usefulness of any given piece you’re working on is very dubious 95% of the time, so you hit “submit” 95% of the time. So you keep publishing until you strike gold, until you combine some large body of published research with a novel insight and realize that the result clearly advances capabilities. At last, you can clearly see that (2) is true, so you don’t publish and engage high security. Except… By this point, 95% of the insights necessary for that capabilities leap have already been published. You’re in the unstable situation where the last 5% may be contributed by a random other alignment/capabilities researcher looking at the published 95% at the right angle and posting the last bit without thinking. And whether it’s already the time to influence the AI industry/policy, or it’ll come within a few years, (1) will become true shortly, so there’ll be lots of outsiders poring over your work.
Basically, I’m concerned that following “Nobody Cares” predictably sets us up to fail at the very end. (1) and (2) are not true very often, but we can expect that they will be true if our work proves useful at all.
Not that I have any idea what to do about that.
One part I disagree with: I do not expect that implementing an alignment solution will involve influencing government/labs, conditional on having an alignment solution at all. Reason: alignment requires understanding basically-all the core pieces of intelligence at a sufficiently-detailed level that any team capable of doing it will be very easily capable of building AGI. It is wildly unlikely that a team not capable of building AGI is even remotely capable of solving alignment.
Another part I disagree with: I claim that, if I publish 95% of the insights needed for X, then the average time before somebody besides me or my immediate friends/coworkers implements X goes down by, like, maybe 10%. Even if I publish 100% of the insights, the average time before somebody besides me or my immediate friends/coworkers implements X only goes down by maybe 20%, if I don’t publish any flashy demos.
A concrete example to drive that intuition: imagine a software library which will do something very useful once complete. If the library is 95% complete, nobody uses it, and it’s pretty likely that someone looking to implement the functionality will just start from scratch. Even if the library is 100% complete, without a flashy demo few people will ever find it.
All that said, there is a core to your argument which I do buy. The worlds where our work is useful at all for alignment are also the worlds where our work is most likely to be capabilities relevant. So, I’m most likely to end up regretting publishing something in exactly those worlds where the thing is useful for alignment; I’m making my life harder in exactly those worlds where I might otherwise have succeeded.
Mmm, right, in this case the fact that the rest of the AI industry is being carefree about openly publishing WMD design schematics is actually beneficial to us — our hypothetical AGI group won’t be missing many insights that other industry leaders have.
The two bottlenecks here that I still see are money and manpower. The theory for solving alignment and the theory for designing AGI are closely related, but the practical implementations of these two projects may be sufficiently disjoint — such that the optimal setup is e. g. one team works full-time on developing universal interpretability tools while another works full-time on AGI architecture design. If we could hand off the latter part to skilled AI architects (and not expect them to screw it up), that may be a nontrivial speed boost.
Separately, there’s the question of training sets/compute, i. e. money. Do we have enough of it? Suppose in a decade or two, one of the leading AI Labs successfully pushes for a Manhattan project equivalent, such that they’d be able to blow billions of dollars on training runs. Sure, insights into agency will probably make our AGI less compute-hungry. But will it be cheaper enough that we’d be able to match this?
But what if we have to release a flashy demo to attract attention, so there are now people swarming the already-published research looking for ideas?
We do in fact have access to rather a lot of money; billions of dollars would not be out of the question in a few years, hundreds of millions are already probably available if we have something worthwhile to do with it, and alignment orgs are spending tens of millions already. Though by the time it becomes relevant, I don’t particularly expect today’s dollars → compute → performance curves to apply very well anyway.
Also money is a great substitute for attracting attention.
Okay, I’ve thought about it more, and I think my concerns are mainly outlined by this. Less by the post’s actual contents, and more by the post’s existence.
People dislike villains. Whether the concerns Andrew outlines are valid or not, people on the outside will tend to think that such concerns are valid. The hypothetical unilateral-aligned-AGI organization will be, at all times, on the verge of becoming a target of the entire world. The public would rally against it if the organization’s intentions became public knowledge, other AI Labs would be eager to get rid of the competition slash threat it presents, and governments would be eager either to seize AI research (if they take AI seriously by that point) or to acquire political points by squishing something the public and megacorps want squished.
As such, the unilateral path requires a lot of subtle secrecy too. It should not be known that we expect our AI to engage in, uh, full-scale world… optimization. In theory, that connection can be left obscured — most of the people involved can just be allowed to fail to think about what the aligned superintelligence will do once it’s deployed, so there aren’t leaks from low-commitment people joining and quitting the org. But the people in charge will probably have the full picture, and… Well, at this point it sounds like the stupid kind of supervillain doomsday scheme, no?
More practically, I think the ship has already sailed on keeping the sort of secrecy this plan would need to work. I don’t understand why all this talk of pivotal acts has been allowed to enter public discourse by Eliezer et al., but it’ll doubtless be connected to any hypothetical future friendly-AGI org. Probably not by the public/other AI labs directly, but by fellow AI Safety researchers who do not agree with unilateral pivotal acts. And once the concerns have been signal-boosted like that, they may be picked up by the media/politicians/Eliezer’s sneer club/whoever, and once we’re spending billions on training runs and it’s clear that there’s something actually going on beyond a bunch of doom-cult wackos, they will take these concerns seriously and act on them.
A further contributing factor may be increased public awareness of AI Risk in the future, encouraged by general AI capabilities growth, possible (non-omnicidal) AI disasters, and poorly-considered efforts of our own community. (It would be very darkly ironic if AI Safety’s efforts to ban dangerous AI research resulted in governments banning AI Safety’s own AGI research and no one else’s, so that’s probably an attractor in possibility-space, because we live in Hell.)
The bottom line is… This idea seems thermonuclear, in the sense that trying it and getting noticed probably completely dooms us on the spot, and it’d be really hard not to get noticed.
(Though I don’t really buy the whole “pivotal processes” thing either. We can probably increase the timeline this way, but actually making the world’s default systems produce an aligned AI… Nah.)
Fair. I have no more concrete counter-arguments to offer at this time.
I still have a vague sense that acting with the expectations that we’d be able to unilaterally build an AGI is optimistic in a way that dooms us in a nontrivial number of timelines that would’ve been salvageable if we didn’t assume that. But maybe that impression is wrong.
Suppose, hypothetically, I had a way to make neural networks recognize OOD inputs. (Like I get back 40% dog, 10% cat, 20% OOD, 5% teapot...) Should I run a big ImageNet flashy demo (so I personally know whether the idea scales up) and then tell no one?
There was reasoning that went: any research that has a better alignment/capabilities ratio than the average of all research currently happening is good. A lot of research is pure capabilities, like hardware research. So almost anything with any alignment in it is good. I’m not quite sure if this is a good rule of thumb.
I think I basically don’t buy the “just increase the alignment/capabilities ratio” model, at least on its own. It just isn’t a sufficient condition to not die.
It does feel like there’s a better version of that model waiting to be found, but I’m not sure what it is.
Relevant thread
If Donald is talking about the reasoning from my post, the primary argument there went a bit different. It was about expanding the AI Safety field by converting extant capabilities researchers/projects; and that even if we can’t make them stop capability research, any intervention that 1) doesn’t speed it up and 2) makes them output alignment results alongside capabilities results is net positive.
I think I’d also argued that the AI Safety field is tiny at the moment, so we won’t contribute much to capability research even if we deliberately tried, but in retrospect, that argument is obviously invalid in hypotheticals where we’re actually effective at solving alignment.
Model 1. Your new paper produces c units of capabilities, and a units of alignment. When C units of capabilities are reached, an AI is produced, and it is aligned iff A units of alignment has been produced. The rest of the world produces, and will continue to produce, alignment and capabilities research in ratio R. You are highly uncertain about A and/or C, but have a good guess at a,c,R.
In this model, if A/C ≪ R we are screwed whatever you do, and if A/C ≫ R we win whatever you do. Your paper makes a difference in those worlds where A/C ≈ R, and in those worlds it helps iff a/c > R.
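A quick numerical sketch of that model (hedged; the distributions and the values of R, a, c below are arbitrary assumptions chosen purely for illustration): sample the uncertain thresholds A and C, let the rest of the world accumulate alignment at ratio R until capabilities reach C, and count how often adding your paper’s (a, c) flips the outcome.

```python
import random

def world_is_aligned(A, C, R, a=0.0, c=0.0):
    """Model 1: others produce R units of alignment per unit of capabilities.
    AGI arrives once total capabilities reach C; the outcome is good iff
    total alignment is at least A. Your paper contributes (a, c)."""
    capabilities_from_others = max(C - c, 0.0)
    total_alignment = R * capabilities_from_others + a
    return total_alignment >= A

def effect_of_publishing(R=1.0, a=0.02, c=0.01, n_trials=200_000):
    """Sample uncertain thresholds A and C, and count how often publishing
    flips the outcome for better or for worse."""
    helped = hurt = 0
    for _ in range(n_trials):
        C = random.uniform(1.0, 100.0)        # uncertain capabilities threshold
        A = random.uniform(0.5, 1.5) * R * C  # uncertain alignment threshold, centered near A/C = R
        before = world_is_aligned(A, C, R)
        after = world_is_aligned(A, C, R, a, c)
        helped += (after and not before)
        hurt += (before and not after)
    return helped, hurt

# a/c = 2 > R = 1: the only worlds your paper flips, it flips to "aligned".
print(effect_of_publishing(R=1.0, a=0.02, c=0.01))
# a/c = 0.5 < R = 1: the only worlds it flips, it flips to "unaligned".
print(effect_of_publishing(R=1.0, a=0.01, c=0.02))
```

In both cases the flips are rare: most sampled worlds are decided by whether A/C is above or below R, which is the “only matters in marginal worlds” point.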
This model treats alignment and capabilities as continuous, fungible quantities that slowly accumulate. This is a dubious assumption. It also assumes that, conditional on us being in the marginal world (the world where good and bad outcomes are both very close), your mainline probability involves research continuing at its current ratio.
If, for example, you were extremely pessimistic, and think that the only way we have any chance is if a portal to Dath ilan opens up, then the goal is largely to hold off all research for as long as possible, to maximize the time in which a deus ex machina can happen. Other goals might include publishing the sort of research most likely to encourage a massive global “take AI seriously” movement.
So, the main takeaway is that we need some notion of fungibility/additivity of research progress (for both alignment and capabilities) in order for the “ratio” model to make sense.
Some places fungibility/additivity could come from:
research reducing time-until-threshold-is-reached additively and approximately-independently of other research
probabilistic independence in general
a set of rate-limiting constraints on capabilities/alignment strategies, such that each one must be solved independent of the others (i.e. solving each one does not help solve the others very much)
???
Fungibility is necessary, but not sufficient, for the “if your work has a better ratio than average research, publish” rule. You also need your uncertainty to be in the right place.
If you were certain of R, and uncertain what A/C future research might have, you get a different rule: publish if a/c > R.
I think “alignment/capabilities > 1” is a closer heuristic than “alignment/capabilities > average”, in the sense of ‘[fraction of remaining alignment this solves] / [fraction of remaining capabilities this solves]’. That’s a sufficient condition if all research does it, though not IRL e.g. given pure capabilities research also exists; but I think it’s still a necessary condition for something to be net helpful.
It feels like what’s missing is more like… gears of how to compare “alignment” to “capabilities” applications for a particular piece of research. Like, what’s the thing I should actually be imagining when thinking about that “ratio”?
Thank you for writing this post; I had been struggling with these considerations a while back. I investigated going full paranoid mode but in the end mostly decided against it.
I agree theoretical insights on agency and intelligence have a real chance of leading to capability gains. I agree that the government-spy threat model is unlikely. I would like to add, however, that if, say, MIRI builds a safe AGI prototype—perhaps based on different principles than the systems used by adversaries—it might make sense for an (AI-assisted) adversary to trawl through your old blog posts.
Byrnes has already mentioned the distinction between pioneers and median researchers. Another aspect that your threat models don’t capture is research that builds on your research. Your research may end up in a very long chain of theoretical research, only a minority of which you have contributed. Or the spirit, if not the letter, of your ideas may percolate through the research community. Additionally, the alignment field will almost certainly become very much larger, raising the status of both John and the alignment field in general. Over longer timescales I expect percolation to be quite strong.
Even if approximately nobody reads or knows of your work, the insights may very well become massively signal-boosted by other alignment researchers (once again, I expect the community to explode in size within a decade) and thereby end up in a flashy demo.
All in all, these and other considerations led me to the conclusion that this danger is very real. That is, there is a significant minority of possible worlds in which early alignment researchers tragically contribute to DOOM.
However, I still think that, on the whole, most alignment researchers should work in the open. Any solution to alignment will most likely come from a group of people (albeit a small one). Working privately massively hampers collaboration. It makes the community look weird and makes it way harder to recruit good people. Also, for most researchers it is difficult to support themselves financially if they can’t show their work. Since by far the most likely doom scenario is some company/government simply building AGI without sufficient safeguards (because either there is no alignment solution or they are simply unaware of it or ignore it), I conclude that the best policy in expected value is to work mostly publicly*.
*Of course, if there is a clear path to capability gains, keeping it secret might be best.
EDIT: Cochran has a comical suggestion