Principles of Privacy for Alignment Research
The hard/useful parts of alignment research are largely about understanding agency/intelligence/etc. That sort of understanding naturally yields capabilities-relevant insights. So, alignment researchers naturally run into decisions about how private to keep their work.
This post is a bunch of models which I use to think about that decision.
I am not very confident that my thinking on the matter is very good; in general I do not much trust my own judgment on security matters. I’d be more-than-usually interested to hear others’ thoughts/critiques.
The “Nobody Cares” Model
By default, nobody cares. Memetic reproduction rate is less than 1. Median number of citations (not counting self-citations) is approximately zero, and most citations are from people who didn’t actually read the whole paper but just noticed that it’s vaguely related to their own work. Median number of people who will actually go to any effort whatsoever to use your thing is zero. Getting other people to notice your work at all takes significant effort and is hard even when the work is pretty good. “Nobody cares” is a very strong default, the very large majority of the time.
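To make the sub-unit reproduction rate concrete, here is a minimal branching-process sketch (a toy illustration in Python; the reproduction rate and reader counts are made-up parameters, not measurements of anything): each reader passes the idea on to a small random number of new readers, and because the mean is below 1, total readership stalls at roughly the initial readership divided by (1 − rate).

```python
import math
import random

def poisson(lam):
    """Draw from a Poisson distribution via Knuth's method (fine for small lam)."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        k += 1
        p *= random.random()
        if p <= threshold:
            return k - 1

def total_readers(r_mean=0.6, initial_readers=5, max_steps=1000):
    """Memetic spread as a branching process: each current reader passes the
    idea on to a Poisson(r_mean) number of new readers. With r_mean < 1 the
    cascade dies out, and expected total readership is
    initial_readers / (1 - r_mean)."""
    current = total = initial_readers
    for _ in range(max_steps):
        if current == 0:
            break
        new = sum(poisson(r_mean) for _ in range(current))
        total += new
        current = new
    return total

# With 5 initial readers and a reproduction rate of 0.6, the idea reaches
# about 5 / (1 - 0.6) = 12.5 people on average -- ever.
runs = [total_readers() for _ in range(10_000)]
print(sum(runs) / len(runs))
```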
Privacy, under this model, is very easy. You need to make a very large active effort in order for your research to not end up de-facto limited to you and maybe a few friends/coworkers.
This is the de-facto mechanism by which most theoretical work on alignment avoids advancing capabilities most of the time, and I think it is the main mechanism by which most theoretical work on alignment should aim to avoid advancing capabilities most of the time. It should be the default. But obviously there will be exceptions; when does the “nobody cares” model fail?
Theory-Practice Gap and Flashy Demos
Why, as a general rule, does nobody care? In particular, why does nobody working on AI capabilities care about most work on alignment theory most of the time, given that a lot of it is capabilities-relevant?
Well, even ignoring the (large) chunk of theoretical research which turns out to be useless, the theory-practice gap is a thing. Most theoretical ideas don’t really do much when you translate them to practice. This includes most ideas which sound good to intelligent people. Even those theoretical ideas which do turn out to be useful are typically quite hard to translate into practice. It takes months of work (at least), often additional significant insights, and often additional enabling pieces which aren’t already mainstream or even extant. Practitioners correctly expect this, and therefore mostly don’t pay attention to most ideas until after there’s evidence that they work in practice. (This is especially true of the sort of people who work on ML systems.)
In ML/AI, smart-sounding ideas which don’t really work easily are especially abundant, so ML practitioners are (correctly) even more than usually likely to ignore theoretical work.
The flip side of this model is that people will pay lots of attention once there is clear evidence that some idea works in practice—i.e. evidence that the idea has crossed the theory-practice gap. What does that look like? Flashy demos. Flashy demos are the main signal that the theory-practice gap has already been crossed, which people correctly take to mean that the thing can be useful now.
The theory-practice gap is therefore a defense which both (a) slows down someone actively trying to apply an idea, and (b) makes most ideas very-low-memetic-fitness until they have a flashy demo. To a large extent, one can write freely in public without any flashy demos, and it won’t spread very far memetically (or will spread very slowly if it does).
Reputation
Aside from flashy demos, the other main factor I know of which can draw peoples’ attention is reputation. If someone has a track record of interesting work, high status, or previous flashy demos, then people are more likely to pay attention to their theoretical ideas even before the theory-practice gap is crossed.
Of course this is not relevant to the large majority of people the large majority of time, especially insofar as it involves reputation outside of the alignment research community. That said, if you’re relying on lack-of-reputation for privacy, then you need to avoid gaining too broad a following in the future, which may be an important constraint—more on that in the next section.
Takeaways & Gotchas
Main takeaway of the “nobody cares” model: if you’re not already a person of broad interest outside alignment, and you don’t make any flashy demos, then probably approximately nobody working on ML systems outside of alignment will pay any attention to your work.
… but there are some gotchas.
First, there’s a commitment/time-consistency problem: to the extent that we rely on this model of privacy, we need to precommit to remain uninteresting in the future, at least until we’re confident that our earlier work won’t dangerously accelerate capabilities. If you’re hoping to gain lots of status outside the alignment research community, that won’t play well with a “nobody cares” privacy model. If you’re hoping to show future flashy demos, that won’t play well with a “nobody cares” privacy model. If your future work is very visibly interesting, you may be stuck keeping it secret.
(Though note that, in the vast majority of cases, it will turn out that your earlier theory work was never particularly important for capabilities in the first place, and hopefully you figure that out later. So relying on “nobody caring” now will reduce your later options mainly in worlds where your current work turns out to be unusually important/interesting in its own right.)
Second, relying on “nobody caring” obviously does not yield much defense-in-depth. It’s probably not something we want to rely on for stuff that immediately or directly advances capabilities by a lot.
But for most theoretical alignment work most of the time, where there are some capabilities implications but they’re not very direct or immediately dangerous on their own, I think “nobody cares” is the right privacy model under which to operate. Mostly, theoretical researchers should just not worry much about privacy, as long as (1) they don’t publish flashy demos, (2) they don’t have much name recognition outside alignment, and (3) the things they’re working on won’t immediately or directly advance capabilities by a lot.
Beyond “Nobody Cares”: Danger, Secrecy and Adversaries
Broadly speaking, I see two main categories of reasons for theoretical researchers to go beyond the “nobody cares” model and start to actually think about privacy:
Research which might directly or immediately advance capabilities significantly
Current or anticipated future work which is unusually likely to draw a lot of attention, especially outside the alignment field
These are different failure modes of the “nobody cares” model, and they call for different responses.
The “Keep It To Yourself” Model for Immediately Capabilities-Relevant Research
Under the “nobody cares” model, a small number of people might occasionally pay attention to your research and try to use it, but your research is not memetically fit enough to spread much. For research which might directly or immediately advance capabilities significantly, even a handful of people trying it out is potentially problematic. Those handful might realize there’s a big capability gain and run off to produce a flashy demo.
For research which is directly or immediately capabilities-relevant, we want zero people to publicly try it. The “nobody cares” model is not good enough to robustly achieve that. In these cases, my general policy would be to not publish the research, and possibly not share it with anyone else at all (depending on just how immediately and directly capabilities-relevant it looks).
On the other hand, we don’t necessarily need to be super paranoid about it. In this model, we’re still mostly worried about the research contributing marginally to capabilities; we don’t expect it to immediately produce a full-blown strong AGI. We want to avoid the work spreading publicly, but it’s still not that big a problem if e.g. some government surveillance sees my google docs. Spy agencies, after all, would presumably not publicly share my secrets after stealing them.
The “Active Adversary” Model
… which brings us to the really paranoid end of the spectrum. Under this model, we want to be secure even against active adversaries trying to gain access to our research—e.g. government spy agencies.
I’m not going to give advice about how to achieve this level of security, because I don’t think I’m very good at this kind of paranoia. The main question I’ll focus on is: when do we need highly paranoid levels of security, and when can we get away with less?
As with the other models, someone has to pay attention in order for security to be necessary at all. Even if a government spy agency had a world-class ML research lab (which I doubt is currently the case), they’d presumably ignore most research for the same reasons other ML researchers do. Also, spying is presumably expensive; random theorists/scientists are presumably not worth the cost of having a human examine their work. The sorts of things which I’d expect to draw attention are the same as earlier:
enough of a track record that someone might actually go to the trouble of spying on our work
public demonstration of impressive capabilities, or use of impressive capabilities in a way which will likely be noticed (e.g. stock trading)
Even if we are worried about attention from spies, that still doesn’t mean that most of our work needs high levels of paranoia. The sort of groups who are likely to steal information not meant to be public are not themselves very likely to make that information public. (Well, assuming our dry technical research doesn’t draw the attention of the dreaded Investigative Journalists.) So unless we’re worried that our research will accelerate capabilities to such a dramatic extent that it would enable some government agency to develop dangerous AGI themselves, we probably don’t need to worry about the spies.
The case where we need extreme paranoia is where both (1) an adversary is plausibly likely to pay attention, and (2) our research might allow for immediate and direct and very large capability gains, without any significant theory-practice gap.
This degree of secrecy should hopefully not be needed very often.
Other Considerations
Unilateralist’s Curse
Many people may have the same idea, and it only takes one of them to share it. If all their estimates of the riskiness of the idea have some noise in them, and their risk tolerances have some noise, then presumably it will be the person with an unusually low risk estimate and an unusually high risk tolerance who determines whether the idea is shared.
In general, this sort of thing creates a bias toward unilateral actions being taken even when most people want them to not be taken.
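As a toy illustration of that bias (a hedged sketch; every number below is a hypothetical parameter, not an estimate of real researchers): each of N people forms a noisy estimate of the idea’s riskiness and shares it only if that estimate falls below their own risk tolerance. Even when the average person correctly declines, the chance that at least one person’s estimate dips low enough grows quickly with N.

```python
import random

def prob_someone_shares(n_people, true_risk=0.7, estimate_noise=0.15,
                        tolerance_mean=0.5, tolerance_noise=0.1,
                        n_trials=20_000):
    """Each person shares iff their noisy risk estimate falls below their
    own (noisy) risk tolerance. Returns the estimated probability that at
    least one of n_people shares the idea."""
    shared = 0
    for _ in range(n_trials):
        shared += any(
            random.gauss(true_risk, estimate_noise)
            < random.gauss(tolerance_mean, tolerance_noise)
            for _ in range(n_people)
        )
    return shared / n_trials

for n in (1, 5, 20):
    print(n, round(prob_someone_shares(n), 2))
# Roughly: ~0.13 for one person, ~0.5 for five, ~0.95 for twenty --
# the minimum of many noisy estimates decides, not the average.
```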
On the other hand, unilateralist’s curse is only relevant to an idea which many people have. And if many people have the idea already, then it’s probably not something which can realistically stay secret for very long anyway.
Existing Ideas
In general, if an idea has been talked-about in public at some previous point, then it’s probably fine to talk about again. Your marginal impact on memetic fitness is unlikely to be very large, and if the idea hasn’t already taken off then that’s strong evidence that it isn’t too memetically fit. (Though this does not apply if you are a person with a very large following.)
Alignment Researchers as the Threat
Just because someone’s part of the ingroup does not mean that they won’t push the run button. We don’t have a way to distinguish safe from dangerous programs; our ingroup is not meaningfully more able to do so than the outgroup, and few people in the ingroup are very careful about running python scripts on a day-to-day basis. (I’m certainly not!)
Point is: don’t just assume that it’s fine to share ideas with everyone in the ingroup.
On the other hand, if all we want is for an idea to not spread publicly, then in-group trust is less risky, because group members would burn their reputation by sharing private things.
Differential Alignment/Capabilities Advantage
In the large majority of cases, research is obviously much more relevant to one or the other, and desired privacy levels should be chosen based on that.
I don’t think it’s very productive, in practice, to play the “but it could be relevant to [alignment/capabilities] via [XYZ]” game for things which seem obviously more relevant to capabilities/alignment.
Most Secrecy Is Hopefully Temporary
Most ideas will not dramatically impact capabilities. Usually, we should expect secrecy to be temporary, long enough to check whether a potentially-capabilities-relevant idea is actually short-term relevant (i.e. test it on some limited use-case).
Feedback Please!
Part of the reason I’m posting this is because I have not seen discussion of the topic which feels adequate. I don’t think my own thoughts are clearly correct. So, please argue about it!
I think my threat model is a bit different. I don’t particularly care about the zillions of mediocre ML practitioners who follow things that are hot and/or immediately useful. I do care about the pioneers, who are way ahead of the curve, working to develop the next big idea in AI long before it arrives. These people are not only very insightful themselves, but also can recognize an important insight when they see it, and they’re out hunting for those insights, and they’re not looking in the same places as most people, and in particular they’re not looking at whatever is trending on Twitter or immediately useful.
Let’s try this analogy, maybe: “most impressive AI” ↔ “fastest man-made object”. Let’s say that the current record-holder for fastest man-made object is a train. And right now a competitor is building a better train, that uses new train-track technology. It’s all very exciting, and lots of people are following it in the newspapers. Meanwhile, a pioneer has the idea of building the first-ever rocket ship, but the pioneer is stuck because they need better heat-resistant tiles in order for the rocket-ship prototype to actually work. This pioneer is probably not going to be following the fastest-train news; instead, they’re going to be poring over the obscure literature on heat-resistant tiles. (Sorry for lack of historical or engineering accuracy in the above.) This isn’t a perfect analogy for many reasons, ignore it if you like.
So my ideal model is (1) figure out the whole R&D path(s) to building AGI, (2) don’t tell anyone (or even write it down!), (3) now you know exactly what not to publish, i.e. everything on that path, and it doesn’t matter whether those things would be immediately useful or not, because the pioneers who are already setting out down that path will seek out and find what you’re publishing, even if it’s obscure, because they already have a pretty good idea of what they’re looking for. Of course, that’s easier said than done, especially step (1) :-P
Thinking out loud here...
I do basically buy the “ignore the legions of mediocre ML practitioners, pay attention to the pioneers” model. That does match my modal expectations for how AGI gets built. But:
(1) How do those pioneers find my work, if my work isn’t very memetically fit?
(2) If they encounter my work directly from me (i.e. by reading it on LW), then at that point they’re selected pretty heavily for also finding lots of stuff about alignment.
(3) There’s still the theory-practice gap; our hypothetical pioneer needs to somehow recognize the theory they need without flashy demos to prove its effectiveness.
Thinking about it, these factors are not enough to make me confident that someone won’t use my work to produce an unaligned AGI. On (1), thinking about my personal work, there’s just very little technical work at all on abstraction, so someone who knows to look for technical work on abstraction could very plausibly encounter mine. And that is indeed the sort of thing I’d expect an AGI pioneer to be looking for. On (2), they’d be a lot more likely to encounter my work if they’re already paying attention to alignment, and encountering my work would probably make them more likely to pay attention to alignment even if they weren’t before, but neither of those really rules out unaligned researchers. On (3), I do expect a pioneer to be able to recognize the theory they need without flashy demos to prove it if they spend a few days’ attention on it.
… ok, so I’m basically convinced that I should be thinking about this particular scenario, and the “nobody cares” defense is weaker against the hypothetical pioneers than against most people. I think the “fundamental difference” is that the hypothetical pioneers know what to look for; they’re not just relying on memetic fitness to bring the key ideas to them.
… well fuck, now I need to go propagate this update.
An annoying thing is, just as I sometimes read Yann LeCun or Steven Pinker or Jeff Hawkins, and I extract some bits of insight from them while ignoring all the stupid things they say about the alignment problem, by the same token I imagine other people might read my posts, and extract some bits of insight from me while ignoring all the wise things I say about the alignment problem. :-P
That said, I do definitely put some nonzero weight on those kinds of considerations. :)
More thinking out loud...
On this model, I still basically don’t worry about spy agencies.
What do I really want here? Like, what would be the thing which I most want those pioneers to find? “Nothing” doesn’t actually seem like the right answer; someone who already knows what to look for is going to figure out the key ideas with or without me. What I really want to do is to advertise to those pioneers, in the obscure work which most people will never pay attention to. I want to recruit them.
It feels like the prototypical person I’m defending against here is younger me. What would work on younger me?
Coming back to this 2 years later, and I’m curious about how you’ve changed your mind.
[For the record, here’s previous relevant discussion]
My problem with the “nobody cares” model is that it seems self-defeating. First, if nobody cares about my work, then how would my work help with alignment? I don’t put a lot of stock into building aligned AGI in the basement on my own. (And not only because I don’t have a basement.) Therefore, any impact I will have flows through my work becoming sufficiently known that somebody who builds AGI ends up using it. Even if I optimistically assume that I will personally be part of that project, my work needs to be sufficiently well-known to attract money and talent to make such a project possible.
Second, I also don’t put a lot of stock into solving alignment all by myself. Therefore, other people need to build on my work. In theory, this only requires it to be well-known in the alignment community. But, to improve our chances of solving the problem we need to make the alignment community bigger. We want to attract more talent, much of which is found in the broader computer science community. This is in direct opposition to preserving the conditions for “nobody cares”.
Third, a lot of people are motivated by fame and status (myself included). Therefore, bringing talent into alignment requires the fame and status to be achievable inside the field. This is obviously also in contradiction with “nobody cares”.
My own thinking about this is: yes, progress in the problems I’m working on can contribute to capability research, but the overall chance of success on the pathway “capability advances driven by theoretical insights” is higher than on the pathway “capability advances driven by trial and error”, even if the first leads to AGI sooner, especially if these theoretical insights are also useful for alignment. I certainly don’t want to encourage the use of my work to advance capability, and I try to discourage anyone who would listen, but I accept the inevitable risk of that happening in exchange for the benefits.
Then again, I’m by no means confident that I’m thinking about all of this in the right way.
Our work doesn’t necessarily need wide memetic spread to be found by the people who know what to look for. E.g. people playing through the alignment game tree are a lot more likely to realize that ontology identification, grain-of-truth, value drift, etc, are key questions to ask, whereas ML researchers just pushing toward AGI are a lot less likely to ask those questions.
I do agree that a growing alignment community will add memetic fitness to alignment work in general, which is at least somewhat problematic for the “nobody cares” model. And I do expect there to be at least some steps which need a fairly large alignment community doing “normal” (i.e. paradigmatic) incremental research. For instance, on some paths we need lots of people doing incremental interpretability/ontology research to link up lots of concepts to their representations in a trained system. On the other hand, not all of the foundations need to be very widespread—e.g. in the case of incremental interpretability/ontology research, it’s mostly the interpretability tools which need memetic reach, not e.g. theory around grain-of-truth or value drift.
That’s a valid argument, but I can also imagine groups that (i) in a world where alignment research is obscure proceed to create unaligned AGI (ii) in a world where alignment research is famous, use this research when building their AGI. Maybe any such group would be operationally inadequate anyway, but I’m not sure. More generally, it’s possible that in a world where alignment research is a well-known respectable field of study, more people take AI risk seriously.
I think I have a somewhat different model of the alignment knowledge tree. From my perspective, the research I’m doing is already paradigmatic. I have a solid-enough paradigm, inside which there are many open problems, and what we need is a bunch of people chipping away at these open problems. Admittedly, the size of this “bunch” is still closer to 10 people than to 1000 people, but (i) it’s possible that the open problems will keep multiplying hydra-style, as often happens in math, and (ii) memetic fitness would help get the very best 10 people to do it.
It’s also likely that there will be a “phase II” where the nature of the necessary research becomes very different (e.g. it might involve combining the new theory with neuroscience, or experimental ML research, or hardware engineering), and successful transition to this phase might require getting a lot of new people on board which would also be a lot easier given memetic fitness.
My usual take here is the “Nobody Cares” model, though I think there is one scenario that I tend to be worried about a bit here that you didn’t address, which is how to think about whether or not you want things ending up in the training data for a future AI system. That’s a scenario where the “Nobody Cares” model really doesn’t apply, since the AI actually does have time to look at everything you write.
That being said, I think that, most of the time, alignment work ending up in training data is good, since it can help our AI systems be differentially better at AI alignment research (e.g. relative to how good they are at AI capabilities research), which is something that I think is pretty important. However, it can also help AI systems do things like better understand how to be deceptive, so this sort of thing can be a bit tricky.
Worrying about which alignment writing ends up in the training data feels like a very small lever for affecting alignment; my general heuristic is that we should try to focus on much bigger levers.
Is that because you think it would be hard to get the relevant researchers to exclude any given class of texts from their training datasets [EDIT: or prevent web crawlers from downloading the texts etc.]? Or even if that part was easy, you would still feel that that lever is very small?
Even if that part was easy, it still seems like a very small lever. A system capable of taking over the world will be able to generate those ideas for itself, and a system with strong motivations to take over the world won’t have them changed by small amounts of training text.
It may have more impact at intermediate stages of AI training, where the AI is smart enough to do value reflection and contemplate self-modification/hacking the training process, but not smart enough to immediately figure out all the ways it can go wrong.
E. g., it can come to wrong conclusions about its own values, like humans can, then lock these misunderstandings in by maneuvering the training process this way, or by designing a successor agent with the wrong values. Or it may design sub-agents without understanding the various pitfalls of multi-agent systems, and get taken over by some Goodharter.
I agree that “what goes into the training set” is a minor concern, though, even with regards to influencing the aforementioned dynamic.
Maybe the question here is whether including certain texts in relevant training datasets can cause [language models that pose an x-risk] to be created X months sooner than otherwise.
The relevant texts I’m thinking about here are:
Descriptions of certain tricks to evade our safety measures.
Texts that might cause the ML model to (better) model AIS researchers or potential AIS interventions, or other potential AI systems that the model might cooperate with (or that might “hijack” the model’s logic).
That consideration seems relevant only for language models that will be doing/supporting alignment work.
The problem with this is...
In my model, most useful research is incremental and builds upon itself. As you point out, it’s difficult to foresee how useful what you’re currently working on will be, but if it is useful, your or others’ later work will probably use it as an input.
The successful, fully filled-out alignment research tech tree will necessarily contain crucial capabilities insights. That is, (2) will necessarily be true if we succeed.
At the very end, once we have the alignment solution, we’ll need to ensure that it’s implemented, which means influencing AI Labs and/or government policy, which means becoming visible and impossible-to-ignore by these entities. So (1) will be true as well. Potentially because we’ll have to publish a lot of flashy demos.
In theory, this can be done covertly, by e. g. privately contacting key people and quietly convincing them or something. I wouldn’t rely on us having this kind of subtle skill and coordination.
So, operating under Nobody Cares, you do incremental research. The usefulness of any given piece you’re working on is very dubious 95% of the time, so you hit “submit” 95% of the time. So you keep publishing until you strike gold, until you combine some large body of published research with a novel insight and realize that the result clearly advances capabilities. At last, you can clearly see that (2) is true, so you don’t publish and engage high security. Except… By this point, 95% of the insights necessary for that capabilities leap have already been published. You’re in the unstable situation where the last 5% may be contributed by a random other alignment/capabilities researcher looking at the published 95% at the right angle and posting the last bit without thinking. And whether it’s already the time to influence the AI industry/policy, or it’ll come within a few years, (1) will become true shortly, so there’ll be lots of outsiders poring over your work.
Basically, I’m concerned that following “Nobody Cares” predictably sets us up to fail at the very end. (1) and (2) are not true very often, but we can expect that they will be true if our work proves useful at all.
Not that I have any idea what to do about that.
One part I disagree with: I do not expect that implementing an alignment solution will involve influencing government/labs, conditional on having an alignment solution at all. Reason: alignment requires understanding basically-all the core pieces of intelligence at a sufficiently-detailed level that any team capable of doing it will be very easily capable of building AGI. It is wildly unlikely that a team not capable of building AGI is even remotely capable of solving alignment.
Another part I disagree with: I claim that, if I publish 95% of the insights needed for X, then the average time before somebody besides me or my immediate friends/coworkers implements X goes down by, like, maybe 10%. Even if I publish 100% of the insights, the average time before somebody besides me or my immediate friends/coworkers implements X only goes down by maybe 20%, if I don’t publish any flashy demos.
A concrete example to drive that intuition: imagine a software library which will do something very useful once complete. If the library is 95% complete, nobody uses it, and it’s pretty likely that someone looking to implement the functionality will just start from scratch. Even if the library is 100% complete, without a flashy demo few people will ever find it.
All that said, there is a core to your argument which I do buy. The worlds where our work is useful at all for alignment are also the worlds where our work is most likely to be capabilities relevant. So, I’m most likely to end up regretting publishing something in exactly those worlds where the thing is useful for alignment; I’m making my life harder in exactly those worlds where I might otherwise have succeeded.
Mmm, right, in this case the fact that the rest of the AI industry is being carefree about openly publishing WMD design schematics is actually beneficial to us — our hypothetical AGI group won’t be missing many insights that other industry leaders have.
The two bottlenecks here that I still see are money and manpower. The theory for solving alignment and the theory for designing AGI are closely related, but the practical implementations of these two projects may be sufficiently disjoint — such that the optimal setup is e. g. one team works full-time on developing universal interpretability tools while another works full-time on AGI architecture design. If we could hand off the latter part to skilled AI architects (and not expect them to screw it up), that may be a nontrivial speed boost.
Separately, there’s the question of training sets/compute, i. e. money. Do we have enough of it? Suppose in a decade or two, one of the leading AI Labs successfully pushes for a Manhattan project equivalent, such that they’d be able to blow billions of dollars on training runs. Sure, insights into agency will probably make our AGI less compute-hungry. But will it be cheaper enough that we’d be able to match this?
But what if we have to release a flashy demo to attract attention, so there are now people swarming the already-published research looking for ideas?
We do in fact have access to rather a lot of money; billions of dollars would not be out of the question in a few years, hundreds of millions are already probably available if we have something worthwhile to do with it, and alignment orgs are spending tens of millions already. Though by the time it becomes relevant, I don’t particularly expect today’s dollars → compute → performance curves to apply very well anyway.
Also money is a great substitute for attracting attention.
Okay, I’ve thought about it more, and I think my concerns are mainly outlined by this. Less by the post’s actual contents, and more by the post’s existence.
People dislike villains. Whether the concerns Andrew outlines are valid or not, people on the outside will tend to think that such concerns are valid. The hypothetical unilateral-aligned-AGI organization will be, at all times, on the verge of becoming a target of the entire world. The public would rally against it if the organization’s intentions became public knowledge, other AI Labs would be eager to get rid of the competition slash threat it presents, and governments would be eager either to seize AI research (if they take AI seriously by that point) or to acquire political points by squishing something the public and megacorps want squished.
As such, the unilateral path requires a lot of subtle secrecy too. It should not be known that we expect our AI to engage in, uh, full-scale world… optimization. In theory, that connection can be left obscured — most of the people involved can just be allowed to fail to think about what the aligned superintelligence will do once it’s deployed, so there aren’t leaks from low-commitment people joining and quitting the org. But the people in charge will probably have the full picture, and… Well, at this point it sounds like the stupid kind of supervillain doomsday scheme, no?
More practically, I think the ship has already sailed on keeping the sort of secrecy this plan would need to work. I don’t understand why all this talk of pivotal acts has been allowed to enter public discourse by Eliezer et al., but it’ll doubtless be connected to any hypothetical future friendly-AGI org. Probably not by the public/other AI labs directly, but by fellow AI Safety researchers who do not agree with unilateral pivotal acts. And once the concerns have been signal-boosted like that, they may be picked up by the media/politicians/Eliezer’s sneer club/whoever, and once we’re spending billions on training runs and it’s clear that there’s something actually going on beyond a bunch of doom-cult wackos, they will take these concerns seriously and act on them.
A further contributing factor may be increased public awareness of AI Risk in the future, encouraged by general AI capabilities growth, possible (non-omnicidal) AI disasters, and poorly-considered efforts of our own community. (It would be very darkly ironic if AI Safety’s efforts to ban dangerous AI research resulted in governments banning AI Safety’s own AGI research and no one else’s, so that’s probably an attractor in possibility-space, because we live in Hell.)
The bottom line is… This idea seems thermonuclear, in the sense that trying it and getting noticed probably completely dooms us on the spot, and it’d be really hard not to get noticed.
(Though I don’t really buy the whole “pivotal processes” thing either. We can probably increase the timeline this way, but actually making the world’s default systems produce an aligned AI… Nah.)
Fair. I have no more concrete counter-arguments to offer at this time.
I still have a vague sense that acting with the expectations that we’d be able to unilaterally build an AGI is optimistic in a way that dooms us in a nontrivial number of timelines that would’ve been salvageable if we didn’t assume that. But maybe that impression is wrong.
Suppose, hypothetically, I had a way to make neural networks recognize OOD inputs. (Like I get back 40% dog, 10% cat, 20% OOD, 5% teapot...) Should I run a big ImageNet flashy demo (so I personally know whether the idea scales up) and then tell no one?
There was reasoning that went: any research that has a better alignment/capabilities ratio than the average of all research currently happening is good. A lot of research is pure capabilities, like hardware research. So almost anything with any alignment in it is good. I’m not quite sure if this is a good rule of thumb.
I think I basically don’t buy the “just increase the alignment/capabilities ratio” model, at least on its own. It just isn’t a sufficient condition to not die.
It does feel like there’s a better version of that model waiting to be found, but I’m not sure what it is.
Relevant thread
If Donald is talking about the reasoning from my post, the primary argument there went a bit different. It was about expanding the AI Safety field by converting extant capabilities researchers/projects; and that even if we can’t make them stop capability research, any intervention that 1) doesn’t speed it up and 2) makes them output alignment results alongside capabilities results is net positive.
I think I’d also argued that the AI Safety field is tiny at the moment, so we won’t contribute much to capability research even if we deliberately tried, but in retrospect, that argument is obviously invalid in hypotheticals where we’re actually effective at solving alignment.
Model 1. Your new paper produces c units of capabilities, and a units of alignment. When C units of capabilities are reached, an AI is produced, and it is aligned iff A units of alignment has been produced. The rest of the world produces, and will continue to produce, alignment and capabilities research in ratio R. You are highly uncertain about A and/or C, but have a good guess at a,c,R.
In this model, if A/C ≪ R we are screwed whatever you do, and if A/C ≫ R we win whatever you do. Your paper makes a difference in those worlds where A/C ≈ R, and in those worlds it helps iff a/c > R.
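A quick numerical sketch of that model (hedged; the distributions and the values of R, a, c below are arbitrary assumptions chosen purely for illustration): sample the uncertain thresholds A and C, let the rest of the world accumulate alignment at ratio R until capabilities reach C, and count how often adding your paper’s (a, c) flips the outcome.

```python
import random

def world_is_aligned(A, C, R, a=0.0, c=0.0):
    """Model 1: others produce R units of alignment per unit of capabilities.
    AGI arrives once total capabilities reach C; the outcome is good iff
    total alignment is at least A. Your paper contributes (a, c)."""
    capabilities_from_others = max(C - c, 0.0)
    total_alignment = R * capabilities_from_others + a
    return total_alignment >= A

def effect_of_publishing(R=1.0, a=0.02, c=0.01, n_trials=200_000):
    """Sample uncertain thresholds A and C, and count how often publishing
    flips the outcome for better or for worse."""
    helped = hurt = 0
    for _ in range(n_trials):
        C = random.uniform(1.0, 100.0)        # uncertain capabilities threshold
        A = random.uniform(0.5, 1.5) * R * C  # uncertain alignment threshold, centered near A/C = R
        before = world_is_aligned(A, C, R)
        after = world_is_aligned(A, C, R, a, c)
        helped += (after and not before)
        hurt += (before and not after)
    return helped, hurt

# a/c = 2 > R = 1: the only worlds your paper flips, it flips to "aligned".
print(effect_of_publishing(R=1.0, a=0.02, c=0.01))
# a/c = 0.5 < R = 1: the only worlds it flips, it flips to "unaligned".
print(effect_of_publishing(R=1.0, a=0.01, c=0.02))
```

In both cases the flips are rare: most sampled worlds are decided by whether A/C is above or below R, which is the “only matters in marginal worlds” point.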
This model treats alignment and capabilities as continuous, fungible quantities that slowly accumulate. This is a dubious assumption. It also assumes that, conditional on us being in the marginal world (the world where good and bad outcomes are both very close), your mainline probability involves research continuing at its current ratio.
If, for example, you were extremely pessimistic, and think that the only way we have any chance is if a portal to Dath ilan opens up, then the goal is largely to hold off all research for as long as possible, to maximize the time in which a deus ex machina can happen. Other goals might include publishing the sort of research most likely to encourage a massive global “take AI seriously” movement.
So, the main takeaway is that we need some notion of fungibility/additivity of research progress (for both alignment and capabilities) in order for the “ratio” model to make sense.
Some places fungibility/additivity could come from:
research reducing time-until-threshold-is-reached additively and approximately-independently of other research
probabilistic independence in general
a set of rate-limiting constraints on capabilities/alignment strategies, such that each one must be solved independent of the others (i.e. solving each one does not help solve the others very much)
???
Fungibility is necessary, but not sufficient, for the “if your work has a better ratio than average research, publish” rule. You also need your uncertainty to be in the right place.
If you were certain of R, and uncertain what A/C future research might have, you get a different rule: publish if a/c > R.
I think “alignment/capabilities > 1” is a closer heuristic than “alignment/capabilities > average”, in the sense of ‘[fraction of remaining alignment this solves] / [fraction of remaining capabilities this solves]’. That’s a sufficient condition if all research does it, though not IRL e.g. given pure capabilities research also exists; but I think it’s still a necessary condition for something to be net helpful.
It feels like what’s missing is more like… gears of how to compare “alignment” to “capabilities” applications for a particular piece of research. Like, what’s the thing I should actually be imagining when thinking about that “ratio”?
Thank you for writing this post; I had been struggling with these considerations a while back. I investigated going full paranoid mode but in the end mostly decided against it.
I agree theoretical insights on agency and intelligence have a real chance of leading to capability gains. I agree that the government-spy threat model is unlikely. I would like to add, however, that if, say, MIRI builds a safe AGI prototype—perhaps based on different principles than the systems used by adversaries—it might make sense for an (AI-assisted) adversary to trawl through your old blog posts.
Byrnes has already mentioned the distinction between pioneers and median researchers. Another aspect that your threat models don’t capture is research that builds on your research. Your research may end up in a very long chain of theoretical research, only a minority of which you have contributed. Or the spirit, if not the letter, of your ideas may percolate through the research community. Additionally, the alignment field will almost certainly become very much larger, raising the status of both John and the alignment field in general. Over longer timescales I expect percolation to be quite strong.
Even if approximately nobody reads or knows of your work, the insights may very well become massively signal-boosted by other alignment researchers (once again, I expect the community to explode in size within a decade) and thereby end up in a flashy demo.
All in all, these and other considerations led me to the conclusion that this danger is very real. That is, there is a significant minority of possible worlds in which early alignment researchers tragically contribute to DOOM.
However, I still think that, on the whole, most alignment researchers should work in the open. Any solution to alignment will most likely come from a group of people (albeit a small one). Working privately massively hampers collaboration. It makes the community look weird and makes it way harder to recruit good people. Also, for most researchers it is difficult to support themselves financially if they can’t show their work. Since by far the most likely doom scenario is some company/government simply building AGI without sufficient safeguards (because either there is no alignment solution or they are simply unaware of it or ignore it), I conclude that the best policy in expected value is to work mostly publicly*.
*Of course, if there is a clear path to capability gains, keeping it secret might be best.
EDIT: Cochran has a comical suggestion