This is a linkpost to an essay by Beren on some severe problems that come with the use of the infohazard concept on LW.
The main problem I see with infohazards is that the concept encourages a “Great Man Theory” of progress in science, which is basically false: even given vast disparities in individual ability, it’s very rare for a person or small group to single-handedly solve scientific fields or problems by themselves. And the culture of AI safety, especially the parts influenced by MIRI, already has a bit of a problem with applying the “Great Man Theory” too liberally.
There are other severe problems with infohazards that cripple the AI safety community, but the encouragement of Great Man Theories of scientific progress is the most noteworthy one to me, even though it doesn’t necessarily have the biggest impact on AI safety compared to the other problems.
Part of Beren’s post is quoted below:
Infohazards assume an incorrect model of scientific progress
One issue I have with the culture of AI safety and alignment in general is that it often presupposes too much of a “great man” theory of progress – the idea that there will be a single ‘genius’ who solves ‘The Problem’ of alignment and that everything else has a relatively small impact. This is not how scientific fields develop in real life. While there are certainly very large individual differences in performance, and a log-normal distribution of impact, with outliers having vastly more impact than the median, nevertheless in almost all scientific fields progress is highly distributed – single individuals very rarely completely solve entire fields themselves.
Solving alignment seems unlikely to be different a-priori, and appears to require a deep and broad understanding of how deep learning and neural networks function and generalize, as well as significant progress in understanding their internal representations, and learned goals. In addition, there must likely be large code infrastructures built up around monitoring and testing of powerful AI systems and a sensible system of multilateral AI regulation between countries. This is not the kind of thing that can be invented by a lone genius from scratch in a cave. This is a problem that requires a large number of very smart people building on each other’s ideas and outputs over a long period of time, like any normal science or technological endeavor. This is why having widespread adoption of the ideas and problems of alignment, as well as dissemination of technical work, is crucial.
This is also why some of the ideas proposed to fix some of the issues caused by infohazard norms fall flat. For instance, to get feedback, it is often proposed to have a group of trusted insiders who have access to all the infohazardous information and can build on it themselves. However, not only is such a group likely to just get overloaded with adjudicating infohazard requests, but we should naturally not expect the vast majority of insights to come from a small recognizable group of people at the beginning of the field. The existing set of ‘trusted alignment people’ is strongly unlikely to generate all, or even a majority, of the insights required to successfully align superhuman AI systems in the real world. Even Einstein – the archetypal lone genius – who was at the time a random patent clerk in Switzerland far from the center of the action – would not have been able to make any discoveries if all theoretical physics research of the time was held to be ‘infohazardous’ and only circulated privately among the physics professors of a few elite universities at the time. Indeed, it is highly unlikely that in such a scenario much theoretical physics would have been done at all.
Similarly, take the case in ML. The vast majority of advancements in current ML come from a widely distributed network of contributors in academia and industry. If knowledge of all advancements was restricted to the set of ML experts in 2012 when AlexNet was published, this would have prevented almost everybody who has since contributed to ML from entering the field and slowed progress down immeasurably. Of course there is naturally a power-law distribution of impact where a few individuals show outlier productivity and impact; however, progress in almost all scientific fields is extremely distributed and not confined to a few geniuses who originate the vast majority of the inventions.
Another way to think about this is that the AI capabilities research ‘market’ is currently much more efficient than the AI safety market. There are a lot more capabilities researchers between industry and academia than safety researchers. The AI capabilities researchers have zero problem sharing their work and building off the work of others – ML academia directly incentivises this and, until recently it seems, so did the promotion practices of most industry labs. Capabilities researchers also tend to get significantly stronger empirical feedback loops than a lot of alignment research and, generally, better mentorship and experience in actually conducting science. This naturally leads to much faster capabilities progress than alignment progress. Having strict infohazard norms and locking down knowledge of new advances to tiny groups of people currently at the top of the alignment status hierarchy further weakens the epistemics of the alignment community and significantly increases the barriers to entry – which is exactly the opposite of what we want. We need to be making the alignment research market more efficient, with fewer barriers to research dissemination and access than capabilities, if we want to out-progress them. Strict infohazard norms move things in the wrong direction.
Reference
For more on this topic, Beren’s post linked above is a great reference, and I’d highly recommend it for discussion of more problems with the infohazard concept.
I think that post assumes an incorrect model of scientific progress.
First: It’s not about people at all, it’s about ideas. And it seems much more defensible to claim that the impact ideas have on scientific problems is dominated by outliers: so-called “paradigms”. Quantum mechanics, general relativity, Shannon’s information theory, or the idea of applying the SGD algorithm to train very deep neural networks – all of those fields have “very compact generators” in terms of ideas.
These ideas then need to be propagated and applied, and in some sense, that takes up a “bigger” chunk of the concept-space than the compact generators themselves. Re-interpreting old physical paradigms in terms of the new theory, deriving engineering solutions and experimental setups, figuring out specific architectures and tricks for training ML models, etc. The raw information content of all of this is much higher than that of the “paradigms”/”compact generators”. But that doesn’t mean it’s not all downstream of said generators, in a very strong sense.
Second: And this is where the “Great Man Theory” can be re-introduced again, in a way more true to reality. It’s true that the bulk of the work isn’t done by lone geniuses. But as we’ve just established, the bulk of the work is relatively straightforward propagation and application of a paradigm’s implications. Not necessarily trivial – you still need to be highly mathematically gifted to make progress in many cases, say – but straightforward. And also factorizable: once a paradigm is in place, it can be propagated/applied in many independent directions at once.
The generation of paradigms themselves, however, is something a relatively small group of people can accomplish, by figuring out, based on hard-to-specify post-rigorous intuitions, in which direction the theory needs to be built. And this is something that might require unusual genius/talent/skillset.
Tenuously related: On this model, I think the purported “decline of genius” – the observation that there are no Einsteins or von Neumanns today – is caused by a change in scientific-research pipelines. Previously, a lone genius trying to cause a paradigm shift needed to actually finalize the theory before it’d be accepted, and publish it all at once. Nowadays, they’re wrapped up in a cocoon of collaborators from the get-go, the very first steps of a paradigm shift are published, and propagation/application steps likewise begin immediately. So there’s less of a discontinuity, both in terms of how paradigm shifts happen, and to whom they’re attributed.
I agree with you that there exist very compact generators, or at least our universe has some surprisingly compact generators like quantum physics, if you ignore the physical constant issues.
My fundamental claim is that this propagation-and-application work is actually really, really important, and arguably more important than the theory itself, especially in domains outside of mathematics, since it’s what makes a theory usable at all. In particular, I think ignoring the effort of actually putting a theory into practice is one of the main things LW gets wrong, and worse, this causes a lot of other issues, like undervaluing empirics, or believing that you need a prodigy to solve X problem entirely.
Much more generally, my points here are that the grunt work matters a lot more than LW thinks, and that the Great Man Theory of scientific progress hinders it by ignoring the actual grunt work and overvaluing the theory work. The logical inference/propagation and application arguably take up most of science that doesn’t adhere to formalistic standards, and there great people matter a lot less than LW content says they do.
Nothing New: Productive Reframing discusses this.
https://www.lesswrong.com/posts/ZZNM2JP6YFCYbNKWm/nothing-new-productive-reframing
That can’t be true, because the ability to apply a theory is dependent on having a theory. I mean, I suppose you can do technology development just by doing random things and seeing what works, but that tends to have slow or poor results. Theories are a bottleneck on scientific advancement.
I suppose there is some sense in which the immediate first-order effects of someone finding a great application for a theory are more impactful than that of someone figuring out the theory to begin with. But that’s if we’re limiting ourselves to evaluating first-order effects only, and in this case this approximation seems to directly lead to the wrong conclusion.
Any specific examples? (I can certainly imagine some people doing so. I’m interested in whether you think they’re really endemic to LW, or if I am doing that.)
Do you still think that the original example counts? If you agree that scientific fields have compact generators, it seems entirely natural to believe that “exfohazards” – as in, hard-to-figure-out compact ideas such that if leaked, they’d let people greatly improve capabilities just by “grunt work” – are a thing. (And I don’t really think most of the people worrying about them envision themselves Great Men? Rather than viewing themselves as “normal” researchers who may stumble into an important insight.)
AI in general is littered with this, but the point I want to make is that the entire deep learning revolution caught LW by surprise. While it did involve algorithmic improvement, it basically amounted to adding more compute and data, and for several years, even up until now, the theory of deep learning hasn’t caught up with its empirical success. In general, the stuff LW considered very important, like logic, provability, self-improvement, and generally strong theoretical foundations, all turned out not to matter all that much to AI.
Steelmaking is probably another example where theory lagged radically behind the empirical successes of the techniques, and overall an example of how empirical success can be found without a theoretical basis for it.
As for the difficulty of applying theories being important, I’d argue that evolution is the central example: while Darwin’s theory of evolution was very right, it took quite a lot of time for its logical implications to fully propagate. For bounded agents like us, just having the central idea doesn’t let us automatically derive all the implications of a theory, because logical inference is very, very hard.
I’d potentially agree, but I’d like the concept to be used a lot less, and a lot more carefully, than it is now.
I’m missing the context, but I think you should consider naming specific people or organizations rather than saying “LW”.
I’m specifically focused on Nate Soares and Eliezer Yudkowsky, as well as MIRI the organization, but I do think the general point applies, especially before 2012-2015.
It’s somewhat notable that before 2012, AlexNet hadn’t been published yet.
TBC, I think people savvy enough about AI should have predicted that ML was a pretty plausible path and that “lots of compute” was also plausible. (But it’s unclear if they should have put lots of probability on this with the information available in 2010.)
I am more pointing out that they seemed to tacitly assume that deep learning/ML/scaling couldn’t work, since they took all the real work to be what we would call better algorithms, and compute was not viewed as a bottleneck at all.
I generally like the linked essay by Beren, but I don’t like this linkpost, especially the title, because I dispute that Beren’s essay is on the topic of “the problem with infohazards as a concept”. My very strong impression is that Beren (like me) thinks that “infohazards” is a perfectly valid and useful concept. In particular, Beren’s essay starts with:
My opinion is: Infohazards exist. But people have to figure out in each individual instance whether something is or isn’t an infohazard (i.e., whether the costs of keeping it secret outweigh the benefits, versus the other way around). And (from my perspective) figuring that out is generally very hard.
For one thing, figuring out what is or isn’t an infohazard is inherently hard, because there are a bunch of considerations entering the decision, all of which involve a lot of uncertainty, partly because they may involve trying to guess things about future intellectual & tech progress (which is notoriously hard).
For another thing, it has poor feedback mechanisms—so you can be really bad at deciding what is or isn’t an infohazard, for a very long time, without noticing and correcting.
In this background context, Beren is arguing that, in his experience, people in AI alignment are systematically going wrong by keeping things secret that, all things considered, would have been better discussed openly.
That’s (a priori) a plausible hypothesis, and it’s well worth discussing whether it’s true or false. But either way, I don’t see it as indicating a problem with infohazards as a concept.
Sorry if I’m misunderstanding the OP’s point or putting words in anyone’s mouth.
While I mostly agree with you here, which is why I’ll change the title soon, I do think the point about encouraging a great man view of science and progress is very related to the concept of infohazards as used by the LW community. Infohazards, as used by the community, do tend to imply that small groups or individuals can discover world-ending technology, and I think a sort of “Great Person Theory” of science falls out of that.
Beren’s arguing that this very model is severely wrong for most scientific fields, which is a problem with the concept of infohazards as used on LW.
Edit: I did change the title.
Einstein wouldn’t be my first choice, if you want an archetypal example of a “Great Man” scientist. A better first choice would be Shannon. In the case of information theory, I’d say the Great Man model is just obviously basically correct.
With that example in mind, the “Great Man theory is just false” claim goes straight out the window; fields presumably vary in the extent to which Great Man models apply. So the relevant question is: is alignment more like information theory, or <whatever field would be the archetypal example of lots and lots of people all making marginal contributions>.
Hmm, I think information theory was due to the work of only a few people, but I seem to recall various people at Bell Labs claiming that they came up with basically similar stuff around the same time as Shannon (and that Shannon took more credit than was due). (I can’t find a source for this after very quickly looking, so maybe I’m just wrong.)
Of course, Shannon wrote the seminal work in the area which outlined basically all the key problems and proved nearly all the key results, regardless of whether other people came up with these results first.
Even if it’s true that Shannon claimed more credit than he was due, I still think that only a few people were involved with the creation of information theory. So maybe not lone genius, but only a few geniuses.
I think you might be confusing Shannon with Shockley, one of the guys who worked on the transistor? In that case there were definitely other people working on it, and Shockley was pretty unambiguously kind-of-an-asshole who definitely tried to grab credit.
In Shannon’s case, everything I’ve read (notably including The Idea Factory) indicates that Shannon basically developed the whole thing himself; it came out of left field and surprised basically everyone.
Hartley’s Transmission of Information was published in 1928, when Shannon was only 12 years old. Certainly Shannon produced a lot of new insights into the field, particularly in terms of formalizing things, but he did not invent the field. Are there particular advancements that Shannon in particular made that you expect would have taken many years to discover if Shannon had not discovered them?
Smart-ass answer: “yes, all of the advancements that Shannon in particular made”. That’s probably not literally true, but my understanding is that it is at least true for the central results, i.e. there was nobody else even remotely close to making the main central discoveries which Shannon made (most notably the source coding theorem and noisy channel coding theorem).
Telegraph operators and ships at sea, in the decades prior to World War II, frequently had to communicate in Morse code over noisy channels. However, as far as I can tell, none of them ever came up with the idea of using checksums or parity bits to leverage the parts of the message that did get through to correct for the parts of the message that did not. So that looks pretty promising for the hypothesis that Shannon was the first person to come up with the idea of using error correcting codes to allow for the transmission of information over noisy channels, despite there being the better part of a century’s worth of people who dealt with the problem.
But on inspection, I don’t think that holds up. People who communicated using Morse Code did have ways of correcting for errors in the channel, because at both ends of the channel there were human beings who could understand the context of the messages being passed through that channel. Those people could figure out probable errors in the text based on context (e.g. if you get a message “LET US MEET IN PARIS IN THE SLRINGTIME” it’s pretty obvious to the human operators what happened and how to fix it).
Let’s look at the history of Shannon’s life:
Claude Shannon was one of the very first people in the world who got to work with actual physical computers. In 1936, at MIT, he got to work with an analog computer, and in 1937 designed switching circuits based on the concepts of George Boole, whose work he had studied during his undergraduate years.
In the early 1940s, Shannon joined Bell Labs, where he worked on problems related to national defense, particularly fire control systems and cryptography. Transmitting encrypted messages across a noisy channel has an interesting property: where you might be able to identify and correct an error in the message “LET US MEET IN PARIS IN THE SLRINGTIME”, a single error in transmission of the encrypted version of that message will turn it to meaningless garbage.
So, rather than solving a problem that had been unsolved for the better part of a century, I think that Shannon was instead probably one of the first hundred humans who encountered this problem. Had he not made his discoveries, I expect someone else would have in quite short order.
I would not consider error-correcting codes one of the more central Shannon discoveries. They were a major application which Shannon’s discoveries highlighted, but not one of the more counterfactually impactful ideas in their own right.
I have a whole post here about how Shannon’s discoveries qualitatively changed the way people thought about information. Very briefly: the major idea is that information/channel capacity is fungible. We do not need different kinds of channels to carry different kinds of information efficiently. Roughly speaking, any channel just has a single quantitative “capacity” (measurable in bits), and any signal has a single quantitative “entropy” (measurable in bits), and so long as the entropy is less than the capacity we can send the signal over the channel with arbitrarily high reliability and asymptotically tiny overhead.
That’s the core idea which I expect the vast majority of technical people pre-Shannon did not see coming, and were not close to figuring out.
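(For reference, and not from the comment above: here is the standard textbook statement of the two quantities involved and the claim linking them, in the usual notation.)

```latex
% Entropy of a source X, in bits per symbol:
H(X) = -\sum_x p(x) \log_2 p(x)

% Capacity of a channel with input X and output Y, in bits per use,
% maximized over input distributions:
C = \max_{p(x)} I(X;Y)

% Shannon's claim, informally: reliable transmission (arbitrarily low
% error probability, asymptotically small overhead) is possible
% whenever the source's rate is below the channel's capacity:
H(X) < C
```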
Computers were not especially central to information theory IIUC, and Shannon was certainly not one of the first hundred (or thousand, or probably ten thousand) people to work on cryptography.
On an outside view… we can compare Shannon to another case at Bell Labs where the nominal discoverers clearly did not have much counterfactual impact: the transistor. Here was my takeaway on that invention from reading The Idea Factory:
That’s the sort of historical evidence which strongly indicates many people were on track to the invention. And that’s the sort of thing which we notably do not have, in Shannon’s case. The absence of that sort of evidence is at least some evidence that Shannon’s work was highly counterfactual.
Probably part of the difference is that, in the case of the transistor, there clearly was a problem there waiting to be solved, and multiple groups worked on that problem. In the case of information theory, it wasn’t clear that there was a problem to be solved at all: people mostly just assumed that of course different kinds of information need different kinds of transmission hardware, sure you can transmit Morse code over telephone but it’s going to be wildly inefficient because the phone lines were optimized for something else.
As you put it, Shannon was “probably one of the first hundred humans who encountered this problem”. But that’s not because only a hundred humans had worked on cryptography; it’s because there wasn’t obviously a problem there to be solved.
Yeah, I think that’s really the crux there. Whether the problem is legible enough for there to be a way to reliably specify it to anyone with the relevant background knowledge, vs. so vague and hazy you need unusual intuition to even suspect that there may be something in that direction.
So I think we might be talking past each other a bit. I don’t really have a strong view on whether Shannon’s work represented a major theoretical advancement. The specific thing I doubt is that Shannon’s work had significant counterfactual impacts on the speed with which it became practical to do specific things with computers.
This was why I was focusing on error correcting codes. Is there some other practical task which people wanted to do before Shannon’s work but were unable to do, which Shannon’s work enabled, and which you believe would have taken at least 5 years longer had Shannon not done his theoretical work?
This is an interesting question. Let’s come at it from first principles.
Going by the model in my own comment, Shannon’s work was counterfactually impactful plausibly because most people didn’t realize there was a problem to be solved there in the first place. So, in terms of practical applications, his work would be counterfactual mainly for things which people wouldn’t even have thought to try, or realized were possible, prior to information theory.
With that lens in mind, let’s go back to error-correcting codes; I can see now why you were looking there for examples.
Natural guess: Shannon’s counterfactual impact on error-correcting codes was to clearly establish the limits of what’s possible, so that code-designers knew what to aim for. It’s roughly analogous to the role of Joseph Black’s theory of latent heat in the history of the steam engine: before that theory, engines were wildly inefficient; Watt’s main claim to fame was to calculate where the heat was used, realize that mostly it went to warming and cooling the cylinder rather than the steam, then figure out a way to avoid that, resulting in massive efficiency gains. That’s the sort of thing I’d a-priori expect to see in the history of error-correcting codes: people originally did wildly inefficient things (like e.g. sending messages without compression so receivers could easily correct typos, or duplicating messages). Then, post-Shannon, people figured out efficient codes.
And I think the history bears that out. Here’s the wikipedia page on error-correcting codes:
So, the first method efficient enough to merit mention on the wikipedia page at all was developed by the guy who shared an office with Shannon, within a few years of Shannon’s development of information theory. And there had been nothing even remotely as efficient/effective for centuries before. If that’s not a smoking gun for counterfactual impact, then I don’t know what is.
Look at the date though. 1944. Might there have been some other external event causing the quantity of noisy channel transmitted information (such as radio messages) to have been increased at that point in time?
To use any form of error correcting code (vs simply a repeat of the message) you need a digital computer to apply the code. ENIAC is 1945.
Counterfactual. Shannon is licking the stamp to submit the key paper for peer review when an assassin shoots him and sets fire to the office.
Are you saying that human engineers in the 1940s and 50s, at the first point in human history where ECC is even useful or possible, wouldn’t invent a similar scheme? Note you don’t need the theory, you can cook up a code like this by hand by grouping the bits into nibbles and considering all 16 possible nibbles and all 4 single bit errors. You have 64 cases.
Can you find a code that uses less than 1 full nibble of redundancy? Could any of the thousands of other people working on digital computers in that era have thought of the idea?
If someone creates a less efficient (by say 1 bit) code does this change history in a meaningful way?
Remember, sending the message twice is 2 nibbles. Hamming(7,4) is 1 nibble + 3 bits. It is very slightly more efficient than double-send where you send 2 nibbles and each gets a parity bit, so:
7 bits total vs 10 bits total. (You can also go up to bytes or words and just use a parity bit for each, but fewer parity bits increase the probability of an undetected error.)
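For what it’s worth, here’s a minimal runnable sketch of the kind of scheme being described: the standard Hamming(7,4) code, which protects each nibble with 3 parity bits and corrects any single-bit error. (The function names and layout are mine, purely for illustration.)

```python
# Minimal Hamming(7,4) sketch: 4 data bits + 3 parity bits, corrects any single-bit error.
# Bit positions are 1..7; positions 1, 2, 4 hold parity bits, positions 3, 5, 6, 7 hold data.

def hamming74_encode(d):  # d: list of 4 data bits, e.g. [1, 0, 1, 1]
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4      # parity over positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4      # parity over positions 3, 6, 7
    p3 = d2 ^ d3 ^ d4      # parity over positions 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]  # 7-bit codeword

def hamming74_decode(c):  # c: list of 7 bits, possibly with one flipped bit
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # recompute the three parity checks (syndrome)
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 0 means no error; otherwise the 1-based error position
    if syndrome:
        c = c[:]
        c[syndrome - 1] ^= 1         # flip the corrupted bit back
    return [c[2], c[4], c[5], c[6]]  # extract the 4 data bits

# Example: encode a nibble, corrupt one bit, and recover the original data.
codeword = hamming74_encode([1, 0, 1, 1])
corrupted = codeword[:]
corrupted[5] ^= 1
assert hamming74_decode(corrupted) == [1, 0, 1, 1]
```

This is the 7-bits-per-nibble scheme referred to above, versus 10 bits for double-send with a parity bit per copy.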
Interestingly, by learning-curve effects (Moore’s law) there is negligible gain here. 30% is nothing if the cost of a transistor is halving every 2-3 years. This may apply to AI as well; for example, any linear improvement by a small factor is nothing.
Given the lack of simultaneous discovery over at least 5 years, no, nobody else thought of that idea (or if they did, they didn’t know how to make it work efficiently).
Look, man, you can pick any date in the past 200 years and there will be Significant and Plausibly-Relevant events which happened around that time. In this case, I’d say the relevant events are less plausibly counterfactually impactful than average—after all, WWII was not the first war with widespread radio use (also this was at Bell Labs, whose big use case was telephone lines), and no, you do not need a digital computer to apply an error-correcting code (I’ve personally done it by hand in homework assignments; there’s a learning curve but the complexity is not prohibitive). It is not at all plausible that the 40s and 50s were the first point in history where error-correcting codes were possible, and unlikely that they were the first point in history where error-correcting codes would have been useful.
We can come along today and see how easy it would have been for somebody to figure this stuff out, but that’s hindsight bias. Insofar as we can glean evidence about counterfactual worlds by looking at historical evidence at all, the evidence here generally points to the role of information theory being counterfactual: there was not prior or simultaneous discovery, and the main discovery which did happen was by the guy sharing an office with Shannon, within a few years of information theory.
(Notably, evidence against those core facts are the main things which would change my mind here: if there were prior or simultaneous discovery of reasonably-efficient error-correcting codes, by someone who didn’t know about info theory, then that would be clear evidence that Shannon’s work was not counterfactual for efficient error-correcting codes.)
You need a digital computer for the code to be practical. Otherwise why not just repeat every message 2-3 times? When humans are doing it, the cost of sending a message twice is less than the time it takes a human computer to do the math to reconstruct a message that may have been corrupted. This is because a human telegraph operator is also the I/O.
When I googled for it I found memoirs from the time period that credit Hamming, so I can’t prove the counterfactual; I’ll just note that there are a lot of possible encoding schemes that let you reconstruct corrupted messages.
But yeah sure, I am thinking from hindsight. I have done byte level encoding for communications on a few projects so I am very familiar with this, but obviously it was 60 years later.
One concrete example of a case where I expect error-correcting codes (along with compression) would have been well worth the cost: 19th-century transatlantic telegraph messages, and more generally messages across lines bottlenecked mainly by the capacity of a noisy telegraph line. In those cases, five minutes for a human to encode/decode messages and apply error correction would probably have been well worth the cost for many users during peak demand. (And that’s assuming they didn’t just automate the encoding/decoding; that task is simple enough that a mechanical device could probably do it.)
For the very noisy early iterations of the line, IIRC messages usually had to be sent multiple times, and in that case especially I’d expect efficient error-correcting codes to do a lot better.
I’d be very interested if anyone has specific examples of ideas like this they could share (that are by now widely known or obviously not hazardous). I’m sympathetic to the sorts of things the article says, but I don’t actually have any picture of the class of ideas it’s talking about.
I’m not “on the inside”, but my understanding is that some people at Conjecture came up with chain-of-thought prompting and decided that it was infohazardous, I gather fairly shortly before preprints describing it came out in the public AI literature. That idea does work well, but was of course obvious to any schoolteacher.
I found other parts of the post a lot more convincing than this part of the post, and almost didn’t read it because you highlighted this part of the post. Thankfully I did!
Here are the headings:
Infohazards prevent seeking feedback and critical evaluation of work
Thinking your work is infohazardous leads to overrating novelty or power of your work
Infohazards assume an incorrect model of scientific progress
Infohazards prevent results from becoming common knowledge and impose significant frictions
Infohazards imply a lack of trust, but any solution will require trust
Infohazards amplify in-group and social status dynamics
Infohazards can be abused as tools of power
Infohazards fail the ‘skin in the game’ test
Alright, I think I’ve figured out what my disagreement with this post is.
A field of research pursues the general endeavor of finding out things there are to know about a topic. It consists of building an accurate map of the world, of how-things-work, in general.
A solution to alignment is less like a field of research and more like a single engineering project. A difficult one, for sure! But ultimately, still a single engineering project, for which it is not necessary to know all the facts about the field, but only the facts that are useful.
And small groups/individuals do put together single engineering projects all the time! Including very large engineering projects like compilers, games & game engines, etc.
And, yes, we need solving alignment to be an at least partially nonpublic affair, because some important insights about how to solve alignment will be dual-use, and the whole point is to get the people trying to save the world to succeed before the people functionally trying to kill everyone, not to get the people trying to save the world to theoretically succeed if they had as much time as they wanted.
(Also: I believe this post means “exfohazard”, not “infohazard”)
Suggestion: if you’re using the framing of alignment-as-a-major-engineering-project, you can re-frame “exfohazards” as “trade secrets”. That should work to make people who’d ordinarily think that the very idea of exfohazards is preposterous[1] take you seriously.
As in: “Aren’t you trying to grab too much status by suggesting you’re smart enough to figure out something dangerous? Know your station!”
tbh I kinda gave up on reaching people who think like this :/
My heuristic is that they have too many brainworms to be particularly helpful to the critical parts of worldsaving, and it feels like it’d be unpleasant and not-great-norms to have a part of my brain specialized in “manipulating people with biases/brainworms”.
I don’t think that reframing is manipulation? In my model, reframing between various settings is a necessary part of general intelligence—you pose the problem and switch between frameworks until you find one where the solution-search path is “smooth”. The same goes for communication—you build various models of your interlocutor until you find the shortest inference path.
I meant when interfacing with governments/other organizations/etc., and plausibly at later stages, when the project may require “normal” software engineers/specialists in distributed computations/lower-level employees or subcontractors.
I agree that people who don’t take the matter seriously aren’t going to be particularly helpful during higher-level research stages.
I don’t think this is really manipulation? You’re communicating an accurate understanding of the situation to them, in the manner they can parse. You’re optimizing for accuracy, not for their taking specific actions that they wouldn’t have taken if they understood the situation (as manipulators do).
If anything, using niche jargon would be manipulation, or willful miscommunication: inasmuch as you’d be trying to convey accurate information to them in a way you know they will misinterpret (even if you’re not actively optimizing for misinterpretation).
From the link:
If we’re proposing this as a general criterion, as opposed to “a starting-point heuristic that would have been an improvement in Beren’s microenvironment”, then I have some complaints.
First, before we even start, I think “claim something is infohazardous” is not where I’d necessarily be locating the decision point. I want to ask: What exactly are you deciding whether or not to say, and to whom, and in what context? For example, if there are paradigm shifts between us and TAI, then the biggest infohazards would presumably involve what those paradigm-shifts are. And realistically, whatever the answer is, someone somewhere has already proposed it, but it hasn’t caught on, because it hasn’t so far been demonstrated to work well, and/or seems a priori implausible to most people, and/or is confusing or antimemetic for some other reason, etc. So if a random person says “I think the paradigm shift is X”, that’s probably not an infohazard even if they’re right, because random people make claims like that all the time and nobody listens. By contrast, if they are not a random person but rather a prominent leader in the field giving a NeurIPS keynote, or unusually good at explaining things in a compelling way, or controlling an AI funding source, or publishing turnkey SOTA code, or whatever, that might be different.
Second, “it empirically works on actual ML systems at scale” shouldn’t be a necessary criterion in general, in my opinion. For example, if there are paradigm shifts between us and TAI (as I happen to believe), then maybe we can think of actual ML systems as airplanes, and TAI as a yet-to-be-invented rocket ship, in an imaginary world where modern aircraft exist but rocket ships haven’t been invented yet. Anyway, early rocket ships are going to be far behind the airplane SOTA; in fact, for quite a while, they just won’t work at all, because they need yet-to-be-invented components like rocket fuel. Developing those components is important progress, but would not “empirically” advance the (airplane) SOTA.
I think about this a lot in the context of neuroscience. IMO the most infohazardous AI work on earth right now involves theoretical neuroscience researchers toiling away in obscurity, building models that don’t do anything very impressive. But we know that fully understanding the human brain is a recipe for TAI. So figuring out a little piece of that puzzle is unarguably a step towards TAI. The key is that: one piece of the puzzle, in the absence of all the other pieces, is not necessarily going to do anything impressive.
Third, (4) is not really how I would frame weighing the costs and benefits. For example, I would be especially interested in the question: “If this is true, then do I expect capabilities researchers to figure this out and publish it sooner or later, before TAI?” If so, it might make sense to just sit on it and find something else to work on until it’s published by someone else, even if it does have nonzero safety/alignment implications. Especially if there are other important alignment things to work on in the meantime (which there probably are). This has the side benefit of being a great strategy in the worlds where the idea is wrong anyway. It depends on the details in lots of ways, of course. More discussion & nuance in my post here, and see also Charlie’s post on tech trees.
(None of this is to invalidate Beren’s personal experience, which I think is pretty different from mine in various ways, especially that he’s working closer to the capabilities research mainstream, and was working at Conjecture, which actually wanted to be out ahead of everyone else in (some aspects of) AI capabilities IIUC, whereas I’m speaking from my own experience doing alignment research, which has a path-to-impact that mostly doesn’t require that.)
I think the question you need to ask yourself is “Given that there are O(10-100x) more Capabilities researchers than Alignment researchers, if I sit on this (exciting but untested) Capabilities idea I just had (which I don’t recall reading anywhere), will that delay Capabilities by more months than the infohazard procedures required to keep this quiet will slow down Alignment? And how sure am I that this idea (if in fact viable) is a pure Capabilities idea and doesn’t have any significant Alignment uses?”
I don’t think the concept of infohazard as applied to AI alignment/safety has anything to do with the Great Man Theory. If we bought the Great Man Theory, we would also have to believe that at any time a random genius could develop ASI using only their laptop and unleash it onto the world, in which case, any hope of control is moot. Most people who support AI governance don’t believe things are quite that extreme, and think that strategies ranging from “controlling compute” to “making it socially disreputable to work on AI capabilities” may effectively delay the development of AGI by significantly hurting the big collaborative projects.
On the flip side, would alignment work proceed much faster and more successfully if thousands of researchers with billions of dollars of funding worked on it exclusively, building on each others’ intuition? Of course it would. But we do not live in that world. We live in a world in which economic incentives have aligned things so that the power of numbers lies on the side of capabilities. In this world, if you share your genius interpretability insight (which you may have, because while science isn’t made only of Great Men on average, it certainly can advance in random leaps and bounds depending on individual contributions, and here even a mere 10 or 20 years of difference in the time of a discovery can be crucial), well, it’s much more likely that it will just be used to make an even more powerful AI before anyone manages to use it to align the ones we have. So, keeping info contained isn’t the best thing to do in some absolute sense, but it may be the best thing we can do now, as inefficient as it is, because it’s the only random edge that a small community competing with a much larger and more powerful one can hope to gain.
Yeah, this is definitely something that’s more MIRI specific, though I’d make the case that the infohazard concept as used by the LW community kinda does invite the Great Man Theory of science and technology because infohazards tend to connote the idea that there are ridiculously impactful technologies that can be found by small groups. But yeah, I do buy that this is a more minor problem, compared to the other problems with infohazards.
My fundamental crux here is that this is ultimately going to result in more high-quality alignment work being done than LW will do. The incentives for capabilities also create incentives for AI safety and control; a lot of this fundamentally comes down to companies internalizing the costs of uncontrolled AI far more than is usual. There are also direct incentives to control AI, because an uncontrollable AI is not nearly as useful to companies as LW thinks, which implies that the profit incentives will go toward solving AI control problems.
No one thinks it is. An uncontrollably warming up world isn’t very useful to fossil fuel companies either, but fossil fuel companies can’t stop to care about that, because it’s too long term and they have to optimize profits on a much shorter time horizon. The argument isn’t “evil companies profit from unleashing unaligned AI”, it’s “dumb badly coordinated companies unleash unaligned AI while trying to build aligned AI while also cutting costs and racing with each other”. Cheap, Fast, and Doesn’t Kill Everyone: choose only two.
I don’t know for sure if the infohazard concept is that useful. I think it could be only given certain assumptions. If you discovered a concept that advances alignment significantly and can’t be used much for capabilities you should definitely scream it to everyone listening, and to many who aren’t. But “this would lead to better alignment research” isn’t very useful if it leads to proportionally even stronger capabilities. The goal here isn’t just maximizing alignment knowledge, it’s closing a gap. Relative speed matters, not just absolute. That said, you may be right—but we’re already discussing something far different from “Great Man Theory”. This is just a highly peculiar situation. It would be extremely convenient if some genius appeared who can solve alignment overnight on their own just starting from existing knowledge. It’s not very likely, if the history of science is anything to go by, because alignment is probably harder than, say, the theory of relativity. But however unlikely it is, it might be a better hope than relying on collaborative projects that doom alignment to just keep lagging behind even if it in general advances faster.
I don’t quite follow. Infohazards mean “some information is dangerous”. This doesn’t require small groups, in fact “it’s safer if this information is only held by a small group rather than spread to the world at large” is inherently more true if you assume that Great Men Theory is false, because regardless of trust in the small group, the small group will just be less able to turn the information into dangerous technology than the collective intellect of humanity at large would be.
That is an extremely easy choice. “Doesn’t Kill Everyone” is blatantly essential. “Fast” is unfortunately a requirement, given that the open-source community intent on releasing everything to every nutjob, criminal, and failed state on the planet is only ~18 months behind you, so waiting until this can be done cheaply means there will be many thousands of groups that can do this, and we’re all dead if any one of them does something stupid (a statistical inevitability). So the blindingly obvious decision is to skip “Cheap” and hit up the tech titans for tens of billions of dollars, followed by Wall Street for hundreds of billions of dollars as training costs increase. While also clamming up on publishing capabilities work and only publishing alignment work, and throwing grant money around to fund external alignment research. Which sounds to me like a description of the externally visible strategies of the ~3 labs making frontier models at the moment.
I honestly think this is still cheap. Non-cheap would be monumentally bigger and with much larger teams employed on alignment to attack it from all angles. I think we’re seeing Cheap and Fast, with the obvious implied problem.
You’re talking about a couple of thousand extremely smart people, quite a few of them alignment researchers (some of whom post regularly on Less Wrong/The Alignment Forum), and suggesting they’re all not noticing the possibility of the extinction of the human race. The desirability of not killing everyone is completely obvious to anyone aware that it’s a possibility. Absolutely no one wants to kill themselves and all their friends and family. (This is obviously not a problem that a private bunker will help with: a paperclip maximizer will want to turn that into paperclips too. I believe Elon Musk is on record pointing out that Mars is not far enough to run.) Yes, there are people like Yann LeCun who are publicly making it clear that they’re still missing the point that this could happen any time soon. On the other hand, Sam Altman, Ilya Sutskever, Dario & Daniela Amodei, and Demis Hassabis are all on public record, with significant personal reputational skin in the game, saying that killing everyone is a real risk in the relatively near term, and also that not doing so is obviously vital, while Sundar Pichai is constitutionally incapable of speaking in public without using the words ‘helpful’, ‘responsible’ and ‘trustworthy’ at least once every few paragraphs, so it’s hard to tell how worried he is. OpenAI routinely delays shipping its models for ~6 months while it and other external groups do safety work, Google just delayed Gemini Ultra for what sounds rather like safety reasons, and Anthropic is publicly committed to never ship first, and never has. This is not what “cheap”+“fast” looks like.
Tens to hundreds of billions of dollars is not cheap in anyone’s books, not even the tech titans’. Add Google’s and Microsoft’s entire current market capitalizations together and you get 4 or 5 trillion. The only place we could get significantly more money than that to throw at the problem is the US government. Now, it is entirely true that the proportion of that largesse going to alignment research isn’t anything like as high as the proportion going to build training compute (though OpenAI did publicly commit 20% of their training compute to AGI-level alignment work, and that’s a lot of money). But if they threw a couple of orders of magnitude more than the $10m in grants that OpenAI just threw at alignment, are there enough competent alignment researchers to spend it without seriously diminishing returns? I think alignment field-building is the bottleneck.
Just because this isn’t cheap relative to world GDP doesn’t mean it’s enough. If our goal were “build a Dyson sphere”, even throwing our whole productivity at it would be cheap. I’m not saying there aren’t any concerns, but the money is still mostly going to capabilities, and safety, while a concern, still needs to be compromised with commercial needs and race dynamics—albeit mercifully dampened. Honestly, with LeCun’s position we’re just lucky that Meta isn’t that good at AI, or they alone would set the pace of the race for everyone else.
I think Meta have been somewhat persuaded by the Biden administration to sign on for safety, or at least for safety theatre, despite LeCun. They actually did a non-trivial amount of real safety work on Llama-2 (a model small enough not to need it), and then never released one size of it for safety reasons. Which was of course pointless, or more exactly just showing off, since they then open-sourced the weights, including the base models, so anyone with $200 can fine-tune their safety work out again. However, it’s all basically window dressing, as these models are (we believe) too small to be an x-risk, and they were reasonably certain of that before they started (as far as we know, about the worst these models can do is write badly-written underage porn or phishing emails, or similarly marginally assist criminals).
Obviously no modern models are an existential risk; the problem is the trajectory. Does the current way of handling the situation extrapolate properly to even just AGI, something that is an openly stated goal for many of these companies? I’d say not, or at least, I very much doubt it. As in, if you’re not doing that kind of work inside a triple-airgapped and firewalled desert island, and planning for layers upon layers of safety testing before even considering releasing the resulting product as a commercial tool, you’re doing it wrong—and that’s just for technical safety. I still haven’t seen a serious proposal for how you make human labor entirely unnecessary while maintaining a semblance of economic order instead of collapsing every social and political structure at once.