Alignment researchers should think hard about switching to working on AI Control
I think Redwood Research’s recent work on AI control really “hits it out of the park”, and they have identified a tractable and neglected intervention that can make AI go a lot better. Obviously we should shift labor until the marginal unit of research in either area decreases P(doom) by the same amount. I think that implies lots of alignment researchers should shift to AI control type work, and would naively guess that the equilibrium is close to 50⁄50 across people who are reading this post. That means if you’re working on alignment and reading this, I think there’s probably a ~45% chance it would be better for your values if you instead were working on AI control!
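As a rough formalization of the labor-shifting claim (just a sketch, in my notation rather than anything argued carefully in the post): writing $L_a$ and $L_c$ for labor allocated to alignment and to control, the optimal allocation satisfies

$$\frac{\partial\, P(\mathrm{doom})}{\partial L_a} \;=\; \frac{\partial\, P(\mathrm{doom})}{\partial L_c},$$

and as long as a marginal researcher reduces P(doom) more by working on control than by working on alignment, labor should keep shifting toward control.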
For this post, my definitions are roughly:
AI alignment is the task of ensuring the AIs “do what you want them to do”
AI control is the task of ensuring that if the AIs are not aligned (e.g. don’t always “do what you want” and potentially want to mess with you), then you are still OK and can use them for economically productive tasks (an important one of which is doing more alignment/control research.)
Here are some thoughts, arguments, and analogies (epistemic status: there is no “hidden content”, if you don’t find the literal words I wrote persuasive you shouldn’t update. In particular, just update on the words and don’t update about what my words imply about my beliefs.):
Everything is in degrees. We can “partially align” some AIs, and things will be better if we can use those AIs for productive tasks, like helping with alignment research. The thing that actually matters is “how aligned are the AIs” + “how aligned do they need to be to use them for stuff”, so we should also focus on the 2nd thing.
If you were a hedge fund, and your strategy for preventing people from stealing your data and starting a new hedge fund was “we will make the hedge fund a super fun place to work and interview people carefully to make sure they really care about working for us”, and you also did not do things like “maybe people shouldn’t be able to download all the trading strategies onto a USB stick which we let them take home”, “maybe people should be forced to keep their work computers in the office” and “maybe we should have some people checking if other employees are trying to steal data and make some plans about how to stop them from starting a rival hedge fund if they do”, then all the other hedge funds would think you were an idiot. You should obviously try both to check if people want to defect, and also make it harder for them to defect.
I think a lot of the neglect of AI control can be traced to the “pivotal act” based framings of how to make aligned AI go well. I think these views are mistaken, but I won’t provide a great argument here (quick analogy: if the sum of a bunch of numbers is large, there doesn’t need to be any individual number that is large; similarly, if a sequence of actions results in a large change, no individual action needs to be “pivotal”). Regardless, I don’t think things are shaping up for single pivotal acts to be very important, and being able to productively use early AI labor to help with alignment seems like it is going to be an important part of the future going well.
Working on AI control is really tractable compared to working on alignment. There is a lot of prior art in people thinking about how to e.g. prevent data from being stolen by insider threats, secure various information against adversaries, generally ensure that national security secrets don’t get leaked, etc. We can set up various plausible-looking scenarios and try to “game them out” under certain plausible-seeming assumptions about how much alignment we’ll be able to achieve. We can make assumptions about AI capabilities and alignment without really getting into the nitty gritty of philosophically confusing topics like “what is deception really?” and “how likely is deception?” and “when will this all go down?”
Note that this means control might be easier for the AIs to automate, and easier to do “during crunch time.”
It also means that AI control is a bit contiguous with work people already do to protect from adversaries. I think the AI thing qualitatively changes the picture though, so it’s still worth thinking about.
If you succeed at AI control, there is a more compelling story you can tell for why you should implement the specific scheme that you thought of. In comparison, a lot of alignment proposals depend on philosophically dubious and confusing assumptions about the nature of cognition/SGD/etc. That is to say, a successful proposal in AI control comes with an argument that might be convincing to established security professionals if they thought about it, but there is likely no such class of professionals capable of lending your proposal external legitimacy in AI alignment.
The intelligence explosion might happen with less-fully-AGI AIs, who will also be doing some alignment work on the side. It’s important for them to not escape and do other bad stuff until they’ve solved alignment. We can give ourselves more time to use smart AIs to help with alignment if we have better AI control.
There is a bunch of research that is useful for both alignment and control, e.g. model organisms of deception, interpretability techniques, oversight techniques, etc. More people should analyze such research from the perspective of “how can this be helpful even if it fails to produce an aligned AI?”
Better control solutions make AI more economically useful, which speeds up the AI race and makes it even harder to do an AI pause.
When we have controlled unaligned AIs doing economically useful work, they probably won’t be very useful for solving alignment. Alignment will still be philosophically confusing, and it will be hard to trust the alignment work done by such AIs. Such AIs can help solve some parts of alignment problems, parts that are easy to verify, but alignment as a whole will still be bottle-necked on philosophically confusing, hard to verify parts.
Such AIs will probably be used to solve control problems for more powerful AIs, so the basic situation will continue and just become more fragile, with humans trying to control increasingly intelligent unaligned AIs. This seems unlikely to turn out well. They may also persuade some of us to trust their alignment work, even though we really shouldn’t.
So to go down this road is to bet that alignment has no philosophically confusing or hard to verify parts. I see some people saying this explicitly in the comments here, but why do they think that? How do they know? (I’m afraid that some people just don’t feel philosophically confused about much of anything, and will push forward on that basis.) But you do seem to worry about philosophical problems, which makes me confused about the position you take here.
BTW I have similar objections to working on relatively easy forms of (i.e., unscalable) alignment solutions, and using the resulting aligned AIs to solve alignment for more powerful AIs. But at least there, one might gain some insights into the harder alignment problems from working on the easy problems, potentially producing some useful strategic information or making it easier to verify future proposed alignment solutions. So while I don’t think that’s a good plan, this plan seems even worse.
Better control solutions make AI more economically useful, which speeds up the AI race and makes it even harder to do an AI pause.
[...]
Such AIs will probably be used to solve control problems for more powerful AIs, so the basic situation will continue and just become more fragile, with humans trying to control increasingly intelligent unaligned AIs.
It currently seems unlikely to me that marginal AI control research I’m excited about is very economically useful. I agree that some control or control-adjacent research will end up being at least somewhat economically relevant.
I can also imagine worlds where AIs are clear schemers that are clearly unproductive to use by default, and thus control increases deployment through both straightforward incentives and risk compensation (though this maybe requires the AIs to have made a mistake or to have somewhat strange preferences as they would probably prefer fooling our tests).
My understanding is that your concerns on the downside are maybe some combination of:
AI control research will be economically very useful (e.g. accelerates AI by 6 months conditional on 5 year timelines).
Control will result in irrational risk compensation style actions where reducing earlier risks provides false comfort about later risks (as control isn’t scalable) and this will ultimately make the situation worse.
AI control prevents earlier warning shots which would have created more will for costly actions that save us later. (I’m not sure if this is actually your concern, though some statements pattern matched to this.)
I’m very skeptical of (1) for marginal control research. It feels like this has to imply incredibly strong levels of inefficiency and usefulness of control research. Perhaps you’re imagining a smaller acceleration effect, but think the benefits are also tiny relative to this small acceleration?
I’m somewhat sympathetic to things being somewhat inefficient such that if the field of people who might do control research instead did capabilities, that would be a notable acceleration (maybe 5%), but it seems like control research should be a massive reduction in acceleration relative to this.
On (2), this sounds like a weird intermediate level of competence where they are actually sensitive to full on scheming risks at an earlier point (when control works), but end up thinking this will scale despite it being relatively obvious it won’t. I agree that control might end up being an excuse for scaling, but I don’t think that it will be a counterfactual excuse (some other excuse would have been found) and I think we should just make sure to argue on the merits at that point.
I do think that (2) points toward focusing on advancing evaluation methodology more than techniques such that it would be relatively clear when things fail if people are being unreasonable and it is relatively easy to call out unreasonableness.
I think not doing control due to (3) can put you in pretty scary situations: how can we know when you’ll get scary warning shots vs full AI takeover? Further, AIs presumably want to avoid warning shots though it is unclear how much AIs will systematically want to advance the interests of other AIs. At a more basic level I don’t love arguments of the form “actually, having bad things happen will actually be good, so we shouldn’t try to prevent bad things which are pretty close to the actual bad things we’re worried about”. All that said, I’m sympathetic to researchers avoiding work on mitigating misuse or smaller-than-full-takeover harms due to this motivation. (I still think AI companies should have an obligation to make it very clear what risks they are imposing and to try to mitigate risks at multiple scales, but this obligation doesn’t apply to researchers. If it is hard to hire good people to work on misuse, try paying more.)
When we have controlled unaligned AIs doing economically useful work, they probably won’t be very useful for solving alignment. Alignment will still be philosophically confusing, and it will be hard to trust the alignment work done by such AIs.
As in, your view is that:
Greatly accelerating all alignment work which isn’t somewhat philosophically/conceptually confusing won’t be very useful for solving alignment. (Because of conceptual bottlenecks.)
Philosophy/conceptual work is sufficiently hard to check when an AI is trying to screw you over that control can’t force scheming AIs to notably help with this work. (Note that even if evaluation in this regime is much harder than generation, that isn’t clearly a blocker, because we potentially only have to supervise a small subset of outputs.) (Note that this is distinct from the view that AIs won’t be capable enough to help with this work, as is currently the case.)
I think there is a lot of very helpful alignment work which isn’t conceptually bottlenecked and agendas with fewer conceptual barriers could go very far.
Further, I’m skeptical about evaluation being so hard.
Beyond that, I’m only imagining part of the theory of change of control work being to “solve alignment” or work on alignment.
Alternatives:
Buy time by both making it cheaper for actors to stall and also making risks better understood[1].
Also buy time by pushing AI takeover later in time.
Work on alternative exit conditions like emulated minds (difficult but maybe doable).
Additional time with very powerful AIs seems useful for studying them even if we can’t get any acceleration from automation. Pausing at the brink is more difficult in terms of costs and coordination (as there will be more close actors and more incentive), but it might also be radically easier to make the case for pausing relative to today. So, it overall seems easier to arrange and time at the brink also seems much more useful all around.
(I also think that pausing looks less good if it isn’t part of a well implemented, thoughtful and extremely high effort approach which e.g. actively intervenes with hardware and is aggressive on international efforts. So, I’m not very into pausing “for the wrong reasons”.)
(Edited to add)
I’d like to argue that there is a lot of helpful stuff which isn’t conceptually bottlenecked.
Concretely, let’s imagine targeting aligning AI systems which are capable enough to dominate top human experts in speed and cost across all relevant domains (including philosophy/conceptual work). In particular, we want to align them well enough that we’re happy to totally defer to them on really tricky questions like “what should be our high level approach for handling risks for further AI systems”[2]. I think this is a pretty good target for a rational actor, since then you can actually hand off decision making to such AIs. (They might decide to work on alignment of smarter systems or pursue totally different routes.)
It seems difficult though not impossible that we could reasonably confidently (95%)[3] reach this target via very prosaic/empirical approaches using huge amounts of AI labor (e.g. a combination of testing things on huge numbers of model organisms, doing some high level interp (both white and black box) to get a bit of a sense for what is going on in AI systems in general, and verifying that AIs seem to do a good job in held out conceptual domains where we happen to know the answers). And, I think prosaic approaches with huge amounts of labor could also substantially increase chances of hitting this target (e.g. 4x risk reduction) even if we can’t hit this level of confidence.
This doesn’t really look like “solve alignment”, but in practice it reduces risk a bunch.
It’s also possible that better understanding of risks will indicate that proceeding to wildly superhuman AI immediately is fine, though I’m skeptical.
Obviously, 95% confidence is objectively unreasonable, but it is a notable improvement over doing nothing, particularly in worlds where we find evidence for misalignment.
My vague plan along these lines is to attempt as hard as possible to defer all philosophically confusing questions to the “long reflection”, and to use AI control as a tool to help produce AIs that can help preserve long term option value (including philosophical option value) as best as possible.
I separately have hope we can solve “the entire problem” at some point, e.g. through ARC’s agenda (which I spend most of my time trying to derisk and advance).
People interested in a discussion about control with someone who is maybe closer to Wei Dai’s perspective might be interested in my dialogue with habryka.
I think if the first powerful unaligned AI remained in control instead of escaping, it might make a good difference, because we can engineer and test alignment ideas on it, rather than develop alignment ideas on an unknown future AI. This assumes at least some instances of it do not hide their misalignment very well.
I think a key difference is I do believe the technical alignment/control problem as defined essentially requires no philosophical progress or solving philosophical problems like the hard problem of consciousness, and I believe the reason for this comes down to both a general point and a specific point.
In general, one of the reasons I believe philosophy tends not to be a productive area compared to other branches of science is that philosophers usually either work on problems that are by now essentially proven to be intractable, or try to solve a problem in far too much generality without doing any experiments, and that’s when they aren’t straight up solving fictional problems (I believe a whole lot of possible-world philosophizing is in that category).
This is generally because philosophers do far too much back-chaining compared to front-chaining on a lot of problems.
For the specific point of alignment/control agendas, it’s because the problem of AI alignment isn’t a problem about what goals you should assign the AI, but rather whether you can put goals into the AI system such that the AI will reliably follow your goals at all.
I really think if you want to tell a story of AI Control work being good (especially compared to working on preventing AGI from being built in the first place), the important and difficult part is figuring out how to actually use these more powerful AI systems to either achieve some kind of global moratorium, or make unprecedented progress on the hard parts of the AI Alignment problem.
When I see most people start thinking about control, I rarely see them interface with either of these two things, and honestly, I mostly see them come up with cool additional commercial applications of AI, which I think will mostly make the present situation worse. To be clear, I think thinking about how to use AI systems for global coordination or making progress on the hard parts of the AI Alignment problem is tractable, but I haven’t seen much work on it, and I am worried it is not the default thing that people will work on when they start thinking about control.
I think it is unlikely for control work to buy humanity much time until someone builds a very powerful unaligned AI system, at least at our present levels of coordination tools. I think controlled AI systems might help with coordination here, but I do think someone needs to do the actual hard work of thinking about how it might.
FWIW, I don’t think of myself as having a super strong opinion on whether working on control is better than working on preventing people from building catastrophically dangerous AI (and I’ve seriously considered doing the latter).
(I do feel more optimistic than you about being able to use controlled AIs for something useful.)
@Buck do you or Ryan have a writeup that includes: (a) a description of the capabilities of a system that you think would be able to do something useful for the sorts of objectives that Habryka talks about and (b) what that something useful is.
Bonus points if it has (c) the likelihood that you think such a system will be controllable by 20XX and (d) what kind of control setup you think would be required to control it.
On (a) and (b), we describe this at a high level here.
We don’t really have anything written on (c) or (d). (c) really depends a lot on effort, so I’d probably prefer to talk mostly about (d) including what evaluations would be needed at various points etc.
For (a), I think we potentially care about all of:
Systems which are perhaps qualitatively similarly smart to OK software engineers and which are capable of speeding up R&D work by 10x (speedups aren’t higher due to human bottlenecks; see the rough sketch below). (On a nearcast, we’d expect such systems to be very broadly knowledgeable, pretty fast, and very well tuned for many of their usages.)
Systems which nearly strictly dominate top human scientists on capability and which are perhaps similar in qualitative intelligence (I’d guess notably, but not wildly, weaker, compensating in various ways). Such systems likely have some domains/properties in which they are much better than any human or nearly any human.
Systems which are qualitatively smarter than any human by a small amount.
It’s likely control is breaking down by (3) unless control ends up being quite easy or the implementation/evaluation is very good.
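As a rough sketch of the human-bottleneck point in (1) (a toy serial-work model, not something we’ve carefully argued for): if a fraction $h$ of the relevant R&D work remains human-paced while the rest is accelerated by a factor $s$, the overall speedup is

$$\text{speedup} \;=\; \frac{1}{h + (1-h)/s},$$

which tends to $1/h$ as $s \to \infty$; e.g. with $h = 0.1$, even arbitrarily fast AI labor on the automatable portion caps the overall speedup at roughly 10x.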
On (b) we plan on talking more about this soon. (Buck’s recent EAGx talk is pretty up to date with our current thinking, though this talk is obviously not that detailed. IDK if you can find a recording anywhere.)
Does that mean that you believe that after a certain point we would lose control over AI? I am new to this field, but doesn’t this fact spell doom for humanity?
By “control”, I mean AI Control: approaches aiming to ensure safety and benefit from AI systems, even if they are goal-directed and are actively trying to subvert your control measures.
AI control stops working once AIs are sufficiently capable (and likely doesn’t work for all possible deployments that might eventually be otherwise desirable), but there could be other approaches that work at that point, in particular aligning systems.
The main hope I think about is something like:
Use control until AIs are capable enough that if we trusted them, we could obsolete top human scientists and experts.
Use our controlled AI labor to do the work needed to make systems which are capable enough, trustworthy enough (via alignment), and philosophically competent enough that we can safely hand things off to them. (There might be some intermediate states to get to here.)
Have these systems which totally obsolete us figure out what to do, including figuring out how to align more powerful systems as needed.
yep agreed, I have a bunch of vague plans in this direction. I most generally think that AI control is a pretty good tool in the toolbox, and is unlikely to make things much worse but plausibly makes things much better.
I think AI control agendas are defined in such a way that this metric isn’t as relevant as you think it is:
I think it is unlikely for control work to buy humanity much time until someone builds a very powerful unaligned AI system, at least at our present levels of coordination tools.
Because the agenda isn’t trying to make AIs alignable, but to make them useful and keep them from breaking out of labs, the question of the timeline to unaligned AI is less relevant than it is for most methods of making safe AI.
This topic is important enough that you could consider making a full post.
My belief is that this would improve reach, and also make it easier for people to reference your arguments.
Consider, you believe there is a 45% chance that alignment researchers would be better suited pivoting to control research. I do not suspect a quick take will reach anywhere close to that number, and has a low chance of catalysing dramatic, institutional level change.
Yes, I agree. If I had more time, this would have been a top-level post. If anyone reading wants to write such a post using my quick take as a base, I would be happy to take a look and offer comments. I might do it myself at some point as well.
Just make it a full post without doing much if any editing, and link to this quick take and its comments when you do. A polished full post is better than an unpolished one, but an unpolished one is better than none at all.
That’s fair. To be honest I’ve only used AI for writing code, I merely heard about other people having success with AI drafts. Maybe their situation was different, or they were bad at English to the point that AI writes better than them.
quick analogy: if the sum of a bunch of numbers is large, there doesn’t need to be any individual number that is large; similarly, if a sequence of actions results in a large change, no individual action needs to be “pivotal”
This feels like a pretty central cruxy point—and not just for the relevance of the pivotal act framing specifically. I think it’s underlying a whole difference of worldview or problem-solving approach.
A couple other points in a similar direction:
A thing I noticed in our discussion on the model delta with Christiano post: your criterion for useful AI safety work seems to be roughly “this will contribute some marginal value” as opposed to “this will address a bottleneck”.
Right at the top of this thread, you say: “I think Redwood Research’s recent work on AI control really ‘hits it out of the park’, and they have identified a tractable and neglected intervention that can make AI go a lot better”. Note what’s conspicuously missing there: tractable and neglected, but you don’t claim importance.
I would say that your mindset, when approaching AI safety, seems to be an instance of the epsilon fallacy.
Sure, in principle a sum of numbers can be large without any individual number being large. In practice, the 80⁄20 rule is a thing, and everything has bottlenecks all the time. If work is not addressing a bottleneck, then it’s approximately useless.
(Somewhat more precisely: if marginal work is not addressing something which is a bottleneck on current margins, then it’s approximately useless.)
Of importance, tractability, and neglectedness, importance is the most important. In practice, it is usually better to have a thousand people trying to solve a major bottleneck each with low chance of success, than a thousand people making tractable progress on some neglected issue which is not a bottleneck.
I think I disagree with your model of importance. If your goal is to make a sum of numbers small, then you want to focus your efforts where your effort shrinks the sum the most (i.e. where the derivative with respect to effort is largest in magnitude), not where the absolute magnitude is highest.
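As a toy illustration (numbers made up): suppose total risk is $x_1 + x_2$ with $x_1 = 0.4$ and $x_2 = 0.05$, and a marginal unit of effort can reduce either $x_1$ by $0.001$ or $x_2$ by $0.02$. Then

$$\frac{0.02}{0.001} = 20,$$

so the marginal unit of effort is 20x more valuable spent on the smaller term, even though $x_1$ is 8x larger.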
The “epsilon fallacy” can be committed in both directions: both in assuming that anything with a negative derivative is worth working on, and in assuming that any extremely large number is worth taking a chance on trying to improve.
I also separately think that “bottleneck” is not generally a good term to apply to a complex project with high amounts of technical and philosophical uncertainty. The ability to see a “bottleneck” is very valuable should one exist, but I am skeptical of the ability to strongly predict where such bottlenecks will be in advance, and do not think the historical record really supports the ability to find such bottlenecks reliably by “thinking”, as opposed to doing a lot of stuff, including trying things and seeing what works. If you have a broad distribution over where a bottleneck might be, then all activities lend value by “derisking” locations for particular bottlenecks if they succeed, and providing more evidence that a bottleneck is in a particular location if they fail. (kinda like: https://en.wikipedia.org/wiki/Swiss_cheese_model) For instance, I think of “deceptive alignment” as a possible way to get pessimal generalization, and thus a probabilistic “bottleneck” to various alignment approaches. But there are other ways things can fail, and so one can still lend value by solving non-deceptive-alignment related problems (although my day job consists of trying to get “benign generalization” out of ML, and thus does in fact address that particular bottleneck imo).
I also separately think that if someone thinks they have identified a bottleneck, they should try to go resolve it as best they can. I think of that as what you (John) are doing, and fully support such activities, although I think I am unlikely to join your particular project. I think the questions you are trying to answer are very interesting ones, and the “natural latents” approach seems likely to shed at least some light on what’s going on with e.g. the ability of agents to communicate at all.
I do think that “we don’t have enough information to know where the bottlenecks are yet” is in-general a reasonable counterargument to a “just focus on the bottlenecks” approach (insofar as we in fact do not yet have enough information). In this case I think we do have enough information, so that’s perhaps a deeper crux.
Hilariously, it seems likely that our disagreement is even more meta, on the question of “how do you know when you have enough information to know”, or potentially even higher, e.g. “how much uncertainty should one have given that they think they know” etc.
I agree it is better to work on bottlenecks than non-bottlenecks. I have high uncertainty about where such bottlenecks will be, and I think sufficiently low amounts of work have gone into “control” that it’s obviously worth investing more, because e.g. I think it’ll let us get more data on where bottlenecks are.
The intelligence explosion might happen with less-fully-AGI AIs, who will also be doing some alignment work on the side. It’s important for them to not escape and do other bad stuff until they’ve solved alignment. We can give ourselves more time to use smart AIs to help with alignment if we have better AI control.
Well, this would be the lone crux. The rest of the stuff you wrote is about non-exploding AI, right? And is therefore irrelevant to the thing about everyone dying, except insofar as controlled non-exploding AI can help prevent uncontrolled exploding AI from killing everyone?
I directionally agree with this (and think it’s good to write about this more, strongly upvoted!)
For clarity, I would distinguish between two control-related ideas more explicitly when talking about how much work should go into what area:
“ensuring that if the AIs are not aligned [...], then you are still OK” (which I think is the main meaning of “AI control”)
Making ~worst-case assumptions about things like neural representations or inductive biases (which in practice means you likely rely on black-box methods, as in Redwood’s existing work on control).
I think 2. is arguably the most promising strategy for 1., but I’ve occasionally noticed myself conflating them more than I should.
1. gives you the naive 50⁄50 equilibrium, i.e. 50% of people should naively work on this broad notion of control. But I think other reasons in favor apply more strongly to 2. (e.g. the tractability arguments are significantly weaker for model internals-based approaches to 1.)
I also think (non-confidently) that 2. is what’s really very different from most existing research. For control in the first, broad sense, some research seems less clearly on either the control or alignment side.
But I do agree that safety-motivated researchers should evaluate approaches from a control perspective (in the broad sense) more on the margin. And I also really like the narrower black-box approach to control!
I agree and regret focusing as much as we did on 2 in the past; I’m excited for work on “white box control” (there’s some under way, and I’m excited for more).
We focused on black box control because it’s much more obvious how you’d make a legible and conservative safety case for it. But in hindsight this was a mistake, I think: people probably aren’t going to actually make legible or conservative safety arguments for their deployments, so there’s less of an advantage than I’d thought.
I plan to write more about white box control soon (but I plan to write about a lot of stuff…)
This seems to presume that you can divide up research topics into “alignment” vs “control” but this seems wrong to me. E.g. my categorization would be something like:
Clearly alignment: debate theory, certain flavors of process supervision
Clearly control: removing affordances (e.g. “don’t connect the model to the Internet”)
Could be either one: interpretability, critique models (in control this is called “untrusted monitoring”), most conceptions of ELK, generating inputs on which models behave badly, anomaly detection, capability evaluations, faithful chain of thought, …
Redwood (I think Buck?) sometimes talks about how labs should have the A-team on control and the B-team on alignment, and I have the same complaint about that claim. It doesn’t make much sense for research, most of which helps with both. It does make sense as a distinction for “what plan will you implement in practice”—but labs have said very little publicly about that.
Other things that characterize work done under the name of “control” so far are (1) it tries to be very concrete about its threat models, to a greater degree than most other work in AI safety, and (2) it tries to do assurance, taking a very worst case approach. Maybe you’re saying that people should do those things more, but this seems way more contentious and I’d probably just straightforwardly disagree with the strength of your recommendation (though probably not its direction).
Nitpick: I would also quibble with your definitions; under your definitions, control seems like a subset of alignment (the one exception is if you notice the model is scheming and then simply stop using AI). I think you really have to define alignment as models reliably doing what you want independent of the surrounding context, or talk about “trying to do what you want” (which only makes sense when applied to models, so has similar upshots).
Tbc I like control and think more effort should be put into it; I just disagree with the strength of the recommendation here.
This seems to presume that you can divide up research topics into “alignment” vs “control” but this seems wrong to me.
Agreed. (And I made a comment to this effect on Mark’s draft of this short form.) I don’t think it is that productive to classify research directions, the central thing we’re trying to point at is a classification for reasons why you might think that you’re fine where one reason is “the AI couldn’t do bad stuff without getting caught” and another reason is “the AI wouldn’t do bad stuff” (obviously simplifying somewhat).
I also think that white-box control should typically be considered pretty separately from black-box control and the story for it is often pretty different. (Though when thinking about white-box control you should often adopt the central control perspective, e.g., you should incorporate white-box runtime detection techniques (like probes) into an overall control evaluation via exogenously trying to determine the probe reliability and other important aspects of the probe. For instance, the probe can take advantage of black box strategies and how safe we are depends on how many actions AIs need to cause huge problems.)
The claims I’d like to make are:
There should be more research specifically targeting a black-box control story and which is operating from a black-box control perspective. (I’d guess like 20% of the overall AI safety portfolio should be this and maybe 30% of safety work at labs. This is not including work like “get AIs to be more useful at alignment research and get better at DC eval methodology” which should also be subsidized from a control perspective but seem non-central.)
When people do research which could be pretty useful from a black-box control perspective (e.g. debate or other scalable oversight style research), they should sometimes explicitly think about the control perspective and consider adjusting their research accordingly (to the extent they can understand this perspective well enough). To the extent that people have an alignment style story they think about while doing their research, they should be roughly equally excited about thinking about the control story. I also find that the control story is often more straightforward.
People thinking about safety plans for near term powerful AI should strongly consider control.
Redwood (I think Buck?) sometimes talks about how labs should have the A-team on control and the B-team on alignment, and I have the same complaint about that claim. It doesn’t make much sense for research, most of which helps with both. It does make sense as a distinction for “what plan will you implement in practice”—but labs have said very little publicly about that.
You’re potentially thinking about footnote 4 in our post on control:
As one operationalization, suppose that you had two teams at your AI lab, one of which was much more competent, and you had to assign one to be in charge of minimizing the probability that the AIs you’re about to train and deploy were scheming, and the other to be in charge of ensuring control. To address risk due to powerful AI in the next 8 years, we think it would be better to assign the more competent team to ensuring control, because we think that the quality of control interventions is more influential on this risk.
This footnote is a bit confusingly worded, but I think the situation we were trying to say is “Suppose you expect dangerously powerful AI in the next year, and the current year is prior to 2032. We’re guessing you should put the better team on control.”
This is different than research in the run up.
I also think that to the extent people are trying to do backchained research focused on specific applications, it makes sense to put the better team on control over reducing the chance that scheming arises. (But these aren’t the only classes of interventions and some interventions don’t nicely fit into these buckets, e.g., you can do work on differentially making AIs more useful for alignment work which isn’t well classified as either and you can work on high level interpretability which aims to roughly understand how AIs make decisions in some cases (this high-level interp doesn’t clearly help with reducing the chance that scheming arises very directly, but could help with a bunch of stuff).)
I agree with bullet points 1, 2, 3, 6 and 7, partially agree with bullet point 5, and disagree with bullet point 4.
Thus, I agree with the central claim here:
I think Redwood Research’s recent work on AI control really “hits it out of the park”, and they have identified a tractable and neglected intervention that can make AI go a lot better. Obviously we should shift labor until the marginal unit of research in either area decreases P(doom) by the same amount. I think that implies lots of alignment researchers should shift to AI control type work, and would naively guess that the equilibrium is close to 50⁄50 across people who are reading this post. That means if you’re working on alignment and reading this, I think there’s probably a ~45% chance it would be better for your values if you instead were working on AI control!
For more of my analysis on the bullet points, read the rest of the comment.
For bullet point 1, I basically agree with this, mostly because I don’t favor binary assumptions and instead prefer continuous quantity reasoning, which tends to be a better match for the real world and also gives you more information than binary outcomes.
I really like bullet point 2, and also think that even in a scenario where it’s easy to prevent defection, you should still have controls that make defecting employees have much less reward and much more punishment for subversive actions.
I deeply agree with point 3, and I’d frame AI control in one of 2 ways:
As a replacement for the pivotal act concept.
As a pivotal act that doesn’t require destruction or death, and doesn’t require you to overthrow nations in your quest.
A nitpick: AI labor will be the huge majority of alignment progress in every stage, not just the early stage.
I think one big reason the pivotal act frame dominated a lot of discussions is the assumption that we would get a pure software singularity which would FOOM in several weeks, but reality is shaping up to not be a pure software-singularity, since physical stuff like robotics and data centers still matters.
There’s a reason why every hyperscaler is trying to get large amounts of power and datacenter compute contracts, because they realize that the singularity is bottlenecked currently on power and to a lesser extent compute.
I disagree with 4, but that’s due to my views on alignment, which tend to view it as a significantly easier problem than the median LWer does; in particular I see essentially zero need for philosophical deconfusion to make the future go well.
I agree that AI control enhances alignment arguments universally, and provides more compelling stories. I disagree with the assumption that all alignment plans depend on dubious philosophical assumptions about the nature of cognition/SGD.
I definitely agree with bullet point 6 that superhuman savant AI could well play a big part in the intelligence explosion, and I believe this most for formal math theorem provers/AI coders.
Agree with bullet point 7, and think it would definitely be helpful if people focused more on “how can this be helpful even if it fails to produce an aligned AI?”
I feel like our viewpoints have converged a lot over the past couple years Noosphere. Which I suppose makes sense, since we’ve both been updating on similar evidence!
The one point I’d disagree with, although also wanting to point out that the disagreement seems irrelevant to short term strategy, is that I do think that philosophy and figuring out values is going to be pretty key in getting from a place of “shaky temporary safety” to a place of “long-term stable safety”.
But I think our views on the sensible next steps to get to that initial at-least-reasonable-safety sound quite similar.
Since I’m pretty sure we’re currently in a quite fragile place as a species, I think it’s worth putting off thinking about long term safety (decades) to focus on short/medium term safety (months/years).
I would suggest 50% of researchers working on a broader definition of control: including “control”, technical governance work and technical outreach (scary demos, model organisms of misalignment).
I’m in the process of trying to build an org focused on “automated/augmented alignment research.” As part of that, I’ve been thinking about which alignment research agendas could be investigated in order to make automated alignment safer and trustworthy. And so, I’ve been thinking of doing internal research on AI control/security and using that research internally to build parts of the system I intend to build. I figured this would be a useful test case for applying the AI control agenda and iterating on issues we face in implementation, and then sharing those insights with the wider community.
Would love to talk to anyone who has thoughts on this or who would introduce me to someone who would fund this kind of work.
I don’t see a significant difference in your distinction between alignment and control. If you say alignment is about doing what you want (which I strongly disagree with in its generality, e.g. when someone might want to murder or torture people or otherwise act unethically), that obviously includes your wanting to “be OK” when the AI didn’t do exactly what you want. Alignment comes in degrees, and you merely seem to equate control with non-perfect alignment and alignment with perfect alignment. Or I might be misunderstanding what you have in mind.
In this post, we argue that AI labs should ensure that powerful AIs are controlled. That is, labs should make sure that the safety measures they apply to their powerful models prevent unacceptably bad outcomes, even if the AIs are misaligned and intentionally try to subvert those safety measures. We think no fundamental research breakthroughs are required for labs to implement safety measures that meet our standard for AI control for early transformatively useful AIs; we think that meeting our standard would substantially reduce the risks posed by intentional subversion.
I disagree with this take. A.I. control will only be important in a multipolar situation in which no single A.I. system can create a gray goo catastrophe etc. But if such pivotal acts are impossible and no singular A.I. takes control, but instead many A.I.’s are competing, then some groups will develop better or worse control for economic reasons and it won’t affect existential risk much to work on it now. I don’t think I can see a situation where control matters—only a few players have A.G.I. for a very long time and none escape or are open sourced but also none gain a decisive advantage?
I do see advantages to hardening important institutions against cyberattacks and increasing individual and group rationality so that humans remain agentic for as long as possible.
I think the key story for wins from AI control specifically is a scenario where we have human-level to slightly superhuman AI inside a lab that isn’t aligned and wants to take over, but it turns out that it’s easier to control what AI affordances are given than it is to align an AI, and in particular it’s easier to catch an AI scheming than it is to make it aligned, and the lab wants to use AIs for alignment/control research.
I don’t see this as a probable scenario, but I do see it as a valuable scenario to work on, so it does have value in my eyes.
Imagine that there are just a few labs with powerful A.I., all of which are responsible enough to use existing A.I. control strategies which have been prepared for this situation, and none of which open source their models. Now if they successfully use their A.I. for alignment, they will also be able to successfully use it for capabilities research. At some point, control techniques will no longer be sufficient, and we have to hope that by then A.I. aided alignment has succeeded enough to prevent bad outcomes. I don’t believe this is a serious possibility; the first A.I. capable of solving the alignment problem completely will also be able to deceive us about solving the alignment problem (more) easily—up to and including this point, A.I. will produce partial, convincing solutions to the alignment problem which human engineers will go forward with. Control techniques will simply raise the capability threshold of the first unaligned A.I. that escapes, which is plausibly a net negative since it means we won’t have early high impact warnings. If occasional A.I. escapes turn out to be non-lethal, economic incentives will favor better A.I. control, so working on this early won’t really matter. If occasional A.I. escapes turn out to be lethal, then we will die unless we solve the alignment problem ourselves.
One assumption that could be used to defuse the danger is if we can apply the controlled AIs to massively improve computer security such that computer security wins the race over attacking the computers.
This is at least semi-plausible for the first AIs, which will almost certainly be wildly superhuman at both coding and proving theorems in, for example, Lean, because there’s a very obvious path to getting the training data to bootstrap yourself, and, like Go, it is relatively easy to verify that your solution actually works.
Another assumption that could work for AI control is that once the AIs are controlled enough, labs start using the human-level AIs to further enhance control/alignment strategies, and thus from the base case we can inductively argue that each next level of AIs is even better controlled, until you reach a limit, with the new capabilities used to make the AI safer, not more dangerous.
Improving computer security seems possible but there are many other attack vectors. For instance, even if an A.I. can prove a system’s software is secure, it may choose to introduce social engineering style back doors if it is not aligned. It’s true that controlled A.I.’s can be used to harden society but overall I don’t find that strategy comforting.
I’m not convinced that this induction argument goes through. I think it fails on the first generation that is smarter than humans, for basically Yudkowskian reasons.
Hard to be sure without more detail, but your comment gives me the impression that you haven’t thought through the various different branches of how AI and geopolitics might go in the next 10 years.
I, for one, am pretty sure AI control and powerful narrow AI tools will both be pretty key for humanity surviving the next 10 years. I don’t expect us to have robustly solved ASI-alignment in that timeframe.
I also don’t expect us to have robustly solved ASI-alignment in that timeframe. I simply fail to see a history in which AI control work now is a decisive factor. If you insist on making a top level claim that I haven’t thought through the branches of how things go, I’d appreciate a more substantive description of the branch I am not considering.
This is a good and important point. I don’t have a strong opinion on whether you’re right, but one counterpoint: AI companies are already well-incentivized to figure out how to control AI, because (as Wei Dai said) controllable AI is more economically useful. It makes more sense for nonprofits / independent researchers to do work that AI companies wouldn’t do otherwise.
This post raises an important perspective on the practicalities of AI Control versus Alignment. Given the potential for AI to function productively even when not fully aligned, do you believe that current AI control methods are scalable enough to handle future AGI systems? Additionally, what would be the main challenges in ensuring that AI control strategies are robust in highly dynamic or emergency scenarios?
AI safety researchers might be allocated too heavily to Anthropic compared to Google DeepMind
Some considerations:
Safety researchers should want Google DeepMind (GDM) to have a robust and flourishing safety department. It seems plausible that GDM will be able to create “the smartest” models: they have lots of talent, and own lots of computers. (see e.g. https://epochai.org/data/notable-ai-models#computing-capacity)
Anthropic (ANT) might run into trouble in the future due to not owning their own computers, e.g. if Amazon (or wherever they’re renting their computers from) starts their own internal scaling competitor, and decides to stop renting out most of their compute.
ANT has a stronger safety culture, and so it is a more pleasant experience to work at ANT for the average safety researcher. This suggests that there might be a systematic bias towards ANT that pulls away from the “optimal allocation”.
GDM only recently started a Bay Area-based safety research team/lab (with members like Alex Turner). So if people had previously decided to work for ANT based on location, they now have the opportunity to work for GDM without relocating.
I’ve heard that many safety researchers join ANT without considering working for GDM, which seems like an error, although I don’t have 1st hand evidence for this being true.
ANT vs GDM is probably a less important consideration than “scaling lab” (ANT, OAI, GDM, XAI, etc.) vs “non scaling lab” (USAISI, UKAISI, Redwood, ARC, Palisade, METR, MATS, etc. (so many...)). I would advise people to think hard about how joining a scaling lab might inhibit their future careers by e.g. creating a perception they are “corrupted” [edit: I mean viewed as corrupted by the broader world in situations where e.g. there is a non-existential AI disaster or there is rising dislike of the way AI is being handled by corporations more broadly, e.g. similar to how working for an oil company might result in various climate people thinking you’re corrupted, even if you were trying to get the oil company to reduce emissions, etc. I personally do not think GDM or ANT safety people are “corrupted”] (in addition to strengthening them, which I expect people to spend more time thinking about by default).
Because ANT has a stronger safety culture, doing safety at GDM involves more politics and navigating bureaucracy, and thus might be less productive. This consideration applies most if you think the impact of your work is mostly through the object level research you do, which I think is possible but not that plausible.
(Thanks to Neel Nanda for inspiring this post, and Ryan Greenblatt for comments.)
ANT has a stronger safety culture, and so it is a more pleasant experience to work at ANT for the average safety researcher. This suggests that there might be a systematic bias towards ANT that pulls away from the “optimal allocation”.
I think this depends on whether you think AI safety at a lab is more of an O-ring process or a swiss-cheese process. Also, if you think it’s more of an O-ring process, you might be generally less excited about working at a scaling lab.
Centralization might actually be good if you believe there are compounding returns to having lots of really strong safety researchers in one spot working together, e.g. in terms of having other really good people to work with, learn from, and give you feedback.
My guess would be that Anthropic resources its safety teams substantially more than GDM in terms of e.g. compute per researcher (though I’m not positive of this).
I think the object-level research productivity concerns probably dominate, but if you’re thinking about influence instead, it’s still not clear to me that GDM is better. GDM is a much larger, more bureaucratic organization, which makes it a lot harder to influence. So influencing Anthropic might just be much more tractable.
is it actually tractable to affect Deepmind’s culture and organizational decisionmaking
how close to the threshold is Anthropic for having a good enough safety culture?
My current best guess is that Anthropic is still under the threshold for good enough safety culture (despite seeming better than I expected in a number of ways), and meanwhile that Deepmind is just too intractably far gone.
I think people should be hesitant to work at any scaling lab, but, I think Anthropic might be possible to make “the one actually good scaling lab”, and I don’t currently expect that to be tractable at Deepmind and I think “having at least one” seems good for the world (although it’s a bit hard for me to articulate why at the moment)
I am interested in hearing details about Deepmind that anyone thinks should change my mind about this.
This viewpoint is based on having spent at least tens of hours trying to learn about and influence both orgs’ cultures, at various times.
In both cases, I don’t get the sense that people at the orgs really have a visceral sense that “decisionmaking processes can be fake”; I think those processes will be fake by default and the org is better modeled as following general incentives, and DeepMind has too many moving people and moving parts at a low enough density that it doesn’t seem possible to fix. For me to change my mind about this, I would need someone there to look me in the eye and explain that they do have a visceral sense of how organizational decisionmaking processes can be fake, and why they nonetheless think DeepMind is tractable to fix. I assume it’s hard for @Rohin Shah and @Neel Nanda to say anything publicly that’s capable of changing my mind, for various confidentiality and political reasons, but, like, that’s my crux.
(convincing me in more general terms that “Ray, you’re too pessimistic about org culture” would hypothetically somehow work, but, you have a lot of work to do given how thoroughly those pessimistic predictions came true about OpenAI)
I think Anthropic also has this problem, but it has enough almost-aligned leadership and actually-pretty-aligned people that it feels at least possible to me for them to fix it. The main things that would persuade me that they are over the critical threshold are if they publicly spent social capital on clearly spelling out why the x-risk problem is hard, and made explicit plans to not merely pause for a bit when they hit an RSP threshold, but (at least in some circumstances) advocate strongly for a global, government-enforced shutdown for like 20+ years.
I think your pessimism of org culture is pretty relevant for the question of big decisions that GDM may make, but I think there is absolutely still a case to be made for the value of alignment research conducted wherever. If the research ends up published, then the origin shouldn’t be held too much against it.
So yes, having a few more researchers at GDM doesn’t solve the corporate race problem, but I don’t think it worsens it either.
It might be “fine” to do research at GDM (depending on how free you are to actually pursue good research directions, or how good a mentor you have). But, part of the schema in Mark’s post is “where should one go for actively good second-order effects?”.
I largely agree with this take & also think that people often aren’t aware of some of GDM’s bright spots from a safety perspective. My guess is that most people overestimate the degree to which ANT>GDM from a safety perspective.
For example, I think GDM has been thinking more about international coordination than ANT. Demis has said that he supports a “CERN for AI” model, and GDM’s governance team (led by Allan Dafoe) has written a few pieces about international coordination proposals.
ANT has said very little about international coordination. It’s much harder to get a sense of where ANT’s policy team is at. My guess is that they are less enthusiastic about international coordination relative to GDM and more enthusiastic about things like RSPs, safety cases, and letting scaling labs continue unless/until there is clearer empirical evidence of loss of control risks.
I also think GDM deserves some praise for engaging publicly with arguments about AGI ruin and threat models.
(On the other hand, GDM is ultimately controlled by Google, which makes it unclear how important Demis’s opinions or Allan’s work will be. Also, my impression is that Google was neutral or against SB1047, whereas ANT eventually said that the benefits outweighed the costs.)
Great post. I’m on GDM’s new AI safety and alignment team in the Bay Area and hope readers will consider joining us!
I would advise people to think hard about how joining a scaling lab might inhibit their future careers by e.g. creating a perception they are “corrupted”
What evidence is there that working at a scaling lab risks creating a “corrupted” perception? When I try thinking of examples, the people that come to my mind seem to have quite successfully transitioned from working at a scaling lab to doing nonprofit / government work. For example:
Paul Christiano went from OpenAI to the nonprofit Alignment Research Center (ARC) to head of AI safety at the US AI Safety Institute.
Geoffrey Irving worked at Google Brain, OpenAI, and Google DeepMind. Geoffrey is now Chief Scientist at the UK AI Safety Institute.
Beth Barnes worked at DeepMind and OpenAI and is now founder and head of research at Model Evaluation and Threat Research (METR).
I was intending to warn about the possibility of future perception of corruption, e.g. after a non-existential AI catastrophe. I do not think anyone currently working at safety teams is perceived as that “corrupted”, although I do think there is mild negative sentiment among some online communities (some parts of twitter, reddit, etc.).
> think hard about how joining a scaling lab might inhibit their future careers by e.g. creating a perception they are “corrupted”
Does this mean something like:
1. People who join scaling labs can have their values drift, and future safety employers will suspect by-default that ex-scaling lab staff have had their values drift, or
2. If there is a non-existential AGI disaster, scaling lab staff will be looked down upon
Basically (2), very small amounts of (1) (perhaps qualitatively similar to the amount of (1) you would apply to e.g. people joining US AISI or UK AISI)
The high level claim seems pretty true to me. Come to the GDM alignment team, it’s great over here! It seems quite important to me that all AGI labs have good safety teams
What are you using the word “rationalist” to mean? If you just mean “members of any subculture bearing some line of memetic descent from early Less Wrong” (which I don’t think deserves the pretentious term rationalist, but putting that aside), why is “communication platform” a useful way to chop up that social grouping? A lot of the same people use Less Wrong and Facebook and Twitter and Discord and Google Docs, and a lot of the people who use the same platform wouldn’t be in the same cluster if you were to analyze the graph of what actual conversations people are using these platforms to have.
It’s a natural way to cut it up from one’s own experience. Each platform has different affordances and brings out different aspects of people, and I get pretty different experiences of them on the different platforms mentioned.
John Flanagan: “An ordinary archer practices until he gets it right. A ranger practices until he never gets it wrong.”
I want to reword this to make it about rationality in a way that isn’t pretentious.
Cavilo, The Vor Game: “The key to strategy… is not to choose a path to victory, but to choose so that all paths lead to a victory.” is close to what I want, but not quite.
Quarantine preparation has made me realize that a day’s worth of food is actually really cheap, doesn’t require that much time to cook, and can be made fairly tasty for not much more, i.e. a day’s worth of easy-to-cook, relatively tasty food is about $5.
This requires some amount of amortized costs for the easy-to-cook and relatively tasty part, but not immensely large upfront costs (instantpot, spices, etc.).
This reference says that 40 million people dealt with hunger in the US. I am… super confused? I find it extremely difficult to believe that people literally couldn’t afford to buy food, so the explanation is probably something like “hunger is worth it to get dietary variety/convenience/etc.” or “trapped by local incentive gradients” or “have incentives to not save money, go hungry when hit by proverbial ‘rainy days’” or “people don’t realize how cheap food actually is”
I’m still confused though. I feel like there might be some room for someone to write some infographic that has information like “here’s a rotating course of 10 meals that are cheap and tasty, with lists of exactly what to buy on what day, how to cook everything with various types of kitchen equipment, substitutes in case the store doesn’t have various ingredients, possible variants in case you get bored”. Crucially, the infographic would have to be really good. A possible explanation is that people who might potentially have to deal with hunger don’t have the slack to plan their meals, so they don’t, and none of the existing meal plans are understandable enough, or something.
I notice I’m still confused.
Also mildly confused by why soup kitchens make complicated foods instead of simple foods, but that confusion is nearly entirely resolved by various signaling considerations.
Maybe people who struggle with hunger don’t plan a rotating course of 10 meals because they are signaling that they aren’t so poor as to have to plan their meals so meticulously. Maybe planning a rotating course of 10 meals is much harder than I think it is. Maybe I’m far below where I think I am in terms of “ability to endure eating the same food over and over again” and most people just can’t eat a rotating course of meals.
I notice I’m still confused.
I feel like I might be missing something really clear. Like something along the lines of “most people who go hungry don’t have kitchens/space to store ingredients/stable living situations/any slack at all whatsoever/something”.
It seems to me that during the quarantine I eat less than usual; either I am deluding myself, or it is a combination of having less physical activity (such as walking to/from job/lunch/shops/playground), being able to eat whenever I want (so there is no pressure to “eat a lot now, because the next opportunity will be 7 hours later”), making less superstimulating food (using less sugar and salt), and having other ways to get some hedons (e.g. taking a nap). Sometimes I cook a soup, and that’s most of my daily food.
And soups are really cheap. You take $1-2 worth of ingredients, cook them in water, add a little salt and spices; optionally eat with bread. Bread is cheap, salt is cheap, most spices are cheap (per portion); potatoes, broccoli, carrot, onion, and beans are cheap. Most of these things are like $1 per 1 kg.
Okay, soups are not super healthy; cooking destroys vitamins. You should also get some fresh fruits and vegetables. Apples, tomatoes, cucumbers are $1-2 per 1 kg. You should definitely be able to eat healthy food for less than $5 a day.
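A quick back-of-the-envelope tally of one day under roughly these prices (all quantities and prices are illustrative assumptions, not measurements):

```python
# One person's day of simple food, priced with rough ~$1-2/kg assumptions.
day = {
    "potatoes, 0.5 kg": 0.50,
    "beans, 0.2 kg": 0.40,
    "carrot + onion, 0.3 kg": 0.40,
    "bread, 0.3 kg": 0.60,
    "apples + tomatoes, 0.4 kg": 0.70,
    "salt, spices, oil (amortized)": 0.40,
}
print(round(sum(day.values()), 2))  # ~3.0, comfortably under $5/day
```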
What is expensive? Chocolate and other sweets, cheese, berries, nuts; I probably forgot something. You shouldn’t eat sweets; and you can afford the other things now and then even under $5 a day on average. (It is not an optimal diet; some people recommend eating berries and nuts every day. But still healthier than many people eat, including those who spend more money on food.)
On the other hand, we usually spend more than $5 per person per day, even during the quarantine. We spend a lot on sweets and cheese. The tastier ones are even more expensive than the basic ones, which are already more expensive than the actually useful food. Instant gratification—it’s addictive! The more stress I have during the day, the more I need something that will improve my mood instantly, even if it’s only for a moment.
Poor people probably have more stress, and thus less willpower to resist things full of sugar and salt. (Also alcohol and tobacco. Okay, the last one is technically not food, but it still comes from the same budget.)
Very poor people, e.g. the homeless, don’t have the place to cook. So many cheapest things are ironically out of their reach. Not having a fridge is also a problem.
Then, I assume many poor people don’t have good skills and habits. Some of them don’t have the necessary IQ, some have mental problems, some had a shitty upbringing.
1) The claim that 40 million Americans “deal with hunger” is, um, questionable. Their citation leads to feedingamerica.org, which cites USDA’s Household Food Security in the United States report (https://www.ers.usda.gov/webdocs/publications/94849/err-270.pdf?v=963.1). The methodology used is an 11-question survey (18 for households with children), where answering 3 questions in the affirmative marks you as low food security. The questions asked are (naturally) subjective. Even better, the first question is this: “We worried whether our food would run out before we got money to buy more.” Was that often, sometimes, or never true for you in the last 12 months? That’s a real concern to have, but it is not what people are talking about when they say “dealing with hunger”. You can be running on a shoestring budget and often worry about whether you’ll have enough money for food without ever actually not having enough money for food.
2) A significant percentage of the population has non-trivial issues with executive function. Also, most of the population isn’t familiar with “best practices” (in terms of effective life strategies, basic finances, etc). Most people simply don’t think about things like this systematically, which is how you get the phenomenon of ~50% of the population not being able to cover a $400 emergency (or whatever those numbers are, they’re pretty close). This would be less of an issue if those cultural norms were inherited, but you can’t teach something you don’t know, and apparently we don’t teach Home Economics anymore (not that it’d be sufficient, but it would be better than nothing). This is a subject that deserves a much more in-depth treatment, but I think as a high-level claim this is both close enough to true and sufficient as a cause for what we might observe here. Making an infographic with a rotating course of 10 cheap, easy-to-prepare, relatively healthy, and relatively tasty meals is a great idea, but it’ll only be useful to the sorts of people who already know what “meal prep” means. You might catch some stragglers on the margin, but not a lot.
3) The upfront costs are less trivial than they appear if you don’t inherit any of the larger items, and remember, 50% of the population can’t cover a mid-3-figure emergency. “Basic kitchen equipment” can be had for under $100, but “basic kitchen equipment” doesn’t necessarily set you up to prepare food in a “meal prep” kind of way.
2) is something that I sort of thought about but not with as much nuance. I agree that such an infographic would be only useful for people who were looking for an alternate meal preparation strategy or something.
3) If it’s true that people want to do meal-preppy type things but don’t have enough to pay the upfront costs, there might be gains from 0-interest microloans, maybe via some MLM-type scheme: I loan you money, then once you’ve saved some money and paid me back, you loan other people money too.
It seems like the bottom 20% of the US spends $2216 per year per income earner, or ~$6 per day, on food. Given that children themselves don’t have an income, they might spend less than $5 per person per day on food.
People can drown in a river that’s on average 1m deep.
If you have DAI right now, minting on https://foundry.finance/ and swapping yTrump for nTrump on catnip.exchange is an almost guaranteed 15% profit.
Your AI doesn’t figure out how to do a reasonable “values handshake” with a competitor (where two agents agree to both pursue some appropriate compromise values in order to be Pareto efficient)...
I think it refers to something like this: Imagine that a superintelligent human-friendly AI meets a superintelligent paperclip maximizer, and they both realize their powers are approximately balanced. What should they do?
For humans, “let’s fight, and to the victor go the spoils” is the intuitive answer, but the superintelligences can possibly do better. If they fight, they have a 50% chance of achieving nothing, and a 50% chance of winning the universe… minus whatever was sacrificed to Moloch, which could possibly be a lot. If they split the universe into halves, and find a way to trust each other, that is better than war. But there is a possibility of an even better solution, where both of them agree to act as if they were a single superintelligence that values both humans and paperclips equally.
The cooperative solution can be better than a 50% split of the universe, because you could build paperclip factories in places humans care less about, such as uninhabitable planets; or perhaps you could find a way to introduce paperclips into the human environment without reducing the human quality of life. For example, would you mind using paperclips to reinforce the walls of your house? Would you mind if almost all materials used to build stuff for humans contained little paperclips inside? Would you mind living in a simulation implemented on paperclip-shaped circuits? So maybe in the end, humans could get like 70% of the potential utility of the universe, while 70% of the potential material would be converted to paperclips.
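A toy expected-utility version of this comparison (numbers purely illustrative, with each side’s utility normalized so that owning the whole universe is worth 1, and $w$ denoting the fraction of value burned by fighting):

$$
\mathbb{E}[\text{war}] = 0.5\,(1-w), \qquad \mathbb{E}[\text{50/50 split}] = 0.5, \qquad \mathbb{E}[\text{values handshake}] \approx 0.7.
$$

As long as the merged agent can realize more than half of each side’s potential, as in the 70%/70% story above, both sides prefer the handshake to either fighting or a plain split.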
A weird but not-inaccurate way to think of log(n) is as an answer to “how many digits does n have?”
This suggests that a weird but not-inaccurate way to think of a log-normal distribution is as a distribution where “the number of digits is normally distributed”
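A quick sanity check of that framing (the parameters here are arbitrary assumptions, chosen only so the samples have around five digits):

```python
import numpy as np

rng = np.random.default_rng(0)

# If log10(X) is Normal(mu, sigma), then X is log-normally distributed,
# and floor(log10(X)) + 1 -- the digit count -- is a discretized normal.
mu, sigma = 4.0, 1.0
x = 10 ** rng.normal(mu, sigma, size=100_000)

digits = np.floor(np.log10(x)).astype(int) + 1
values, counts = np.unique(digits, return_counts=True)
for v, c in zip(values, counts):
    print(v, round(c / len(digits), 3))  # roughly a bell curve centered on 5 digits
```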
I was answering a bunch of questions from Open Phil’s calibration test of the form “when did <thing> happen?”. A lot of the time, I had no knowledge of <thing>, so I gave a fairly large confidence interval as a “maximum ignorance” type prediction (1900-2015, for example).
However, the fact that I have no knowledge of <thing> is actually moderate evidence that it happened “before my time”.
Example: “when did <person> die?” If I was alive when <person> died, there’s a higher chance of me hearing about their death. Thus not having heard of <person> is evidence that they died some time ago.
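A rough sketch of what this update could look like numerically; the flat prior, the “born” year, and the chances of having heard about a death are all made-up assumptions, just to show the direction and rough size of the shift:

```python
import numpy as np

# Flat "maximum ignorance" prior over the year of death, 1900-2015.
years = np.arange(1900, 2016)
prior = np.ones(len(years)) / len(years)

born = 1990  # assumed start of "my time"
# Assumed chance I'd have heard about the death, given the year it happened.
p_heard = np.where(years >= born, 0.6, 0.2)

# Condition on "I haven't heard of this person's death".
posterior = prior * (1 - p_heard)
posterior /= posterior.sum()

def interval(p, lo=0.05, hi=0.95):
    c = np.cumsum(p)
    return years[np.searchsorted(c, lo)], years[np.searchsorted(c, hi)]

print(interval(prior))      # (1905, 2010)
print(interval(posterior))  # (1905, 2005): the upper end shifts earlier
```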
Technically, it could also be evidence that you are dead, but your ghost cannot move to afterlife, probably because it is too attached to scoring internet points (a fate that awaits many of us, I am afraid).
If you’re interviewing employees of a company about how good the company is, there’s positive bias because people who hate the company will have already left.
Sure. Also, current employees are dis-incented from being truthful about the bad parts. But you’re not applying statistics to the results, so that’s not terribly important. Such interviews provide limited evidence about goodness of the company. They provide decent evidence about the potential coworkers you’re interviewing.
Generally, when you’re interviewing employees of a company about whether the company is any good, you’re trying to decide whether to work there yourself. And you’re evaluating whether any of them seem competent and interesting enough that you can tolerate being near them for any length of time.
Coinfection rates of COVID and normal flu are very low. If you have the set of flu/COVID symptoms, you’re basically guaranteed to have one or the other. You can test for the flu pretty easily. Therefore, people can just test for the flu as a proxy for testing for COVID.
Is this just a really obvious chain of reasoning that everyone has missed? Which one of my assumptions is wrong?
So either one of my assumptions above is wrong, or it really is the case that if you have the set of flu/COVID symptoms, you’re basically guaranteed to have either flu or COVID.
Maybe the tests are only useful for people who don’t have symptoms, but if that’s not the case, then the flu test provides a lot of evidence as to whether or not someone has COVID (even if “basically guaranteed” is replaced with “probable”).
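A rough Bayes sketch of how much a negative flu test could move the needle; every number here is a made-up assumption, chosen only to illustrate the shape of the update, not an estimate of real prevalence or test accuracy:

```python
# Assumed P(cause | flu-like symptoms): COVID, flu, or something else entirely.
p_covid, p_flu, p_other = 0.30, 0.60, 0.10

# Assumed flu-test characteristics.
sensitivity = 0.70      # P(test positive | flu)
false_positive = 0.05   # P(test positive | no flu)

# Likelihood of a NEGATIVE flu test under each cause.
p_neg_given_covid = 1 - false_positive
p_neg_given_flu = 1 - sensitivity
p_neg_given_other = 1 - false_positive

p_neg = (p_covid * p_neg_given_covid
         + p_flu * p_neg_given_flu
         + p_other * p_neg_given_other)

p_covid_given_neg = p_covid * p_neg_given_covid / p_neg
print(round(p_covid_given_neg, 2))  # ~0.51, up from 0.30 under these assumptions
# The update is real but modest, mostly because flu tests miss many true flu cases.
```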
Update: the CDC does advise testing for the flu, and there are a lot of medical conditions that cause “flu-like” symptoms. It turns out that “flu-like” symptoms basically mean “immune system doing things”, which is going to happen with most things your body doesn’t like.
Moral uncertainty is a thing that people think about. Do people also think about decision theoretic uncertainty? E.g. how to decide when you’re uncertain about which decision theory is correct?
Decision theoretic uncertainty seems easier to deal with, because you’re trying to maximise expected value in each case, just changing what you condition on in calculating the expectation. So you can just take the overall expectation given your different credences in the decision theories.
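A minimal way to write that down (one natural formalization, not a settled procedure): weight each theory’s verdict by your credence in it,

$$
V(a) \;=\; \sum_i P(\text{theory}_i)\,\mathbb{E}_{\text{theory}_i}[U \mid a],
$$

and pick the action maximizing $V(a)$. The theories only disagree about how the conditional expectation is computed (e.g. causal vs. evidential conditioning), so the uncertainty enters as ordinary weights; the hidden assumption is that the theories’ value scales are comparable, which is its own can of worms.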
Alignment researchers should think hard about switching to working on AI Control
I think Redwood Research’s recent work on AI control really “hits it out of the park”, and they have identified a tractable and neglected intervention that can make AI go a lot better. Obviously we should shift labor until the marginal unit of research in either area decreases P(doom) by the same amount. I think that implies lots of alignment researchers should shift to AI control type work, and would naively guess that the equilibrium is close to 50⁄50 across people who are reading this post. That means if you’re working on alignment and reading this, I think there’s probably a ~45% chance it would be better for your values if you instead were working on AI control!
For this post, my definitions are roughly:
AI alignment is the task of ensuring the AIs “do what you want them to do”
AI control is the task of ensuring that if the AIs are not aligned (e.g. don’t always “do what you want” and potentially want to mess with you), then you are still OK and can use them for economically productive tasks (an important one of which is doing more alignment/control research.)
Here are some thoughts, arguments, and analogies (epistemic status: there is no “hidden content”, if you don’t find the literal words I wrote persuasive you shouldn’t update. In particular, just update on the words and don’t update about what my words imply about my beliefs.):
Everything is in degrees. We can “partially align” some AIs, and things will be better if we can use those AIs for productive tasks, like helping with alignment research. The thing that actually matters is “how aligned are the AIs” + “how aligned to they need to be to use them for stuff”, so we should also focus on the 2nd thing.
If you were a hedge fund, and your strategy for preventing people from stealing your data was and starting new hedge fund was “we will make the hedge fund a super fun place to work and interview people carefully to make sure they really care about working for us”, and you also did not do things like “maybe people shouldn’t be able to download all the trading strategies onto a USB stick which we let them take home”, “maybe people should be forced to keep their work computers in the office” and “maybe we should have some people checking if other employees are trying to steal data and make some plans about how to stop them from starting a rival hedge fund if they do”, then all the other hedge funds would think you were an idiot. You should obviously try both to check if people want to defect, and also make it harder for them to defect.
I think a lot of the neglect of AI control can be traced to the “pivotal act” based framings of how to make aligned AI go well. I think these views are mistaken, but I won’t provide a great argument here (quick analogy: if the sum of a bunch of numbers is large, there doesn’t need to be any individual number that is large; similarly, if the consequences of a sequence of actions results in a large change, no individual action needs to be “pivotal”). However, I don’t think things are shaping up for single pivotal acts to be very important, and being able to productively use early AI labor to help with alignment seems like it is going to be an important part of the future going well.
Working on AI control is really tractable compared to working on alignment. There is a lot of prior art in people thinking about how to e.g. prevent data from being stolen by insider threats, secure various information against adversaries, generally ensure that national security secrets don’t get leaked, etc. We can set up various plausibly looking scenarios and try to “game them out” under certain plausibly seeming assumptions about how much alignment we’ll be able to achieve. We can make assumptions about AI capabilities and alignment without really getting into the nitty gritty of philosophically confusing topics like “what is deception really?” and “how likely is deception?” and “when will this all go down?”
Note that this means control might be easier for the AIs to automate, and easier to do “during crunch time.”
It also means that AI control is a bit contiguous with work people already do to protect from adversaries. I think the AI thing qualitatively changes the picture though, so it’s still worth thinking about.
If you succeed at AI control, there is a more compelling story you can tell for why you should implement the specific scheme that you thought of. In comparison, a lot of alignment proposals depend on philosophically dubious and confusing assumptions about the nature of cognition/SGD/etc. That is to say, a successful proposal in AI control comes with an argument that might be convincing to established security professionals if they thought about it, but there is likely no such class of professionals capable of lending your proposal external legitimacy in AI alignment.
The intelligence explosion might happen with less-fully-AGI AIs, who will also be doing some alignment work on the side. It’s important for them to not escape and do other bad stuff until they’ve solved alignment. We can give ourselves more time to use smart AIs to help with alignment if we have better AI control.
There is a bunch of research that is useful for both alignment and control, e.g. model organisms of deception, interpretability techniques, oversight techniques, etc. More people should analyze such research from the perspective of “how can this be helpful even if it fails to produce an aligned AI?”
Better control solutions make AI more economically useful, which speeds up the AI race and makes it even harder to do an AI pause.
When we have controlled unaligned AIs doing economically useful work, they probably won’t be very useful for solving alignment. Alignment will still be philosophically confusing, and it will be hard to trust the alignment work done by such AIs. Such AIs can help solve some parts of alignment problems, parts that are easy to verify, but alignment as a whole will still be bottle-necked on philosophically confusing, hard to verify parts.
Such AIs will probably be used to solve control problems for more powerful AIs, so the basic situation will continue and just become more fragile, with humans trying to control increasingly intelligent unaligned AIs. This seems unlikely to turn out well. They may also persuade some of us to trust their alignment work, even though we really shouldn’t.
So to go down this road is to bet that alignment has no philosophically confusing or hard to verify parts. I see some people saying this explicitly in the comments here, but why do they think that? How do they know? (I’m afraid that some people just don’t feel philosophically confused about much of anything, and will push forward on that basis.) But you do seem to worry about philosophical problems, which makes me confused about the position you take here.
BTW I have similar objections to working on relatively easy forms of (i.e., unscalable) alignment solutions, and using the resulting aligned AIs to solve alignment for more powerful AIs. But at least there, one might gain some insights into the harder alignment problems from working on the easy problems, potentially producing some useful strategic information or making it easier to verify future proposed alignment solutions. So while I don’t think that’s a good plan, this plan seems even worse.
It currently seems unlikely to me that marginal AI control research I’m excited about is very economically useful. I agree that some control or control-adjacent research will end up being at least somewhat economically relevant.
I can also imagine worlds where AIs are clear schemers that are clearly unproductive to use by default, and thus control increases deployment through both straightforward incentives and risk compensation (though this maybe requires the AIs to have made a mistake or to have somewhat strange preferences as they would probably prefer fooling our tests).
My understanding is that your concerns on downside are maybe some combination of:
AI control research will be economically very useful (e.g. accelerates AI by 6 months conditional on 5 year timelines).
Control will result in irrational risk-compensation-style actions, where reducing earlier risks provides false comfort about later risks (since control isn’t scalable), and this will ultimately make the situation worse.
AI control prevents earlier warning shots which would have created more will for costly actions that save us later. (I’m not sure if this is actually your concern, though some statements pattern matched to this.)
I’m very skeptical of (1) for marginal control research. It feels like this has to imply incredibly strong levels of inefficiency and usefulness of control research. Perhaps you’re imagining a smaller acceleration effect, but think the benefits are also tiny relative to this small acceleration?
I’m somewhat sympathetic to things being somewhat inefficient such that if the field of people who might do control research instead did capabilities, that would be a notable acceleration (maybe 5%), but it seems like control research should be a massive reduction in acceleration relative to this.
On (2), this sounds like a weird intermediate level of competence where they are actually sensitive to full-on scheming risks at an earlier point (when control works), but end up thinking this will scale despite it being relatively obvious it won’t. I agree that control might end up being an excuse for scaling, but I don’t think that it will be a counterfactual excuse (some other excuse would have been found), and I think we should just make sure to argue on the merits at that point.
I do think that (2) points toward focusing on advancing evaluation methodology more than techniques such that it would be relatively clear when things fail if people are being unreasonable and it is relatively easy to call out unreasonableness.
I think not doing control due to (3) can put you in pretty scary situations: how can we know when you’ll get scary warning shots vs full AI takeover? Further, AIs presumably want to avoid warning shots, though it is unclear how much AIs will systematically want to advance the interests of other AIs. At a more basic level, I don’t love arguments of the form “actually, having bad things happen will actually be good, so we shouldn’t try to prevent bad things which are pretty close to the actual bad things we’re worried about”. All that said, I’m sympathetic to researchers avoiding work on mitigating misuse or smaller-than-full-takeover harms due to this motivation. (I still think AI companies should have an obligation to make it very clear what risks they are imposing and to try to mitigate risks at multiple scales, but this obligation doesn’t apply to researchers. If it is hard to hire good people to work on misuse, try paying more.)
As in, your view is that:
Greatly accelerating all alignment work which isn’t somewhat philosophically/conceptually confusing won’t be very useful for solving alignment. (Because of conceptual bottlenecks.)
Philosophy/conceptual work is sufficiently hard to check when an AI is trying to screw you over that control can’t force scheming AIs to notably help with this work. (Note that even if evaluation in this regime is much harder than generation, that isn’t clearly a blocker, because we potentially only have to supervise a small subset of outputs.) (Note that this is distinct from the view that AIs won’t be capable enough to help with this work, as is currently the case.)
I think there is a lot of very helpful alignment work which isn’t conceptually bottlenecked and agendas with fewer conceptual barriers could go very far.
Further, I’m skeptical about evaluation being so hard.
Beyond that, I’m only imagining part of the theory of change of control work being to “solve alignment” or work on alignment.
Alternatives:
Buy time by both making it cheaper for actors to stall and also making risks better understood[1].
Also buy time by pushing AI takeover later in time.
Work on alternative exit conditions like emulated minds (difficult but maybe doable).
Additional time with very powerful AIs seems useful for studying them even if we can’t get any acceleration from automation. Pausing at the brink is more difficult in terms of costs and coordination (as there will be more close actors and more incentive), but it might also be radically easier to make the case for pausing relative to today. So, it overall seems easier to arrange, and time at the brink also seems much more useful all around.
(I also think that pausing looks less good if it isn’t part of a well implemented, thoughtful and extremely high effort approach which e.g. actively intervenes with hardware and is aggressive on international efforts. So, I’m not very into pausing “for the wrong reasons”.)
(Edited to add)
I’d like to argue that there is a lot of helpful stuff which isn’t conceptually bottlenecked.
Concretely, let’s imagine targeting aligning AI systems which are capable enough to dominate top human experts in speed and cost across all relevant domains (including philosophy/conceptual work). In particular, we want to align them well enough that we’re happy to totally defer to them on really tricky questions like “what should be our high level approach for handling risks from further AI systems”[2]. I think this is a pretty good target for a rational actor, since then you can actually hand off decision making to such AIs. (They might decide to work on alignment of smarter systems or pursue totally different routes.)
It seems difficult though not impossible that we could reasonably confidently (95%)[3] reach this target via very prosaic/empirical approaches using huge amounts of AI labor (e.g. a combination of testing things on huge numbers of model organisms, doing some high level interp (both white and black box) to get a bit of a sense for what is going on in AI systems in general, and verifying that AIs seem to do a good job in held-out conceptual domains where we happen to know the answers). And, I think prosaic approaches with huge amounts of labor could also substantially increase the chances of hitting this target (e.g. a 4x risk reduction) even if we can’t hit this level of confidence.
This doesn’t really look like “solve alignment”, but in practice it reduces risk a bunch.
It’s also possible that better understanding of risks will indicate that proceeding to wildly superhuman AI immediately is fine, though I’m skeptical.
The AIs might ask us questions or whatever to figure out our preferences.
Obviously, 95% confidence is objectively unreasonable, but it is notable improvement over doing nothing, particularly in worlds where we find evidence for misalignment.
My vague plan along these lines is to attempt as hard as possible to defer all philosophically confusing questions to the “long reflection”, and to use AI control as a tool to help produce AIs that can help preserve long term option value (including philosophical option value) as best as possible.
I separately have hope that we can solve “the entire problem” at some point, e.g. through ARC’s agenda (which I spend most of my time trying to derisk and advance).
People interested in a discussion about control with someone who is maybe closer to Wei Dai’s perspective might be interested in my dialogue with habyrka.
I think if the first powerful unaligned AI remained in control instead of escaping, it might make a real difference, because we could engineer and test alignment ideas on it, rather than develop alignment ideas on an unknown future AI. This assumes at least some instances of it do not hide their misalignment very well.
I think a key difference is I do believe the technical alignment/control problem as defined essentially requires no philosophical progress or solving philosophical problems like the hard problem of consciousness, and I believe the reason for this comes down to both a general point and a specific point.
In general, one of the reasons I believe philosophy tends not to be a productive area compared to other branches of science is that philosophers usually either work on problems that are by now essentially proven to be intractable, or try to solve a problem in far too much generality without doing any experiments, and that’s when they aren’t straight up solving fictional problems (I believe a whole lot of possible-worlds philosophizing is in that category).
This is generally because philosophers do far too much back-chaining compared to front-chaining on a lot of problems.
For the specific point about alignment/control agendas, it’s because the problem of AI alignment isn’t a problem about what goals you should assign the AI, but rather whether you can put goals into the AI system such that the AI will reliably follow your goals at all.
I really think if you want to tell a story of AI Control work being good (especially compared to working on preventing AGI from being built in the first place), the important and difficult part is figuring out how to actually use these more powerful AI systems to either achieve some kind of global moratorium, or make unprecedented progress on the hard parts of the AI Alignment problem.
When I see most people start thinking about control, I rarely see them interface with either of these two things, and honestly, I mostly see them come up with cool additional commercial applications of AI, which I think will mostly make the present situation worse. To be clear, I think thinking about how to use AI systems for global coordination or making progress on the hard parts of the AI Alignment problem is tractable, but I haven’t seen much work on it, and I am worried it is not the default thing that people will work on when they start thinking about control.
I think it is unlikely for control work to buy humanity much time until someone builds a very powerful unaligned AI system, at least at our present levels of coordination tools. I think controlled AI systems might help with coordination here, but I do think someone needs to do the actual hard work of thinking about how it might.
(I go into this and various related things in my dialogue with Ryan on control)
FWIW, I don’t think of myself as having a super strong opinion on whether working on control is better than working on preventing people from building catastrophically dangerous AI (and I’ve seriously considered doing the latter).
(I do feel more optimistic than you about being able to use controlled AIs for something useful.)
@Buck do you or Ryan have a writeup that includes: (a) a description of the capabilities of a system that you think would be able to do something useful for the sorts of objectives that Habryka talks about and (b) what that something useful is.
Bonus points if it has (c) the likelihood that you think such a system will be controllable by 20XX and (d) what kind of control setup you think would be required to control it.
On (a) and (b), we describe this at a high level here.
We don’t really have anything written on (c) or (d). (c) really depends a lot on effort, so I’d probably prefer to talk mostly about (d), including what evaluations would be needed at various points, etc.
For (a), I think we potentially care about all of:
Systems which are perhaps qualitatively similarly smart to OK software engineers and which are capable of speeding up R&D work by 10x (speedups aren’t higher due to human bottlenecks). (On a nearcast, we’d expect such systems to be very broadly knowledgeable, pretty fast, and very well tuned for many of their usages.)
Systems which nearly strictly dominate top human scientists on capability and which are perhaps similar in qualitative intelligence (I’d guess notably, but not wildly, weaker, and compensating in various ways). Such systems likely have some domains/properties in which they are much better than any human or nearly any human.
Systems which are qualitatively smarter than any human by a small amount.
It’s likely control is breaking down by (3) unless control ends up being quite easy or the implementation/evaluation is very good.
On (b) we plan on talking more about this soon. (Buck’s recent EAGx talk is pretty up to date with our current thinking, though this talk is obviously not that detailed. IDK if you can find a recording anywhere.)
Does that mean that you believe that after a certain point we would lose control over AI? I am new to this field, but doesn’t this fact spell doom for humanity?
By “control”, I mean AI Control: approaches aiming to ensure safety and benefit from AI systems, even if they are goal-directed and are actively trying to subvert your control measures.
AI control stops working once AIs are sufficiently capable (and likely doesn’t work for all possible deployments that might eventually be otherwise desirable), but there could be other approaches that work at that point. In particular, aligning systems.
The main hope I think about is something like:
Use control until AIs are capable enough that if we trusted them, we could obsolete top human scientists and experts.
Use our controlled AI labor to do the work needed to make systems which are capable enough, trustworthy enough (via alignment), and philosophically competent enough that we can safely hand things off to them. (There might be some intermediate states to get to here.)
Have these systems which totally obsolete us figure out what to do, including figuring out how to align more powerful systems as needed.
We discuss our hopes more in this post.
Re a, there’s nothing more specific on this than what we wrote in “the case for ensuring”. But I do think that our answer there is pretty good.
Re b, no, we need to write some version of that up; I think our answer here is ok but not amazing, writing it up is on the list.
yep agreed, I have a bunch of vague plans in this direction. I most generally think that AI control is a pretty good tool in the toolbox, and is unlikely to make things much worse but plausibly makes things much better.
I think AI control agendas are defined in such a way that this metric isn’t as relevant as you think it is:
Because the agenda isn’t trying to make AIs alignable, but to make them useful and not break out of labs, so the question of the timeline to unaligned AI is less relevant than it is for most methods of making safe AI.
This topic is important enough that you could consider making a full post.
My belief is that this would improve reach, and also make it easier for people to reference your arguments.
Consider: you believe there is a 45% chance that alignment researchers would be better suited pivoting to control research. I do not expect a quick take to reach anywhere near that many people, and it has a low chance of catalysing dramatic, institutional-level change.
Yes, I agree. If I had more time, this would have been a top-level post. If anyone reading wants to write such a post using my quick take as a base, I would be happy to take a look and offer comments. I might do it myself at some point as well.
Just make it a full post without doing much if any editing, and link to this quick take and its comments when you do. A polished full post is better than an unpolished one, but an unpolished one is better than none at all.
I’m not sure if this is allowed here, but maybe you can ask an AI to write a draft and manually proofread for mistakes?
idk how much value that adds over this shortform, and I currently find AI prose a bit nauseating.
That’s fair. To be honest I’ve only used AI for writing code, I merely heard about other people having success with AI drafts. Maybe their situation was different, or they were bad at English to the point that AI writes better than them.
This feels like a pretty central cruxy point—and not just for the relevance of the pivotal act framing specifically. I think it’s underlying a whole difference of worldview or problem-solving approach.
A couple other points in a similar direction:
A thing I noticed in our discussion on the model delta with Christiano post: your criterion for useful AI safety work seems to be roughly “this will contribute some marginal value” as opposed to “this will address a bottleneck”.
Right at the top of this thread, you say: “I think Redwood Research’s recent work on AI control really ‘hits it out of the park’, and they have identified a tractable and neglected intervention that can make AI go a lot better”. Note what’s conspicuously missing there: tractable and neglected, but you don’t claim importance.
I would say that your mindset, when approaching AI safety, seems to commit an epsilon fallacy.
Sure, in principle a sum of numbers can be large without any individual number being large. In practice, the 80⁄20 rule is a thing, and everything has bottlenecks all the time. If work is not addressing a bottleneck, then it’s approximately useless.
(Somewhat more precisely: if marginal work is not addressing something which is a bottleneck on current margins, then it’s approximately useless.)
Of importance, tractability and neglectedness, importance is the most important. In practice, it is usually better to have a thousand people trying to solve a major bottleneck each with low chance of success, than a thousand people making tractable progress on some neglected issue which is not a bottleneck.
I think I disagree with your model of importance. If your goal is to make a sum of numbers small, then you want to focus your efforts where the derivative is lowest (highest? signs are hard), not where the absolute magnitude is highest.
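One toy way to write down that claim (my framing, not anything either comment commits to): if total risk decomposes as

$$
P(\text{doom}) \approx \sum_i p_i(e_i),
$$

where $e_i$ is the effort allocated to failure mode $i$, then the next unit of effort should go to $\arg\max_i \left|\partial p_i / \partial e_i\right|$, i.e. the term whose marginal derivative is largest in magnitude, which need not be the term whose current value $p_i$ is largest.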
The “epsilon fallacy” can be committed in both directions: both in thinking that any negative derivative is worth working on, and in thinking that any extremely large number is worth taking a chance on trying to improve.
I also separately think that “bottleneck” is not generally a good term to apply to a complex project with high amounts of technical and philosophical uncertainty. The ability to see a “bottleneck” is very valuable should one exist, but I am skeptical of the ability to strongly predict where such bottlenecks will be in advance, and do not think the historical record really supports the ability to find such bottlenecks reliably by “thinking”, as opposed to doing a lot of stuff, including trying things and seeing what works. If you have a broad distribution over where a bottleneck might be, then all activities lend value by “derisking” locations for particular bottlenecks if they succeed, and providing more evidence that a bottleneck is in a particular location if they fail. (Kinda like: https://en.wikipedia.org/wiki/Swiss_cheese_model) For instance, I think of “deceptive alignment” as a possible way to get pessimal generalization, and thus a probabilistic “bottleneck” for various alignment approaches. But there are other ways things can fail, and so one can still lend value by solving non-deceptive-alignment-related problems (although my day job consists of trying to get “benign generalization” out of ML, and thus does in fact address that particular bottleneck imo).
I also separately think that if someone thinks they have identified a bottleneck, they should try to go resolve it as best they can. I think of that as what you (John) are doing, and fully support such activities, although I think I am unlikely to join your particular project. I think the questions you are trying to answer are very interesting ones, and the “natural latents” approach seems likely to shed at least some light on what’s going on with e.g. the ability of agents to communicate at all.
I do think that “we don’t have enough information to know where the bottlenecks are yet” is in-general a reasonable counterargument to a “just focus on the bottlenecks” approach (insofar as we in fact do not yet have enough information). In this case I think we do have enough information, so that’s perhaps a deeper crux.
Hilariously, it seems likely that our disagreement is even more meta, on the question of “how do you know when you have enough information to know”, or potentially even higher, e.g. “how much uncertainty should one have given that they think they know”, etc.
I agree it is better work on bottlenecks than non-bottlenecks. I have high uncertainty about where such bottlenecks will be, and I think sufficiently low amounts of work have gone into “control” that it’s obviously worth investing more, because e.g. I think it’ll let us get more data on where bottlenecks are.
see my longer comment https://www.lesswrong.com/posts/A79wykDjr4pcYy9K7/mark-xu-s-shortform#8qjN3Mb8xmJxx59ZG
Well, this would be the lone crux. The rest of the stuff you wrote is about non-exploding AI, right? And is therefore irrelevant to the thing about everyone dying, except insofar as controlled non-exploding AI can help prevent uncontrolled exploding AI from killing everyone?
I directionally agree with this (and think it’s good to write about this more, strongly upvoted!)
For clarity, I would distinguish between two control-related ideas more explicitly when talking about how much work should go into what area:
“ensuring that if the AIs are not aligned [...], then you are still OK” (which I think is the main meaning of “AI control”)
Making ~worst-case assumptions about things like neural representations or inductive biases (which in practice means you likely rely on black-box methods, as in Redwood’s existing work on control).
I think 2. is arguably the most promising strategy for 1., but I’ve occasionally noticed myself conflating them more than I should.
1. gives you the naive 50⁄50 equilibrium, i.e. 50% of people should naively work on this broad notion of control. But I think other reasons in favor apply more strongly to 2. (e.g. the tractability arguments are significantly weaker for model internals-based approaches to 1.)
I also think (non-confidently) that 2. is what’s really very different from most existing research. For control in the first, broad sense, some research seems less clearly on either the control or alignment side.
But I do agree that safety-motivated researchers should evaluate approaches from a control perspective (in the broad sense) more on the margin. And I also really like the narrower black-box approach to control!
I agree, and regret focusing as much as we did on 2 in the past; I’m excited for work on “white box control” (there’s some under way, and I’m excited for more).
We focused on black box control because it’s much more obvious how you’d make a legible and conservative safety case for it. But in hindsight this was a mistake, I think: people probably aren’t going to actually make legible or conservative safety arguments for their deployments, so there’s less of an advantage than I’d thought.
I plan to write more about white box control soon (but I plan to write about a lot of stuff…)
This seems to presume that you can divide up research topics into “alignment” vs “control” but this seems wrong to me. E.g. my categorization would be something like:
Clearly alignment: debate theory, certain flavors of process supervision
Clearly control: removing affordances (e.g. “don’t connect the model to the Internet”)
Could be either one: interpretability, critique models (in control this is called “untrusted monitoring”), most conceptions of ELK, generating inputs on which models behave badly, anomaly detection, capability evaluations, faithful chain of thought, …
Redwood (I think Buck?) sometimes talks about how labs should have the A-team on control and the B-team on alignment, and I have the same complaint about that claim. It doesn’t make much sense for research, most of which helps with both. It does make sense as a distinction for “what plan will you implement in practice”—but labs have said very little publicly about that.
Other things that characterize work done under the name of “control” so far are (1) it tries to be very concrete about its threat models, to a greater degree than most other work in AI safety, and (2) it tries to do assurance, taking a very worst case approach. Maybe you’re saying that people should do those things more, but this seems way more contentious and I’d probably just straightforwardly disagree with the strength of your recommendation (though probably not its direction).
Nitpick: I would also quibble with your definitions; under your definitions, control seems like a subset of alignment (the one exception if you notice the model is scheming and then simply stop using AI). I think you really have to define alignment as models reliably doing what you want independent of the surrounding context, or talk about “trying to do what you want” (which only makes sense when applied to models, so has similar upshots).
Tbc I like control and think more effort should be put into it; I just disagree with the strength of the recommendation here.
Agreed. (And I made a comment to this effect on Mark’s draft of this short form.) I don’t think it is that productive to classify research directions, the central thing we’re trying to point at is a classification for reasons why you might think that you’re fine where one reason is “the AI couldn’t do bad stuff without getting caught” and another reason is “the AI wouldn’t do bad stuff” (obviously simplifying somewhat).
I also think that white-box control should typically be considered pretty separately from black-box control and the story for it is often pretty different. (Though when thinking about white-box control you should often adopt the central control perspective, e.g., you should incorporate white-box runtime detection techniques (like probes) into an overall control evaluation via exogenously trying to determine the probe reliability and other important aspects of the probe. For instance, the probe can take advantage of black box strategies and how safe we are depends on how many actions AIs need to cause huge problems.)
The claims I’d like to make is:
There should be more research specifically targeting a black-box control story and which is operating from a black-box control perspective. (I’d guess like 20% of the overall AI safety portfolio should be this and maybe 30% of safety work at labs. This is not including work like “get AIs to be more useful at alignment research and get better at DC eval methodology” which should also be subsidized from a control perspective but seem non-central.)
When people do research which could be pretty useful from a black-box control perspective (e.g. debate or other scalable oversight style research), they should sometimes consider explicitly thinking about the control perspective and adjusting their research accordingly (to the extent they can understand this perspective well enough). To the extent that people have an alignment style story they think about while doing their research, they should be roughly equally excited about thinking about the control story. I also find that the control story is often more straightforward.
People thinking about safety plans for near term powerful AI should strongly consider control.
You’re potentially thinking about footnote 4 in our post on control:
This footnote is a bit confusingly worded, but I think the situation we were trying to say is “Suppose you expect dangerously powerful AI in the next year, and the current year is prior to 2032. We’re guessing you should put the better team on control.”
This is different than research in the run up.
I also think that to the extent people are trying to do backchained research focused on specific applications, it makes sense to put the better team on control over reducing the chance that scheming arises. (But these aren’t the only classes of interventions, and some interventions don’t nicely fit into these buckets: e.g., you can do work on differentially making AIs more useful for alignment work, which isn’t well classified as either, and you can work on high-level interpretability which aims to roughly understand how AIs make decisions in some cases (this high-level interp doesn’t very directly help with reducing the chance that scheming arises, but could help with a bunch of stuff).)
I agree with bullet points 1, 2, 3, 6 and 7, partially agree with bullet point 5, and disagree with bullet point 4.
Thus, I agree with the central claim here:
For more of my analysis on the bullet points, read the rest of the comment.
For bullet point 1, I basically agree with this, mostly due to not favoring binary assumptions and instead preferring continuous-quantity reasoning, which tends to be a better match for the IRL world and also gives you more information than binary outcomes.
I really like bullet point 2, and also think that even in a scenario where it’s easy to prevent defection, you should still have controls that make defecting employees have much less reward and much more punishment for subversive actions.
I deeply agree with point 3, and I’d frame AI control in one of 2 ways:
As a replacement for the pivotal act concept.
As a pivotal act that doesn’t require destruction or death, and doesn’t require you to overthrow nations in your quest.
A nitpick: AI labor will be the huge majority of alignment progress in every stage, not just the early stage.
I think one big reason the pivotal act frame dominated a lot of discussions is the assumption that we would get a pure software singularity which would FOOM in several weeks, but reality is shaping up to not be a pure software-singularity, since physical stuff like robotics and data centers still matters.
There’s a reason why every hyperscaler is trying to get large amounts of power and datacenter compute contracts, because they realize that the singularity is bottlenecked currently on power and to a lesser extent compute.
I disagree with 4, but that’s due to my views on alignment, which tend to view it as a significantly easier problem than the median LWer does, and in particular I see essentially zero need for philosophical deconfusion to make the future go well.
I agree that AI control enhances alignment arguments universally, and provides more compelling stories. I disagree with the assumption that all alignment plans depend on dubious philosophical assumptions about the nature of cognition/SGD.
I definitely agree with bullet point 6 that superhuman savant AI could well play a big part in the intelligence explosion, and I believe this most for formal math theorem provers/AI coders.
Agree with bullet point 7, and think it would definitely be helpful if people focused more on “how can this be helpful even if it fails to produce an aligned AI?”
I feel like our viewpoints have converged a lot over the past couple years Noosphere. Which I suppose makes sense, since we’ve both been updating on similar evidence! The one point I’d disagree with, although also wanting to point out that the disagreement seems irrelevant to short term strategy, is that I do think that philosophy and figuring out values is going to be pretty key in getting from a place of “shakey temporary safety” to a place of “long-term stable safety”. But I think our views on the sensible next steps to get to that initial at-least-reasonable-safety sound quite similar.
Since I’m pretty sure we’re currently in a quite fragile place as a species, I think it’s worth putting off thinking about long term safety (decades) to focus on short/medium term safety (months/years).
I would suggest 50% of researchers work on a broader definition of control, including “control” proper, technical governance work, and technical outreach (scary demos, model organisms of misalignment).
I’m in the process of trying to build an org focused on “automated/augmented alignment research.” As part of that, I’ve been thinking about which alignment research agendas could be investigated in order to make automated alignment safer and trustworthy. And so, I’ve been thinking of doing internal research on AI control/security and using that research internally to build parts of the system I intend to build. I figured this would be a useful test case for applying the AI control agenda and iterating on issues we face in implementation, and then sharing those insights with the wider community.
Would love to talk to anyone who has thoughts on this or who would introduce me to someone who would fund this kind of work.
I don’t see a significant difference in your distinction between alignment and control. If you say alignment is about doing what you want (which I strongly disagree with in its generality, e.g. when someone might want to murder or torture people or otherwise act unethically), that obviously includes your wanting to “be OK” when the AI didn’t do exactly what you want. Alignment comes in degrees, and you merely seem to equate control with non-perfect alignment and alignment with perfect alignment. Or I might be misunderstanding what you have in mind.
The actual definition comes from the AI control tag, linked below:
https://www.lesswrong.com/tag/ai-control
I disagree with this take. A.I. control will only be important in a multipolar situation in which no single A.I. system can create a gray goo catastrophe etc. But if such pivotal acts are impossible and no singular A.I. takes control, and instead many A.I.’s are competing, then some groups will develop better or worse control for economic reasons, and working on it now won’t affect existential risk much. I don’t think I can see a situation where control matters: only a few players have A.G.I. for a very long time, none escape or are open-sourced, but also none gain a decisive advantage?
I do see advantages to hardening important institutions against cyberattacks and increasing individual and group rationality so that humans remain agentic for as long as possible.
I think the key story for wins from AI control specifically is a scenario where we have human-level to slightly superhuman AI inside a lab that isn’t aligned and wants to take over; it turns out to be easier to control which affordances the AI is given than to align it, and in particular easier to catch an AI scheming than to make it aligned; and the lab wants to use AIs for alignment/control research.
I don’t see this as a probable scenario, but I do see it as a valuable scenario to work on, so it does have value in my eyes.
Imagine that there are just a few labs with powerful A.I., all of which are responsible enough to use the existing A.I. control strategies that have been prepared for this situation, and none of which open-source their models. Now if they successfully use their A.I. for alignment, they will also be able to successfully use it for capabilities research. At some point, control techniques will no longer be sufficient, and we have to hope that by then A.I.-aided alignment has succeeded enough to prevent bad outcomes. I don’t believe this is a serious possibility; the first A.I. capable of solving the alignment problem completely will also be able to deceive us about solving the alignment problem (more) easily. Up to and including that point, A.I. will produce partial, convincing solutions to the alignment problem which human engineers will go forward with. Control techniques will simply set a lower bound on the capabilities of the first unaligned A.I. that escapes, which is plausibly a net negative since it means we won’t get early, high-impact warnings. If occasional A.I. escapes turn out to be non-lethal, economic incentives will favor better A.I. control, so working on this early won’t really matter. If occasional A.I. escapes turn out to be lethal, then we will die unless we solve the alignment problem ourselves.
One assumption that could defuse the danger is that we can apply the controlled AIs to massively improve computer security, such that defense wins the race against attacks on computers.
This is at least semi-plausible for the first AIs, which will almost certainly be wildly superhuman at both coding and formally proving theorems in, for example, Lean, because there is a very obvious path to getting the training data to bootstrap yourself, and, as with Go, it is relatively easy to verify that your solution actually works.
https://www.lesswrong.com/posts/2wxufQWK8rXcDGbyL/access-to-powerful-ai-might-make-computer-security-radically
Another assumption that could work in favor of AI control is that once the AIs are controlled enough, labs start using the human-level AIs to further enhance control/alignment strategies; thus, from the base case, we can inductively argue that each next level of AI is even better controlled, until you reach a limit, and that capability gains are directed toward making the AI safer, not more dangerous.
Improving computer security seems possible but there are many other attack vectors. For instance, even if an A.I. can prove a system’s software is secure, it may choose to introduce social engineering style back doors if it is not aligned. It’s true that controlled A.I.’s can be used to harden society but overall I don’t find that strategy comforting.
I’m not convinced that this induction argument goes through. I think it fails on the first generation that is smarter than humans, for basically Yudkowskian reasons.
Hard to be sure without more detail, but your comment gives me the impression that you haven’t thought through the various different branches of how AI and geopolitics might go in the next 10 years.
I, for one, am pretty sure AI control and powerful narrow AI tools will both be pretty key for humanity surviving the next 10 years. I don’t expect us to have robustly solved ASI-aligment in that timeframe.
I also don’t expect us to have robustly solved ASI-alignment in that timeframe. I simply fail to see a history in which AI control work now is a decisive factor. If you insist on making a top level claim that I haven’t thought through the branches of how things go, I’d appreciate a more substantive description of the branch I am not considering.
This is a good and important point. I don’t have a strong opinion on whether you’re right, but one counterpoint: AI companies are already well-incentivized to figure out how to control AI, because (as Wei Dai said) controllable AI is more economically useful. It makes more sense for nonprofits / independent researchers to do work that AI companies wouldn’t do otherwise.
This post raises an important perspective on the practicalities of AI Control versus Alignment. Given the potential for AI to function productively even when not fully aligned, do you believe that current AI control methods are scalable enough to handle future AGI systems? Additionally, what would be the main challenges in ensuring that AI control strategies are robust in highly dynamic or emergency scenarios?
AI safety researchers might be allocated too heavily to Anthropic compared to Google Deepmind
Some considerations:
Safety researchers should want Google Deepmind (GDM) to have a robust and flourishing safety department. It seems plausible that GDM will be able to create “the smartest” models: they have lots of talent, and own lots of computers. (see e.g. https://epochai.org/data/notable-ai-models#computing-capacity)
Anthropic (ANT) might run into trouble in the future due to not owning their own computers, e.g. if Amazon (or wherever they’re renting their compute from) starts an internal scaling competitor and decides to stop renting out most of their compute.
ANT has a stronger safety culture, and so it is a more pleasant experience to work at ANT for the average safety researcher. This suggests that there might be a systematic bias towards ANT that pulls away from the “optimal allocation”.
GDM only recently started a Bay Area-based safety research team/lab (with members like Alex Turner). So if people had previously decided to work for ANT based on location, they now have the opportunity to work for GDM without relocating.
I’ve heard that many safety researchers join ANT without considering working for GDM, which seems like an error, although I don’t have first-hand evidence that this is true.
ANT vs GDM is probably a less important consideration than “scaling lab” (ANT, OAI, GDM, xAI, etc.) vs “non-scaling lab” (US AISI, UK AISI, Redwood, ARC, Palisade, METR, MATS, etc. (so many...)). I would advise people to think hard about how joining a scaling lab might inhibit their future careers by e.g. creating a perception that they are “corrupted” [edit: I mean viewed as corrupted by the broader world in situations where e.g. there is a non-existential AI disaster or there is rising dislike of how AI is being handled by corporations more broadly, similar to how working for an oil company might result in various climate people thinking you’re corrupted, even if you were trying to get the oil company to reduce emissions. I personally do not think GDM or ANT safety people are “corrupted”] (in addition to strengthening their careers, which I expect people to spend more time thinking about by default).
Because ANT has a stronger safety culture, doing safety work at GDM involves more politics and navigating bureaucracy, and thus might be less productive. This consideration applies most if you think the impact of your work comes mostly through the object-level research you do, which I think is possible but not that plausible.
(Thanks to Neel Nanda for inspiring this post, and Ryan Greenblatt for comments.)
I think this depends on whether you think AI safety at a lab is more of an O-ring process or a swiss-cheese process. Also, if you think it’s more of an O-ring process, you might be generally less excited about working at a scaling lab.
Some possible counterpoints:
Centralization might actually be good if you believe there are compounding returns to having lots of really strong safety researchers in one spot working together, e.g. in terms of having other really good people to work with, learn from, and give you feedback.
My guess would be that Anthropic resources its safety teams substantially more than GDM in terms of e.g. compute per researcher (though I’m not positive of this).
I think the object-level research productivity concerns probably dominate, but if you’re thinking about influence instead, it’s still not clear to me that GDM is better. GDM is a much larger, more bureaucratic organization, which makes it a lot harder to influence. So influencing Anthropic might just be much more tractable.
I think two major cruxes for me here are:
is it actually tractable to affect Deepmind’s culture and organizational decisionmaking
how close to the threshold is Anthropic for having a good enough safety culture?
My current best guess is that Anthropic is still under the threshold for good enough safety culture (despite seeming better than I expected in a number of ways), and meanwhile that Deepmind is just too intractably far gone.
I think people should be hesitant to work at any scaling lab, but I think it might be possible to make Anthropic “the one actually good scaling lab”, and I don’t currently expect that to be tractable at Deepmind. I think “having at least one” seems good for the world (although it’s a bit hard for me to articulate why at the moment).
I am interested in hearing details about Deepmind that anyone thinks should change my mind about this.
This viewpoint is based on having spent at least tens of hours, at various times, trying to learn about and influence both orgs’ cultures.
In both cases, I don’t get the sense that people at the orgs really have a visceral sense that “decisionmaking processes can be fake”. I think such processes will be fake by default, and the org is better modeled as following general incentives; DeepMind has too many people and moving parts at a low enough density that it doesn’t seem possible to fix. For me to change my mind about this, I would need someone there to look me in the eye and explain that they do have a visceral sense of how organizational decisionmaking processes can be fake, and why they nonetheless think DeepMind is tractable to fix. I assume @Rohin Shah and @Neel Nanda can’t really say anything publicly that’s capable of changing my mind, for various confidentiality and political reasons, but, like, that’s my crux.
(Convincing me in more general terms that “Ray, you’re too pessimistic about org culture” would hypothetically somehow work, but you have a lot of work to do given how thoroughly those pessimistic predictions came true about OpenAI.)
I think Anthropic also has this problem, but it has enough almost-aligned leadership and actually-pretty-aligned people that it feels at least possible to me for them to fix it. The main things that would persuade me that they are over the critical threshold are if they publicly spent social capital on clearly spelling out why the x-risk problem is hard, and made explicit plans to not merely pause for a bit when they hit an RSP threshold but (at least in some circumstances) advocate strongly for a global, government-enforced shutdown for like 20+ years.
I think your pessimism of org culture is pretty relevant for the question of big decisions that GDM may make, but I think there is absolutely still a case to be made for the value of alignment research conducted wherever. If the research ends up published, then the origin shouldn’t be held too much against it.
So yes, having a few more researchers at GDM doesn’t solve the corporate race problem, but I don’t think it worsens it either.
As for pausing, I think it’s a terrible idea. I’m pretty confident that any sort of large scale pause would be compute threshold focused, and would be worse than not pausing because it would shift research pressure towards algorithmic efficiency. More on that here: https://www.lesswrong.com/posts/Kobbt3nQgv3yn29pr/my-theory-of-change-for-working-in-ai-healthtech?commentId=qwixG4xYeFdELb2GJ
It might be “fine” to do research at GDM (depending on how free you are to actually pursue good research directions, or how good a mentor you have). But, part of the schema in Mark’s post is “where should one go for actively good second-order effects?”.
I largely agree with this take & also think that people often aren’t aware of some of GDM’s bright spots from a safety perspective. My guess is that most people overestimate the degree to which ANT>GDM from a safety perspective.
For example, I think GDM has been thinking more about international coordination than ANT. Demis has said that he supports a “CERN for AI” model, and GDM’s governance team (led by Allan Dafoe) has written a few pieces about international coordination proposals.
ANT has said very little about international coordination. It’s much harder to get a sense of where ANT’s policy team is at. My guess is that they are less enthusiastic about international coordination relative to GDM and more enthusiastic about things like RSPs, safety cases, and letting scaling labs continue unless/until there is clearer empirical evidence of loss of control risks.
I also think GDM deserves some praise for engaging publicly with arguments about AGI ruin and threat models.
(On the other hand, GDM is ultimately controlled by Google, which makes it unclear how important Demis’s opinions or Allan’s work will be. Also, my impression is that Google was neutral or against SB1047, whereas ANT eventually said that the benefits outweighed the costs.)
Great post. I’m on GDM’s new AI safety and alignment team in the Bay Area and hope readers will consider joining us!
What evidence is there that working at a scaling lab risks creating a “corrupted” perception? When I try thinking of examples, the people that come to my mind seem to have quite successfully transitioned from working at a scaling lab to doing nonprofit / government work. For example:
Paul Christiano went from OpenAI to the nonprofit Alignment Research Center (ARC) to head of AI safety at the US AI Safety Institute.
Geoffrey Irving worked at Google Brain, OpenAI, and Google DeepMind. Geoffrey is now Chief Scientist at the UK AI Safety Institute.
Beth Barnes worked at DeepMind and OpenAI and is now founder and head of research at Model Evaluation and Threat Research (METR).
I was intending to warn about the possibility of future perception of corruption, e.g. after a non-existential AI catastrophe. I do not think anyone currently working on safety teams is perceived as “corrupted”, although I do think there is mild negative sentiment among some online communities (some parts of Twitter, Reddit, etc.).
> think hard about how joining a scaling lab might inhibit their future careers by e.g. creating a perception they are “corrupted”
Does this mean something like:
1. People who join scaling labs can have their values drift, and future safety employers will suspect by-default that ex-scaling lab staff have had their values drift, or
2. If there is a non-existential AGI disaster, scaling lab staff will be looked down upon
or something else entirely?
Basically (2), very small amounts of (1) (perhaps qualitatively similar to the amount of (1) you would apply to e.g. people joining US AISI or UK AISI)
The high-level claim seems pretty true to me. Come to the GDM alignment team, it’s great over here! It seems quite important to me that all AGI labs have good safety teams.
Thanks for writing the post!
My current taxonomy of rationalists is:
LW rationalists (HI!)
Facebook rationalists
Twitter rationalists
Blog rationalists
Internet-invisible rationalists
Are there other types of rationalists? Maybe like group-chat rationalists? or podcast rationalists? google doc rationalists?
Alternative taxonomy:
rationalists belonging to Eliezer
cryopreserved rationalists
rationalists trained by CFAR
aspiring rationalists
rationalists working for MIRI
legendary rationalists
metarationalists
those commenting on this taxonomy
those that tweet as if they were mad
Bayesians
et cetera
Zvi
those that from afar look like paperclips
:)
This made me chuckle. More humor:
Rationalists taxonomizing rationalists
Mesa-rationalists (the mesa-optimizers inside rationalists)
carrier pigeon rationalists
proto-rationalists
not-yet-born rationalists
literal rats
frequentists
group-house rationalists
EA forum rationalists
academic rationalists
meme rationalists
:)
I like this one.
I think there is a community of discord rationalists and tumblr rationalists.
I am a google doc rationalist! (Or I would like to be. Google docs are great.)
What are you using the word “rationalist” to mean? If you just mean “members of any subculture bearing some line of memetic descent from early Less Wrong” (which I don’t think deserves the pretentious term rationalist, but putting that aside), why is “communication platform” a useful way to chop up that social grouping? A lot of the same people use Less Wrong and Facebook and Twitter and Discord and Google Docs, and a lot of the people who use the same platform wouldn’t be in the same cluster if you were to analyze the graph of what actual conversations people are using these platforms to have.
It’s a natural way to cut it up from one’s own experience. Each platform has different affordances and brings out different aspects of people, and I get pretty different experiences of them on the different platforms mentioned.
John Flanagan: “An ordinary archer practices until he gets it right. A ranger practices until he never gets it wrong.”
I want to reword this to make it about rationality in a way that isn’t pretentious.
Cavilo, The Vor Game: “The key to strategy… is not to choose a path to victory, but to choose so that all paths lead to a victory.” is close to what I want, but not quite.
I’m not sure what to recommend, but a few words that come to mind that might be relevant and help you spark an idea:
resilient
robust
reliable
antifragile
high availability
self healing
overdetermined
Is the victory bit important in the quotation?
If it’s not about the victory/winning, and rather about the path/journey ….
A first draft that springs to mind as I type:
The key to rationality… is not to choose the label, but to choose to take every opportunity to improve/update your thinking.
(Can’t … stop … myself … from commenting: From what I’ve observed too much ego gets in the way of rational thinking sometimes.)
Why not something along the lines of “your rationality is measured as much by your worst case performance as your average case”?
Maybe “your worst case contributes to your average just as strongly as your best case”?
Epistemic status: rambles
Quarantine preparation has made me realize that a day’s worth of food is actually really cheap, doesn’t require much time to cook, and can be made fairly tasty for not much more; i.e., a day’s worth of easy-to-cook, relatively tasty food is about $5.
This requires some amortized costs for the easy-to-cook and relatively-tasty part, but not immensely large upfront costs (Instant Pot, spices, etc.).
This reference says that 40 million people dealt with hunger in the US. I am… super confused? I find it extremely difficult to believe that people literally couldn’t afford to buy food, so the explanation is probably something like “hunger is worth it to get dietary variety/convenience/etc.”, or “trapped by local incentive gradients”, or “have incentives to not save money, and go hungry when hit by proverbial ‘rainy days’”, or “people don’t realize how cheap food actually is”.
I’m still confused, though. I feel like there might be room for someone to write an infographic with information like “here’s a rotating course of 10 meals that are cheap and tasty, with lists of exactly what to buy on what day, how to cook everything with various types of kitchen equipment, substitutes in case the store doesn’t have various ingredients, and possible variants in case you get bored”. Crucially, the infographic would have to be really good. A possible explanation is that people who might potentially have to deal with hunger don’t have the slack to plan their meals, so they don’t, and none of the existing meal plans are understandable enough, or something.
I notice I’m still confused.
Also mildly confused by why soup kitchens make complicated foods instead of simple foods, but that confusion is nearly entirely resolved by various signaling considerations.
Maybe people who struggle with hunger don’t plan a rotating course of 10 meals because they are signaling that they aren’t so poor that they have to plan their meals so meticulously. Maybe planning a rotating course of 10 meals is much harder than I think it is. Maybe I’m far below where I think I am in terms of “ability to endure eating the same food over and over again”, and most people just can’t eat a rotating course of meals.
I notice I’m still confused.
I feel like I might be missing something really clear. Like something along the lines of “most people who go hungry don’t have kitchens/space to store ingredients/stable living situations/any slack at all whatsoever/something”.
It seems to me that during the quarantine I eat less than usual; either I am deluding myself, or it is a combination of having less physical activity (such as walking to/from job/lunch/shops/playground), being able to eat whenever I want (so there is no pressure to “eat a lot now, because the next opportunity will be 7 hours later”), making less superstimulating food (using less sugar and salt), and having other ways to get some hedons (e.g. taking a nap). Sometimes I cook a soup, and that’s most of my daily food.
And soups are really cheap. You take $1-2 worth of ingredients, cook them in water, add a little salt and spices; optionally eat with bread. Bread is cheap, salt is cheap, most spices are cheap (per portion); potatoes, broccoli, carrots, onions, and beans are cheap. Most of these things are around $1 per kg.
Okay, soups are not super healthy; cooking destroys vitamins. You should also get some fresh fruits and vegetables. Apples, tomatoes, cucumbers are $1-2 per 1 kg. You should definitely be able to eat healthy food for less than $5 a day.
What is expensive? Chocolate and other sweets, cheese, berries, nuts; I probably forgot something. You shouldn’t eat sweets; and you can afford the other things now and then even under $5 a day on average. (It is not an optimal diet; some people recommend eating berries and nuts every day. But still healthier than many people eat, including those who spend more money on food.)
On the other hand, we usually spend more than $5 per person per day, even during the quarantine. We spend a lot on sweets and cheese. The tastier ones are even more expensive than the basic ones, which are already more expensive than the actually useful food. Instant gratification—it’s addictive! The more stress I have during the day, the more I need something that will improve my mood instantly, even if it’s only for a moment.
Poor people probably have more stress, and thus less willpower to resist things full of sugar and salt. (Also alcohol and tobacco. Okay, the last one is technically not food, but it still comes from the same budget.)
Very poor people, e.g. the homeless, don’t have the place to cook. So many cheapest things are ironically out of their reach. Not having a fridge is also a problem.
Then, I assume many poor people don’t have good skills and habits. Some of them don’t have the necessary IQ, some have mental problems, some had a shitty upbringing.
There are a few things to keep in mind:
1) The claim that 40 million Americans “deal with hunger” is, um, questionable. Their citation leads to feedingamerica.org, which cites USDA’s Household Food Security in the United States report (https://www.ers.usda.gov/webdocs/publications/94849/err-270.pdf?v=963.1). The methodology used is an 11-question survey (18 for households with children), where answering 3 questions in the affirmative marks you as low food security. The questions asked are (naturally) subjective. Even better, the first question is this:
“We worried whether our food would run out before we got money to buy more.” Was that often, sometimes, or never true for you in the last 12 months?
That’s a real concern to have, but it is not what people are talking about when they say “dealing with hunger”. You can be running on a shoestring budget and often worry about whether you’ll have enough money for food without ever actually not having enough money for food.

2) A significant percentage of the population has non-trivial issues with executive function. Also, most of the population isn’t familiar with “best practices” (in terms of effective life strategies, basic finances, etc.). Most people simply don’t think about things like this systematically, which is how you get the phenomenon of ~50% of the population not being able to cover a $400 emergency (or whatever those numbers are; they’re pretty close). This would be less of an issue if those cultural norms were inherited, but you can’t teach something you don’t know, and apparently we don’t teach Home Economics anymore (not that it’d be sufficient, but it would be better than nothing). This is a subject that deserves a much more in-depth treatment, but I think as a high-level claim this is both close enough to true and sufficient as a cause for what we might observe here. Making an infographic with a rotating course of 10 cheap, easy-to-prepare, relatively healthy, and relatively tasty meals is a great idea, but it’ll only be useful to the sorts of people who already know what “meal prep” means. You might catch some stragglers on the margin, but not a lot.
3) The upfront costs are less trivial than they appear if you don’t inherit any of the larger items, and remember, 50% of the population can’t cover a mid-3-figure emergency. “Basic kitchen equipment” can be had for under $100, but “basic kitchen equipment” doesn’t necessarily set you up to prepare food in a “meal prep” kind of way.
2) is something that I sort of thought about but not with as much nuance. I agree that such an infographic would be only useful for people who were looking for an alternate meal preparation strategy or something.
3) If it’s true that people want to do meal-preppy type things but can’t cover the upfront costs, there might be gains from 0-interest microloans, maybe via some MLM-type scheme: I loan you money, then once you’ve saved some money and paid me back, you loan other people money too.
It seems like the bottom 20% of the US spends $2216 per year per income earner, or ~$6 per day, on food. Given that children themselves don’t have an income, they might spend less than $5 per person per day on food.
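As a quick sanity check of the per-day figure (a minimal sketch, taking the quoted $2216/year at face value rather than verifying it):

```python
# Quick check of the per-day arithmetic above, using the quoted figure.
annual_food_spend = 2216                   # USD per income earner per year, as quoted
print(round(annual_food_spend / 365, 2))   # ~6.07 USD per day
```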
People can drown in a river that’s on average 1m deep.
If you have DAI right now, minting on https://foundry.finance/ and swapping yTrump for nTrump on catnip.exchange is an almost guaranteed 15% profit.
Lesswrong posts that I want someone to write:
Description of pair debugging
Description of value handshakes
Maybe I’ll think of more later.
I found a reference to “value handshakes” here:
I think it refers to something like this: Imagine that a superintelligent human-friendly AI meets a superintelligent paperclip maximizer, and they both realize their powers are approximately balanced. What should they do?
For humans, “let’s fight, and to the victor go the spoils” is the intuitive answer, but the superintelligences can possibly do better. If they fight, they have a 50% chance of achieving nothing, and a 50% chance of winning the universe… minus whatever was sacrificed to Moloch, which could possibly be a lot. If they split the universe into halves and find a way to trust each other, that is better than war. But there is a possibility of an even better solution, where both of them agree to act as if they were a single superintelligence that values both humans and paperclips equally.
The cooperative solution can be better than a 50% split of the universe, because you could build paperclip factories in places humans care less about, such as uninhabitable planets; or perhaps you could find a way to introduce paperclips into the human environment without reducing human quality of life. For example, would you mind using paperclips to reinforce the walls of your house? Would you mind if almost all materials used to build stuff for humans contained little paperclips inside? Would you mind living in a simulation implemented on paperclip-shaped circuits? So maybe in the end, humans could get something like 70% of the potential utility of the universe, while 70% of the potential material would be converted to paperclips.
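To make the comparison concrete, here is a toy expected-value calculation; the win probability, the deadweight loss from fighting, and the 70% cooperative payoff are all made-up illustrative numbers, not claims from the discussion above:

```python
# Toy comparison of the three options described above, with made-up numbers.
# Each value is one agent's expected fraction of the maximum value it could get.

p_win = 0.5            # assumed: the two superintelligences are evenly matched
war_waste = 0.2        # assumed: fraction of value burned by fighting ("sacrificed to Moloch")

fight = p_win * (1 - war_waste)   # expected value of fighting, for each side
split = 0.5                       # each side gets exactly half the universe
merge = 0.7                       # assumed: gains from trade leave each side with ~70%

print(f"fight: {fight:.2f}, split: {split:.2f}, merge: {merge:.2f}")  # 0.40, 0.50, 0.70
```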
A weird but not-inaccurate way to think of log(n) (base 10) is as an answer to “how many digits does n have?”
This suggests that a weird but not-inaccurate way to think of a log-normal distribution is as a distribution where “the number of digits is normally distributed”
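A minimal sketch of that reframing, assuming base-10 logs and illustrative parameters (the specific numbers are mine, not from the post):

```python
# If X is log-normal, then log10(X) -- roughly "how many digits X has" -- is normal.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 6.0, 1.5                                  # assumed parameters for log10(X)
samples = 10 ** rng.normal(mu, sigma, size=100_000)   # X itself is log-normally distributed

digits = np.log10(samples)                            # "number of digits" of each sample
print(digits.mean(), digits.std())                    # ~6.0 and ~1.5: normal in digit-space
```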
There are a bunch of explanations of logarithm as length on Arbital.
Ignorance as evidence
I was answering a bunch of questions from Open Phil’s calibration test, of the form “when did <thing> happen?”. A lot of the time, I had no knowledge of <thing>, so I gave a fairly large confidence interval as a “maximum ignorance” type prediction (1900-2015, for example).
However, the fact that I have no knowledge of <thing> is actually moderate evidence that it happened “before my time”.
Example: “when did <person> die?” If I was alive when <person> died, there’s a higher chance of me hearing about their death. Thus not having heard of <person> is evidence that they died some time ago.
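As a rough illustration of that update, here is a toy Bayes calculation; all three probabilities are made up for the sake of the example:

```python
# Rough Bayes sketch: not having heard of <person>'s death is evidence
# that it happened "before my time". All numbers are illustrative.

p_heard_if_recent = 0.6   # assumed: chance I'd have heard of a death that happened in my lifetime
p_heard_if_old = 0.2      # assumed: chance I'd have heard of a death from before my time
prior_recent = 0.5        # assumed prior that the death happened during my lifetime

# P(died during my lifetime | never heard of it), via Bayes' rule
p_not_heard = (1 - p_heard_if_recent) * prior_recent + (1 - p_heard_if_old) * (1 - prior_recent)
posterior_recent = (1 - p_heard_if_recent) * prior_recent / p_not_heard
print(f"{posterior_recent:.2f}")  # ~0.33, down from the 0.5 prior
```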
Technically, it could also be evidence that you are dead, but your ghost cannot move to afterlife, probably because it is too attached to scoring internet points (a fate that awaits many of us, I am afraid).
(epistemic status: just kidding)
If you’re interviewing employees of a company about how good the company is, there’s positive bias because people who hate the company will have already left.
Sure. Also, current employees are disincentivized from being truthful about the bad parts. But you’re not applying statistics to the results, so that’s not terribly important. Such interviews provide limited evidence about the goodness of the company. They provide decent evidence about the potential coworkers you’re interviewing.
Generally, when you’re interviewing employees about a company about whether the company is any good, you’re trying to decide whether to work there yourself. And you’re evaluating whether any of them seem competent and interesting enough that you can tolerate being near them for any length of time.
Coinfection rates of COVID and normal flu are very low. If you have the set of flu/COVID symptoms, you’re basically guaranteed to have one or the other. You can test for the flu pretty easily. Therefore, people can just test for the flu as a proxy for testing for COVID.
Is this just a really obvious chain of reasoning that everyone has missed? Which one of my assumptions is wrong?
https://twitter.com/katyw2004/status/1236848300143280128 says coinfection rates are low
https://www.wikiwand.com/en/Rapid_influenza_diagnostic_test means we can test for the flu fast
Thus, if you have the set of flu/COVID symptoms, you’re basically guaranteed to have either flu or COVID.
Maybe the tests are only useful for people who don’t have symptoms, but if that’s not the case, then the flu test provides a lot of evidence as to whether or not someone has COVID (even if “basically guaranteed” is replaced with “probable”).
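Here is a toy version of that conditional-probability argument; the prior and the flu-test sensitivity/specificity are made-up assumptions, not real estimates:

```python
# Toy sketch: how much a negative rapid flu test shifts the odds toward COVID,
# assuming symptomatic patients have either flu or COVID (coinfection negligible).

p_flu = 0.7                # assumed prior: a symptomatic patient has flu rather than COVID
sens, spec = 0.6, 0.95     # assumed rapid-flu-test sensitivity and specificity

# P(COVID | negative flu test), via Bayes' rule
p_neg = (1 - sens) * p_flu + spec * (1 - p_flu)
p_covid_given_neg = spec * (1 - p_flu) / p_neg
print(f"{p_covid_given_neg:.2f}")  # ~0.50, up from the 0.30 prior
```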
Update: the CDC advises testing for the flu, and there are a lot of medical conditions that cause “flu-like” symptoms. It turns out that “flu-like” symptoms basically means “immune system doing things”, which is going to happen with most things your body doesn’t like.
Not everybody who has a cold has influenza (the flu) or COVID-19. There are many different viruses that cause influenza-like-illnesses.
Low is relative. 2% coincidence is still a bunch.
Moral uncertainty is a thing that people think about. Do people also think about decision theoretic uncertainty? E.g. how to decide when you’re uncertain about which decision theory is correct?
Decision-theoretic uncertainty seems easier to deal with, because you’re trying to maximise expected value in each case, just changing what you condition on in calculating the expectation. So you can just take the overall expectation given your different credences in the decision theories.
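A minimal sketch of what “take the overall expectation given your credences” might look like, using Newcomb-style payoffs and made-up credences purely for illustration:

```python
# Weight each theory's expected value for an action by your credence in that theory,
# then pick the action with the highest credence-weighted value. Numbers are illustrative.

credences = {"CDT": 0.4, "EDT": 0.6}                    # assumed credences in the theories
value_by_theory = {                                     # assumed per-theory action values
    "CDT": {"one-box": 1_000, "two-box": 1_001_000},
    "EDT": {"one-box": 1_000_000, "two-box": 1_000},
}

overall = {
    action: sum(credences[t] * value_by_theory[t][action] for t in credences)
    for action in ("one-box", "two-box")
}
best = max(overall, key=overall.get)
print(overall, "->", best)   # at these credences, one-boxing wins
```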