CEO at Conjecture.
I don’t know how to save the world, but dammit I’m gonna try.
Thanks for the comment!
Have I understood this correctly?
I am most confident in phases 1-3 of this agenda, and I think you have overall a pretty good rephrasing of 1-5, thanks! One note is that I don’t think of “LLM calls” as fundamental; I think of LLMs as a stand-in for “banks of patterns” or “piles of shards of cognition.” The exact shape of this can vary, LLMs are just our current most common shape of “cognition engines”, but I can think of many other, potentially better, shapes this “neural primitive/co-processor” could take.
I think there is some deep, as yet unformalized, concept of computer science that differentiates what are intuitively “cognitive”/”neural” type problems from “classical”/”code” type problems. Why can neural networks easily recognize dogs while doing it in regular code is hell? How can one predict ahead of time whether a given task can be solved with a given set of programming tools or neural network components? What’s needed is some kind of vastly more advanced form of Algorithmic Information Theory, one that can take as input your programming tools and libraries, and a description of the problem you are trying to solve, and output how hard it is going to be (or what “engineering complexity class” it would belong to, whatever that means). I think this is a vast, unsolved question of theoretical computer science, one I don’t expect we will solve any sooner than we will solve P vs NP.
So, in absence of such principled understanding, we need to find the “engineering approximation equivalent” to this, which involves using as much code as we can and bounding the neural components as much as we can, and then developing good practical engineering around this paradigm.
Maybe it’d be good to name some speculative tools/theory that you hope to have been developed for shaping CoEms, then say how they would help with some of:
As I see it, there are two main ways in which things look different in the CoEm frame:
First, the hope isn’t so much that CoEm “solves” these problems, but makes them irrelevant, because it makes it possible to not slip into the dangerous/unpredictable capabilities regime unexpectedly. If you can’t ensure your system won’t do something funky, you can simply choose not to build it, and instead decide to build something you can ensure proper behavior of. Then you can iterate, unlike in the current “pump as much LLM/RL juice as possible as fast as possible” paradigm.
In other words, CoEm makes it easier to distinguish between capabilities failures and alignment failures.
Most alignment research skips to trying to resolve issues like these first, at least in principle, then often backs off to develop a relevant theory. I can see why you might want to do the levers part first, and have theory develop along with experience building things. But it’s risky to do the hard part last.
Secondly, more speculatively, I expect these problems to dissolve under better engineering and understanding. Here I am trying to point at something like “physicalism” or “gears level models.” If you have gears level models, a lot of the questions you might ask in a non-gears-level model stop making sense/being relevant, and you find new, more fundamental questions and tradeoffs.
I think ontologies such as Agents/Goals are artifacts of poor understanding of deeper mechanics. If you can’t understand the inner mechanics of cell biology, then maybe psychology is the best you can do to predict a human. But if you can understand cell biology and construct a biological being from scratch, I think you don’t need the Agent framing, and it would be actively confusing to insist it is ontologically primitive somehow and must be “addressed” in your final description of the system you are engineering. These kinds of abstract/functionalist/teleological models might be a good source of inspiration for messing around, but this is not the shape that the true questions will have.
“Instrumental convergence” dissolves into questions of predictability, choices of resource allocation and aesthetic/ethical stances on moral patienthood/universal rights. Those problems aren’t easy, but they are different and more “fundamental”, more part of the territory than of the map.
Similarly, “Reflective stability of goals” is just a special case of predicting what your system does. It’s not a fundamental property that AGIs have and other software doesn’t.
The whole CoEm family of ideas is pointing in this direction, encouraging the uncovering of more fundamental, practical, grounded, gears level models, by means of iterative construction. I think we currently do not have good gears level models of lots of the important questions of AI/cognition/alignment, and I think the way to get there is by treating it as a software/physicalist/engineering problem, not presupposing an already higher level agentic/psychological/functionalist framing. (It’s like the epistemological equivalent of the AI Effect, but for good, lol.)
I think that picking a hard problem before you know whether that “hard problem” is real or not is exactly what leads to confusions like the “hard problem of consciousness”, followed by zero actual progress on problems that matter. I don’t actually think we know what the true “hard problems” are to a level of deconfusion that we can just tackle them directly and backchain. Backchaining from a confused or wrong goal is one of the best ways to waste an entire career worth of research.
Not saying it is guaranteed to solve all these problems, or that I am close to having solved all these problems, but this agenda is the type of thing I would do if I wanted to make iterative research progress into that direction.
This is often not true, and I don’t think your paradigm makes it true. E.g. often we lose legibility to increase capability, and that is plausibly also true during AGI development in the CoEm paradigm.
It’s kinda trivially true in that the point of the agenda is to get to legibility, and if you sacrifice on legibility/constructibility, you are no longer following the paradigm, but I realize that is not an interesting statement. Ultimately, this is a governance problem, not a technical problem. The choice to choose illegible capabilities is a political one.
Expensive why? Seems like the bottleneck here is theoretical understanding.
Literally compute and manpower. I can’t afford the kind of cluster needed to even begin a pretraining research agenda, or to hire a new research team to work on this. I am less bottlenecked on the theoretical side atm, because I need to run into a lot of bottlenecks from actual grounded experiments first.
Hi habryka, I don’t really know how best to respond to such a comment. First, I would like to say thank you for your well-wishes, assuming you did not mean them sarcastically. Maybe I have lost the plot, and if so, I do appreciate help in recovering it. Secondly, I feel confused as to why you would say such things in general.
Just last month, me and my coauthors released a 100+ page explanation/treatise on AI extinction risk that gives a detailed account of where AGI risk comes from and how it works, which was received warmly by LW and the general public alike, and which continues to be updated and actively publicised.
In parallel, our sister org ControlAI, a non-profit policy advocacy org focused solely on extinction risk prevention that I work with frequently, has published A Narrow Path, a similarly extensive writeup on principles of regulation to address xrisk from ASI, which ControlAI and I have pushed and discussed extensively with policy makers of multiple countries, and there are other regulation-promoting projects ongoing.
I have been on CNN, BBC, Fox News and other major news sources warning in no ambiguous terms about the risks. There are literally dozens of hours of podcast material, including from just last month, where I explain in excruciating depth the existential risk posed by AGI systems, where it comes from, and how it differs from other forms of AI risk. If you think all my previous material has “lost the plot”, then well, I guess in your eyes I never had it, not much I can do.
This post is a technical agenda that is not framed in the usual LW ideological ontology, and has not been optimized to appeal to that audience, but rather to identify an angle that is tractable and generalizes the problem without losing its core, and leads to solutions that address the hard core, which is Complexity. In the limit, if we had beautifully simple, legible designs for ASIs that we fully understand and can predict, technical xrisk (but not governance) would be effectively solved. If you disagree with this, I would have greatly enjoyed your engagement with what object level points you think are wrong, and it may have helped me write a better roadmap.
But it seems to me that you have not even tried to engage with the content of this post at all, and have instead merely asserted it is a “random rant against AI-generated art” and “name-calling.” I see no effort other than surface level pattern matching, or any curiosity to how it might fit with my previous writings and thinking that have been shared and discussed.
Do you truly think that’s the best effort at engaging in good faith you can make?
If so, I don’t know what I can say that would help. I hope we can both find the plot again, since neither of us seem to see it in the other person.
Morality is multifaceted and multilevel. If you have a naive form of morality that is just “I do whatever I think is the right thing to do”, you are not coordinating or being moral, you are just selfish.
Coordination is not inherently always good. You can coordinate with one group to more effectively do evil against another. But scalable Good is always built on coordination. If you want to live in a lawful, stable, scalable, just civilization, you will need to coordinate with your civilization and neighbors and make compromises.
As a citizen of a modern country, you are bound by the social contract. Part of the social contract is “individuals are not allowed to use violence against other individuals, except in certain circumstances like self defense.” [1] Now you might argue that this is a bad contract or whatever, but it is the contract we play by (at least in the countries I have lived in), and I think unilaterally reneging on that contract is immoral. Unilaterally saying “I will expose all of my neighbors to risk of death from AGI because I think I’m a good person” is very different from “we all voted and the majority decided building AGI is a risk worth taking.”
Now, could it be that you in some exceptional circumstances need to do something immoral to prevent some even greater tragedy? Sure, it can happen. Murder is bad, but self defense can make it on net ok. But just because it’s self defense doesn’t make murder moral, it just means there was an exception in this case. War is bad, but sometimes countries need to go to war. That doesn’t mean war isn’t bad.
Civilization is all about commitments, and honoring them. If you can’t honor your commitments to your civilization, even when you disagree with them sometimes, you are not civilized and are flagrantly advertising your defection. If everyone does this, we lose civilization.
Morality is actually hard, and scalable morality/civilization is much, much harder. If an outcome you dislike happened because of some kind of consensus, this has moral implications. If someone put up a shitty statue that you hate in the town square because he’s an asshole, that’s very different morally from “everyone in the village voted, and they like the statue and you don’t, so suck it up.” If you think “many other people want X and I want not X” has no moral implications whatsoever, your “morality” is just selfishness.[2]
Hi, as I was tagged here, I will respond to a few points. There are a bunch of smaller points only hinted at that I won’t address. In general, I strongly disagree with the overall conclusion of this post.
There are two main points I would like to address in particular:
There seems to be a deep underlying confusion here that in some sense more information is inherently more good, or inherently will result in good things winning out. This is very much the opposite of what I generally claim about memetics. Saying that all information is good is like saying all organic molecules or cells are equally good. No! Adding more biosludge and toxic algal blooms to your rosegarden won’t make it better!
Social media is the exact living proof of this. People genuinely thought social media would bring everyone together, resolve conflicts, create a globally unified culture and peace and democracy, that autocracy and bigotry couldn’t possibly thrive if only people had enough information. I consider this hypothesis thoroughly invalidated. “Increasing memetic evolutionary pressure” is not a good thing! (all things equal)
Increasing the evolutionary pressure on the flu virus doesn’t make the world better, and viruses mutate a lot faster than nice fluffy mammals. Most mutations in fluffy mammals kill them, while mutations in viruses help them far more. Value is fragile. It is asymmetrically easier to destroy than to create.
Raw evolution selects for fitness/reproduction, not Goodness. You are just feeding the Great Replicator.
For an accessible intro to some of this, I recommend the book “Nexus” by Yuval Harari. (not that I endorse everything in that book, but the first half is great)
You talk about theories of change of the form “we safety people will keep everything secret and create an aligned AI, ship it to big labs and save the world before they destroy it (or directly use the AI to stop them)”. I don’t endorse, and in fact strongly condemn, such theories of change.
But not because of the hiding information part, but because of the “we will not coordinate with others and will use violence unilaterally” part! Such theories of change are fundamentally immoral for the same reasons labs building AGI is immoral. We have a norm in our civilization that we don’t as private citizens threaten to harm or greatly upend the lives of our fellow civilians without either their consent or societal/governmental/democratic authority.
The not sharing information part is fine! Not all information is good! For example, Canadian researchers a while back figured out how to reconstruct an extinct form of smallpox, and then published how to do it. Is it a good thing for the world to have that information out there? I don’t think so. Should we open source the blueprints of the F-35 fighter jet? I don’t think so, I think it’s good that I don’t have those blueprints!
Information is not inherently good! Not sharing information that would make the world worse is virtuous. Now, you might be wrong about the effects of sharing the information you have, sure, but claiming there is no tradeoff or the possibility that sharing might actually, genuinely, be bad, is just ignoring why coordination is hard.
If you ever find yourself thinking something of the shape “we must simply unreservedly increase [conceptually simple variable X], with no tradeoffs”, you’re wrong. Doesn’t matter how clever you think X is, you’re wrong. Any real-life, non-fake complex thing is made of towers upon towers of tradeoffs. If you think there are no tradeoffs in whatever system you are looking at, you don’t understand the system.
Memes are not our friends. Conspiracy theories and lies spread faster than complex, nuanced truth. The printing press didn’t bring the scientific revolution, it brought the witch burnings and the Thirty Years’ War. The scientific revolution came from the Royal Society and its nuanced, patient, complex norms of critical inquiry. Yes, spreading your scientific papers was also important, but it was necessary, not sufficient, for a good outcome.
More mutation/evolution, all things equal, means more cancer, not more health and beauty. Health and beauty can come from cancerous mutation and selection, but it’s not a pretty process, and requires a lot of bloody, bloody trial and error (and a good selection function). The kind of inefficient and morally abominable process I would prefer us not relying on.
With that being said, I think it’s good that you wrote things down and are thinking about them, please don’t take what I’m saying as some kind of personal disparagement, I wish more people wrote down their ideas and tried to think things through! I think there are indeed a lot of valuable things in this direction, around better norms, tools, processes and memetic growth, but they’re just really quite non-trivial! You’re on your way to thinking critically about morality, coordination and epistemology, which is great! That’s where I think real solutions are!
Nice set of concepts, I might use these in my thinking, thanks!
I don’t understand what point you are trying to make, to be honest. There are certain problems that humans/I care about that we/I want NNs to solve, and some optimizers (e.g. Adam) solve those problems better or more tractably than others (e.g. SGD or second order methods). You can claim that the “set of problems humans care about” is “arbitrary”, to which I would reply “sure?”
Similarly, I want “good” “philosophy” to be “better” at “solving” “problems I care about.” If you want to use other words for this, my answer is again “sure?” I think this is a good use of the word “philosophy” that gets better at what people actually want out of it, but I’m not gonna die on this hill because of an abstract semantic disagreement.
“good” always refers to idiosyncratic opinions, I don’t really take moral realism particularly seriously. I think there is “good” philosophy in the same way there are “good” optimization algorithms for neural networks, while also I assume there is no one optimizer that “solves” all neural network problems.
I strongly disagree and do not think that will be how AGI will look, AGI isn’t magic. But this is a crux and I might be wrong of course.
I can’t rehash my entire views on coordination and policy here I’m afraid, but in general, I believe we are currently on a double exponential timeline (though I wouldn’t model it quite like you, but the conclusions are similar enough) and I think some simple to understand and straightforwardly implementable policy (in particular, compute caps) at least will move us to a single exponential timeline.
I’m not sure we can get policy that can stop the single exponential (which is software improvements), but there are some ways, and at least we will then have additional time to work on compounding solutions.
Sure, it’s not a full solution, it just buys us some time, but I think it would be a non-trivial amount, and let not perfect be the enemy of good and what not.
I see regulation as the most likely (and most accessible) avenue that can buy us significant time. The fmpov obvious first step is to put compute caps in place: make it illegal to do training runs above a certain FLOP level. Other possibilities are strict liability for model developers (developers, not just deployers or users, are held criminally liable for any damage caused by their models), global moratoria, “CERN for AI” and similar. Generally, I endorse the proposals here.
None of these are easy, of course, there is a reason my p(doom) is high.
But what happens if AI deception then gets solved relatively quickly (or someone comes up with a proposed solution that looks good enough to decision makers)? And this is another way that working on alignment could be harmful from my perspective...
Of course if a solution merely looks good, that will indeed be really bad, but that’s the challenge of crafting and enforcing sensible regulation.
I’m not sure I understand why it would be bad if it actually is a solution. If we do, great, p(doom) drops because now we are much closer to making aligned systems that can help us grow the economy, do science, stabilize society etc. Though of course this moves us into a “misuse risk” paradigm, which is also extremely dangerous.
In my view, this is just how things are, there are no good timelines that don’t route through a dangerous misuse period that we have to somehow coordinate well enough to survive. p(doom) might be lower than before, but not by that much, in my view, alas.
I think this is not an unreasonable position, yes. I expect the best way to achieve this would be to make global coordination and epistemology better/more coherent...which is bottlenecked by us running out of time, hence why I think the pragmatic strategic choice is to try to buy us more time.
One of the ways I can see a “slow takeoff/alignment by default” world still going bad is that in the run-up to takeoff, pseudo-AGIs are used to hypercharge memetic warfare/mutation load to a degree that basically every living human is functionally insane, and then even an aligned AGI can’t (and wouldn’t want to) “undo” that.
Hard for me to make sense of this. What philosophical questions do you think you’ll get clarity on by doing this? What are some examples of people successfully doing this in the past?
The fact you ask this question is interesting to me, because in my view the opposite question is the more natural one to ask: What kind of questions can you make progress on without constant grounding and dialogue with reality? This is the default of how we humans build knowledge and solve hard new questions. The places where we do best and get least drawn astray are exactly those areas where we can have as much feedback from reality in as tight loops as possible, and so if we are trying to tackle ever more lofty problems, it becomes ever more important to get exactly that feedback wherever we can get it! From my point of view, this is the default of successful human epistemology, and the exception should be viewed with suspicion.
And for what it’s worth, acting in the real world, building a company, raising money, debating people live, building technology, making friends (and enemies), absolutely helped me become far, far less confused, and far more capable of tackling confusing problems! Actually testing my epistemology and rationality against reality, and failing (a lot), has been far more helpful for deconfusing everything from practical decision making skills to my own values than reading/thinking could have ever been in the same time span. There is value in reading and thinking, of course, but I was in a severe “thinking overhang”, and I needed to act in the world to keep learning and improving. I think most people (especially on LW) are in an “action underhang.”
“Why do people do things?” is an empirical question, it’s a thing that exists in external reality, and you need to interact with it to learn more about it. And if you want to tackle even higher level problems, you need to have even more refined feedback. When a physicist wants to understand the fundamentals of reality, they need to set up insane crazy particle accelerators and space telescopes and supercomputers and what not to squeeze bits of evidence out of reality and actually ground whatever theoretical musings they may have been thinking about. So if you want to understand the fundamentals of philosophy and the human condition, by default I expect you are going to need to do the equivalent kind of “squeezing bits out of reality”, by doing hard things such as creating institutions, building novel technology, persuading people, etc. “Building a company” is just one common example of a task that forces you to interact a lot with reality in order to do it well.
Fundamentally, I believe that good philosophy should make you stronger and allow you to make the world better, otherwise, why are you bothering? If you actually “solve metaphilosophy”, I think the way this should end up looking is that you can now do crazy things. You can figure out new forms of science crazy fast, you can persuade billionaires to support you, you can build monumental organizations that last for generations. Or, in reverse, I expect that if you develop methods to do such impressive feats, you will necessarily have to learn deep truths about reality and the human condition, and acquire the skills you will need to tackle a task as heroic as “solving metaphilosophy.”
Everyone dying isn’t the worst thing that could happen. I think from a selfish perspective, I’m personally a bit more scared of surviving into a dystopia powered by ASI that is aligned in some narrow technical sense. Less sure from an altruistic/impartial perspective, but it seems at least plausible that building an aligned AI without making sure that the future human-AI civilization is “safe” is not a good thing to do.
I think this grounds out into object level disagreements about how we expect the future to go, probably. I think s-risks are extremely unlikely at the moment, and when I look at how best to avoid them, most such timelines don’t go through “figure out something like metaphilosophy”, but more likely through “just apply bog standard decent humanist deontological values and it’s good enough.” A lot of the s-risk in my view comes from the penchant for maximizing “good” that utilitarianism tends to promote, if we instead aim for “good enough” (which is what most people tend to instinctively favor), that cuts off most of the s-risk (though not all).
To get to the really good timelines, that route through “solve metaphilosophy”, there are mandatory previous nodes such as “don’t go extinct in 5 years.” Buying ourselves more time is powerful optionality, not just for concrete technical work, but also for improving philosophy, human epistemology/rationality, etc.
I don’t think I see a short path to communicating the parts of my model that would be most persuasive to you here (if you’re up for a call or irl discussion sometime lmk), but in short I think of policy, coordination, civilizational epistemology, institution building and metaphilosophy as closely linked and tractable problems, if only it wasn’t the case that there was a small handful of AI labs (largely supported/initiated by EA/LW-types) that are deadset on burning the commons as fast as humanly possible. If we had a few more years/decades, I think we could actually make tangible and compounding progress on these problems.
I would say that better philosophy/arguments around questions like this is a bottleneck. One reason for my interest in metaphilosophy that I didn’t mention in the OP is that studying it seems least likely to cause harm or make things worse, compared to any other AI related topics I can work on. (I started thinking this as early as 2012.) Given how much harm people have done in the name of good, maybe we should all take “first do no harm” much more seriously?
I actually respect this reasoning. I disagree strategically, but I think this is a very morally defensible position to hold, unlike the mental acrobatics necessary to work at the x-risk factories because you want to be “in the room”.
Which also represents an opportunity...
It does! If I was you, and I wanted to push forward work like this, the first thing I would do is build a company/institution! It will both test your mettle against reality and allow you to build a compounding force.
Is it actually that weird? Do you have any stories of trying to talk about it with someone and having that backfire on you?
Yup, absolutely. If you take even a microstep outside of the EA/rat-sphere, these kinds of topics quickly become utterly alien to anyone. Try explaining to a politician worried about job loss, or a middle aged housewife worried about her future pension, or a young high school dropout unable to afford housing, that actually we should be worried about whether we are doing metaphilosophy correctly to ensure that future immortal superintelligences reason correctly about acausal alien gods from math-space so they don’t cause them to torture trillions of simulated souls! This is exaggerated for comedic effect, but this is really what even relatively intro level LW philosophy by default often sounds like to many people!
As the saying goes, “Grub first, then ethics.” (though I would go further and say that people’s instinctive rejection of what I would less charitably call “galaxy brain thinking” is actually often well calibrated)
As someone that does think about a lot of the things you care about at least some of the time (and does care pretty deeply), I can speak for myself why I don’t talk about these things too much:
Epistemic problems:
Mostly, the concept of “metaphilosophy” is so hopelessly broad that you kinda reach it by definition by thinking about any problem hard enough. This isn’t a good thing: when you have a category so large it contains everything (not saying this applies to you, but it applies to many other people I have met who talked about metaphilosophy), it usually means you are confused.
Relatedly, philosophy is incredibly ungrounded and epistemologically fraught. It is extremely hard to think about these topics in ways that actually eventually cash out into something tangible, rather than nerdsniping young smart people forever (or until they run out of funding).
Further on that, it is my belief that good philosophy should make you stronger, and this means that fmpov a lot of the work that would be most impactful for making progress on metaphilosophy does not look like (academic) philosophy, and looks more like “build effective institutions and learn interactively why this is hard” and “get better at many scientific/engineering disciplines and build working epistemology to learn faster”. Humans are really, really bad at doing long chains of abstract reasoning without regular contact with reality, so in practice imo good philosophy has to have feedback loops with reality, otherwise you will get confused. I might be totally wrong, but I expect at this moment in time me building a company is going to help me deconfuse a lot of things about philosophy more than me thinking about it really hard in isolation would.
It is not clear to me that there even is an actual problem to solve here. Similar to e.g. consciousness, it’s not clear to me that people who use the word “metaphilosophy” are actually pointing to anything coherent in the territory at all, or even if they are, that it is a unique thing. It seems plausible that there is no such thing as “correct” metaphilosophy, and humans are just making up random stuff based on our priors and environment and that’s it, and there is no “right way” to do philosophy, similar to how there are no “right preferences”. I know the other view ofc, and it’s still worth engaging with in case there is something deep and universal to be found (the same way we found that there are actually deep equivalencies and “correct” ways to think about e.g. computation).
Practical problems:
I have short timelines and think we will be dead if we don’t make very rapid progress on extremely urgent practical problems like government regulation and AI safety. Metaphilosophy falls into the unfortunate bucket of “important, but not (as) urgent” in my view.
There are no good institutions, norms, groups, funding etc to do this kind of work.
It’s weird. I happen to have a very deep interest in the topic, but it costs you weirdness points to push an idea like this when you could instead be advocating more efficiently for more pragmatic work.
It was interesting to read about your successive jumps up the meta hierarchy, because I had a similar path, but then I “jumped back down” when I realized that most of the higher levels are kinda just abstract, confusing nonsense, that even really “philosophically concerned” communities like EA routinely fail basic morality such as “don’t work at organizations accelerating existential risk”, and that we are by no means currently bottlenecked by not having reflectively consistent theories of anthropic selection or whatever. I would like to get to a world where we have bottlenecks like that, but we are so, so far away from a world where that kind of stuff is why the world goes bad that it’s hard to justify more than some late night/weekend thought on the topic in between a more direct bottleneck focused approach.
All that being said, I still am glad some people like you exist, and if I could make your work go faster, I would love to do so. I wish I could live in a world where I could justify working with you on these problems full time, but I don’t think I can convince myself this is actually the most impactful thing I could be doing at this moment.
Yep, you see the problem! It’s tempting to think of an AI as “just the model”, and study that in isolation, but that just won’t be good enough long term.
Thanks for the comment! I agree that we live in a highly suboptimal world, and I do not think we are going to make it, but it’s worth taking our best shot.
I don’t think of the CoEm agenda as “doing AGI right.” (for one, it is not even an agenda for building AGI/ASI, but for bounding ourselves below that) Doing AGI right would involve solving problems like P vs PSPACE, developing vastly deeper understanding of Algorithmic Information Theory, and more advanced formal verification of programs. If I had infinite budget and 200 years, the plan would look very different, and I would feel very secure in humanity’s future.
Alas, I consider CoEm an instance of a wider class of possible alignment plans that I consider the “bare minimum for Science to work.” I generally think any plans more optimistic than this require some other external force of things going well, which might be empirical facts about reality (LLMs are just nice because of some deep pattern in physics) or metaphysics (there is an actual benevolent creator god intervening specifically to make things go well, or Anthropic Selection is afoot). Many of the “this is what we will get, so we have to do this” type arguments just feel like cope to me, rather than first principles thinking of “if my goal is a safe AI system, what is the best plan I can come up with that actually outputs safe AI at the end?”, reactive vs constructive planning. Of course, in the real world, it’s tradeoffs all the way down, and I know this. You can read some of my thoughts about why I think alignment is hard and current plans are not on track here.
I don’t consider this agenda to be maximally principled or aesthetically pleasing, quite the opposite, it feels like a grubby engineering compromise that simply has a minimum requirement to actually do science in a non-insane way. There are of course various even more compromising positions, but I think those simply don’t work in the real world. I think the functionalist/teleological/agent based frameworks that are currently being applied to alignment work on LW are just too confused to ever really work in the real world, the same way how I think that the models of alchemy just can never actually get you to a safe nuclear reactor and you need to at least invent calculus (or hell at least better metallurgy!) and do actual empiricism and stuff.
As for pausing and governance, I think governance is another mandatory ingredient to a good outcome, most of the work there I am involved with happens through ControlAI and their plan “A Narrow Path”. I am under no illusion that these political questions are easy to solve, but I do believe they are possible and necessary to solve, and I have a lot of illegible inside info and experience here that doesn’t fit into a LW comment. If there is no mechanism by which reckless actors are prevented from killing everyone else by building doomsday machines, we die. All the technical alignment research in the world is irrelevant to this point. (And “pivotal acts” are an immoral pipedream)