An open letter to SERI MATS program organisers
(Independent) alignment researchers need to be exceptionally good at philosophy of science
To be effective, independent AI alignment/safety/x-risk researchers should be unusually competent in philosophy of science, epistemology, and research methodology (the latter also imports some strategic considerations, as I’ll illustrate shortly), relative to researchers in other fields.
The reasons for this are:
(1) The field is pre-paradigmatic.
(2) There are a number of epistemic complications with this type of research, many of which are detailed by Adam Shimi here (also see my comment on that post). I’d add, on top of that, that the very concept of risk is poorly understood, and risk science is often conducted poorly (usually because one doesn’t realise one is doing risk science).
(3) The object of research relates to ethics (i.e., philosophy, unless one subscribes to ethical naturalism), so researchers should be able to properly synthesise science and philosophy.
(4) Alignment work within AGI labs tends to be more empirical and iterative, which partially ameliorates some of the epistemic challenges from (2), but independent researchers usually can’t afford this (and pursuing this approach rarely makes strategic sense for them anyway), so these challenges remain unaddressed.
At the same time, I’m frustrated when I see a lot of alignment/safety research on LessWrong that resembles philosophy much more than science, building towers of reasoning on top of intuitions and unprincipled ontologies rather than on more established scientific theories or mechanistic models. This is a dead end and one of the important reasons why this work doesn’t stack. Contrasted with the particular demand for philosophy of science, this situation seems to harbour a huge opportunity for improvement.
Alignment researchers should think hard about their research methodology and strategy
On the subject of research strategy, I also often see lapses, such as (but not limited to):
People who plan to become alignment researchers take ML courses (which are often largely about ML engineering, which doesn’t help unless one plans to go into mechanistic interpretability research), when they should really pay much more attention to cognitive science.
People fail to realise that leading AGI labs, such as OpenAI and Conjecture (I bet DeepMind and Anthropic, too, even though they haven’t publicly stated this), do not plan to align LLMs “once and forever”, but rather to use LLMs to produce novel alignment research, which will almost certainly come packaged with novel AI architectures (or variations on existing proposals, such as LeCun’s H-JEPA). This leads people to pay far too much attention to LLMs and to thinking about their alignment properties. People who are new to the field tend to fall into this trap, but veterans such as LeCun and Pedro Domingos point out that ML paradigms come and go. LeCun publicly predicted that in 5 years nobody will be using autoregressive LLMs anymore, and this prediction doesn’t seem that crazy to me.
Considering the above, and also some other strategic factors, it’s questionable whether doing “manual” mechanistic interpretability makes much sense at this point. I see that in the previous SERI MATS cohort there was a debate about this, titled “How much should be invested in interpretability research?”, but apparently there is no recording. I think that, strategically, only automated and black-box approaches to interpretability make practical sense to develop now. (Addition: Hoagy’s comment reminded me that “manual” mechanistic interpretability is still somewhat important as a means to ground scientific theories of DNNs. But this informs what kind of mechanistic interpretability one should be doing: not just “random” work, on the grounds that “all problems in mechanistic interpretability should be solved sooner or later” (no, at least not manually), but choosing the mechanistic experiments that test particular predictions of this-or-that scientific theory of DNNs or Transformers.)
Independent researchers sometimes seem to think that if they come up with a particularly brilliant idea about alignment, leading AGI labs will adopt it. They don’t realise that this won’t happen no matter what, partly for social reasons and partly because these labs already plan to deploy proto-AGIs to produce and/or check research (perhaps paired with humans) at a superhuman level. So they already don’t trust alignment research produced by “mere humans” that much and would rather wait to check it with their LLM-based proto-AGIs, expecting the risks of that to be manageable and optimal, all things considered, in the current strategic situation. (Note: I may be entirely wrong about this point, but this is my perception.)
What to do about this?
At universities, students and young researchers are expected to learn good philosophy of science primarily “covertly”, by “osmosis” from their advisors, professors, and peers, plus the standard “Philosophy” or “Philosophy of Science” courses. This “covert” strategy of upskilling in philosophy of science is itself methodologically dubious and strikes me as ineffective: for example, it hinges too much on how good the mentor/advisor themselves is at philosophy of science, which may vary greatly.
In programs like SERI MATS, the “osmosis” strategy of imparting good philosophy of science to people becomes even more problematic because the interaction and cooperation time is limited: in university, a student regularly meets with their advisor and other professors for several years, not several months.
Therefore, direct teaching via seminars seems necessary.
I surveyed the seminar curriculum of the last cohort, and it barely touches on philosophy of science, epistemology, methodology, or strategy of research. Apart from the debate I mentioned above (“How much should be invested in interpretability research?”), which touches on strategy, there is a seminar on “Technical Writing”, which is about methodology, and that’s it.
I think there should be many more seminars about philosophy of science, methodology of science, and strategy of research in the program. Perhaps not at the expense of other seminar content, but by increasing the number of seminar hours. Talks about ethics, and in particular about ethical naturalism (such as Oliver Scott Curry’s “Morality as Cooperation”), or the interaction of ethics and science more generally, also seem essential.
My own views on philosophy of science and methodology of research in application to AI safety are expressed in the post “A multi-disciplinary view on AI safety research”. It’s similar in spirit to Dan Hendrycks’ “Interdisciplinary AI Safety”, based on the “Pragmatic AI Safety” agenda, but there are also some differences that I haven’t published yet, and detailing them is beyond the scope of this letter.
It seems that the Santa Fe Institute is institutionally interested in AI and hosts a lot of relevant seminars itself, so I imagine it would be interested in collaborating with SERI MATS on this. They are also good at philosophy of science.
Epistemic status: hasty, first pass
First of all, thanks for writing this.
I think this letter is “just wrong” in a number of frustrating ways.
A few points:
“Engineering doesn’t help unless one wants to do mechanistic interpretability.” This seems incredibly wrong. Engineering disciplines provide reasonable intuitions for how to reason about complex systems. Almost all engineering disciplines require their practitioners to think concretely. Software engineering in particular also lets you run experiments incredibly quickly, which makes it harder to be wrong.
ML theory in particular is in fact useful for reasoning about minds. This is not to say that cognitive science is not also useful. Further, being able to solve alignment in the current paradigm would mean we have excellent practice when encountering future paradigms.
It seems ridiculous to me to confidently claim that labs won’t care to implement a solution to alignment.
I think you should’ve spent more time developing some of these obvious cruxes, before implying that SERI MATS should change its behavior based on your conclusions. Implementing these changes would obviously have some costs for SERI MATS, and I suspect that SERI MATS organizers do not share your views on a number of these cruxes.
I should have written “ML engineering” (I think it was not entirely clear from the context; fixed now). Knowing general engineering methodology and the typical challenges in systems engineering for robustness and resilience is, of course, useful, as is having visceral experience of these (e.g., engineering distributed systems, writing bugs into one’s own systems and seeing how they may fail in unexpected ways). But I would claim that learning this through practice, i.e., learning “from one’s own mistakes”, is again inefficient. Smart people learn from others’ mistakes. Just going through some of the materials from here would give alignment researchers much more useful insight than years of hands-on engineering practice[1]. Again, it’s an important qualification that we are talking about what’s effective for theoretical-ish alignment research, not the actual engineering of (AGI) systems!
I don’t argue that ML theory is useless. I argue that going through ML courses that spend too much time on building basic MLP networks or random forests (and understanding the theory of these, though it’s minimal) is ineffective. I personally stay abreast of ML research by following the MLST podcast (e.g., episodes on spiking NNs, deep RL, Domingos on neurosymbolic AI and lots of other topics, and a series of interviews with people at Cohere: Hooker, Lewis, Grefenstette, etc.).
This is not what I wrote. I wrote that they are not planning to “solve alignment once and forever” before deploying the first AGI, which will help them actually develop alignment and other adjacent sciences. This might sound ridiculous to you, but it’s what OpenAI and Conjecture say absolutely directly, and I suspect other labs are thinking about it too, though they don’t say it directly.
I did develop several databases and distributed systems over my 10-year engineering career, and I have also read up on resilience research out of interest, so I know what I’m talking about and can compare.
Short on time. Will respond to last point.
Surely this is because alignment is hard! Surely if alignment researchers really did find the ultimate solution to alignment and present it on a silver platter, the labs would use it.
Also: An explicit part of SERI MATS’ mission is to put alumni in orgs like Redwood and Anthropic AFAICT. (To the extent your post does this,) it’s plausibly a mistake to treat SERI MATS like an independent alignment research incubator.
MATS aims to find and accelerate alignment research talent, including:
Developing scholar research ability through curriculum elements focused on breadth, depth, and originality (the “T-model of research”);
Assisting scholars in producing impactful research through research mentorship, a community of collaborative peers, dedicated 1-1 support, and educational seminars;
Aiding the creation of impactful new alignment organizations (e.g., Jessica Rumbelow’s Leap Labs and Marius Hobbhahn’s Apollo Research);
Preparing scholars for impactful alignment research roles in existing organizations.
Not all alumni will end up in existing alignment research organizations immediately; some return to academia, pursue independent research, or potentially skill-up in industry (to eventually aid alignment research efforts). We generally aim to find talent with existing research ability and empower it to work on alignment, not necessarily through existing initiatives (though we certainly endorse many).
Yes, admittedly, there is much less pressure to be very good at philosophy of science if you are going to work within a team with a clear agenda, particularly within an AGI lab, where the research agendas tend to be much more empirical than in “academic” orgs like MIRI or ARC. And thinking about research strategy is not the job of non-leading researchers at these orgs either, whereas independent researchers or researchers at more boutique labs have to think about their strategies by themselves. Founders of new orgs and labs have to think about their strategies very hard, too.
But preparing employees for OpenAI, Anthropic, or DeepMind is clearly not the singular focus of SERI MATS.
MATS is currently focused on developing scholars as per the “T-model of research” with three main levers:
Mentorship (weekly meetings + mentor-specific curriculum);
Curriculum (seminars + workshops + Alignment 201 + topic study groups);
Scholar support (1-1s for unblocking + research strategy).
The “T-model of research” is:
Breadth-first search (literature reviews, building a “toolbox” of knowledge, noticing gaps);
Depth-first search (forming testable hypotheses, executing research, recursing appropriately, using checkpoints);
Originality (identifying threat models, backchaining to local search, applying builder/breaker methodology, babble and prune, “infinite-compute/time” style problem decompositions, etc.).
In the Winter 2022-23 Cohort, we ran several research strategy workshops (focusing on problem decomposition and strategy identification) and had dedicated scholar support staff who offered regular, airgapped 1-1 support for research strategy and researcher unblocking. We will publish some findings from our Scholar Support Post-Mortem soon. We plan to run further research strategy and “originality” workshops in the Summer 2023 Cohort and are currently reviewing our curriculum from the ground up.
Currently, we are not focused on “ethical naturalism” or similar as a curriculum element as it does not seem broadly useful for our cohort compared to project-specific, scholar-driven self-education. We are potentially open to hosting a seminar on this topic.
In terms of pointing program members to a list of scientific disciplines relevant to AI safety (corresponding to the “Agency and Alignment ‘Major’” section in the linked post), I’d propose also pointing to this list, which is much fuller and has many references to relevant recent literature.
I’m somewhat against maximising originality specifically in the context of alignment theories (the position Adam Shimi has put as: “We need far more conceptual AI alignment research approaches than we have now if we want to increase our chances to solve the alignment problem.”). I argue that the optimal number of conceptual approaches to alignment developed by the industry, given the resource constraints and short timelines, is probably 5-7, or even fewer.
This doesn’t apply to the aspects of ensuring that the civilisational AI transition goes well other than alignment, such as anomaly detection, resilience, information security implications, control and curation of the memetic space, post-AGI governance (of everything), etc., as well as the theories of change that you cover in another comment. For these things, about as much originality as possible seems good.
This is good to hear. Sounds to me like you were already doing most of what I’m suggesting, though this was not visible from the outside. Looking forward to reading the post-mortem.
I still think it would be very valuable if you invited researchers like Thomas Metzinger or Andy Clark to give talks on the philosophy of cognitive science and alignment, prompting them with a question like “What do you think alignment/AI safety researchers should understand about philosophy of (cognitive) science?”
Thank you for recommending your study guide; it looks quite interesting.
MATS does not endorse “maximizing originality” in our curriculum. We believe that good original research in AI safety comes from a combination of broad interdisciplinary knowledge, deep technical knowledge, and strong epistemological investigation, which is why we emphasize all three. I’m a bit confused by your reference to Adam’s post. I interpret his post as advocating for more originality, not less, in terms of diverse alignment research agendas.
I think that some of the examples you gave of “non-alignment” research areas are potentially useful subproblems for what I term “impact alignment.” For example, strong anomaly detection (e.g., via mechanistic interpretability, OOD warning lights, or RAT-style acceptability verification) can help ensure “inner alignment” (e.g., through assessing corrigibility or myopia) and infosec/governance/meme-curation can help ensure “outer alignment,” “paying the alignment tax,” and mitigating “vulnerable world” situations with stolen/open sourced weaponizable models. I think this inner/outer distinction is a useful framing, though not the only way to carve reality at the joints, of course.
I think Metzinger or Clark could give interesting seminars, though I’m generally worried about encouraging the anthropomorphism of AGI. I like the “shoggoth” or “alien god” memetic framings of AGI, as these (while wrong) permit a superset of human-like behavior without restricting assumptions of model internals to (unlikely and optimistic, imo) human-like cognition. In this vein, I particularly like Steve Byrnes’ research as I feel it doesn’t overly anthropomorphize AGI and encourages the “competent psychopath with alien goals” memetic framing. I’m intrigued by this suggestion, however. How do you think Metzinger or Clark would specifically benefit our scholars?
(Note: I’ve tried to differentiate between MATS’ organizational position and mine by using “I” or “we” when appropriate.)
The quote by Shimi exemplifies the position with which I disagree. I believe Yudkowsky and Soares were also stating something along these lines previously, but I couldn’t find suitable quotes. I don’t know if any of these three people still hold this position, though.
I heard Metzinger reflecting on the passage of paradigms in cognitive science (here), so I think he would generally endorse the idea I expressed here (I don’t claim that I understand his philosophy perfectly, let alone improve on it; if I understood his exact version of this philosophy and how exactly it differs from mine, I would likely change mine).
For Clark, I’m not sure what he would actually answer to a prompt like “What do you think are the blind spots or thinking biases of young AI (alignment, safety) researchers; what would you wish them to understand?”. Maybe his answer would be “I think their views are reasonable; I don’t see particular blind spots”. But maybe not. This is an open-ended question that I cannot answer myself, because I am probably susceptible to exactly the blind spots or biases he might see from the “outside”.
Alignment research obviously needs to synthesise a deep understanding and steering of AIs with a deep understanding and steering (aka prompt engineering) of humans: the principles of human cognition and the nature of human values. So being exposed even to a rather “anthropocentric” philosophy of human cognition/values would be insightful and valuable for alignment researchers (as long as it is anthropocentric in a sound way, of course). Yet I think Clark’s philosophy is certainly more substrate- and mechanism-independent.
I don’t disagree with Shimi as strongly as you do. I think there’s some chance we need radically new paradigms of aligning AI than “build alignment MVPs via RLHF + adversarial training + scalable oversight + regularizers + transparency tools + mulligans.”
While I do endorse some anthropocentric “value-loading”-based alignment strategies in my portfolio, such as Shard Theory and Steve Byrnes’ research, I worry about overly investing in anthropocentric AGI alignment strategies. I don’t necessarily think that RLHF shapes GPT-N in a manner similar to how natural selection and related processes shaped humans to be altruistic. I think it’s quite likely that the kind of cognition that GPT-N learns to predict tokens is more akin to an “alien god” than it is to human cognition. I think that trying to value-load an alien god is pretty hard.
In general, I don’t highly endorse the framing of alignment as “making AIs more human.” I think this kind of approach fails in some worlds and might produce models that are not performance-competitive enough to outcompete the unaligned models others deploy. I’d rather produce corrigible models with superhuman cognition coupled with robust democratic institutions. Nevertheless, I endorse at least some research along this line, but this is not the majority of my portfolio.
Seems to me that you misunderstood my position.
Let’s assume that GPT-5 will hit the AGI recursive self-improvement bar, so that if it is bundled in an AutoGPT-style loop and tasked with a bold goal such as “solve a big challenge of humanity, such as global hunger” or “make a billion-dollar business”, it could meaningfully achieve a lot via instrumental convergence and actual agency, including self-improvement, at which point it is effectively out of control.
I think that “build alignment MVPs via RLHF + adversarial training + scalable oversight + regularizers + transparency tools + mulligans” won’t be enough to align such an AGI “once and forever”, and that actually tasking a thus-“aligned” AGI with a bold goal won’t end well overall (even if it happens to solve the given task).
But I do think that this will be enough to produce [epistemology, alignment, governance, infosec, infrastructure design, etc.] science safely within AGI labs with a very high probability, > 90%. By “safely producing science” I mean:
(1) Not spinning out of control while doing (largely or exclusively) theoretical science, where any non-theoretical action, such as “let’s run such-and-such simulation with such-and-such programmed actors” or “let’s train such-and-such NN to test such-and-such theory of representation learning”, is supervised by human researchers;
(2) Not being able to sneak “backdoors”, or at least serious self-serving biases, into the produced epistemology, alignment, and governance science that human scientists won’t be able to spot and correct when they review the AI outputs. (Caveat: this probably doesn’t apply to ethics-as-philosophy, as I discussed here, so ethics should be practised as a science as well, hence my references to ethical naturalism.)
Then, I detailed my disagreement with Shimi here. For alignment, we will need to apply multiple frameworks/paradigms simultaneously, and “RLHF + language feedback + debate/critiques + scalable oversight” is just one of these paradigms, which I called “linguistic process theories of alignment” in the linked post. We will need to apply several more paradigms simultaneously and continuously, throughout the training, deployment, and operation phases of AIs, such as Bayesian-style “world model alignment”, LeCun’s “emotional drives”, game-theoretic and mechanism-design-based process approaches (designing the right incentives for AIs during their development and deployment), the (cooperative, inverse) RL paradigm, etc.
Specifically, I don’t think that if we apply all the above-mentioned approaches and a few more (which is, of course, itself not easy at all, as there are serious theoretical and practical issues with most of them, such as computational tractability or too high an alignment tax, but I believe these problems are soluble, and the tax could be reduced by developing smart amortization techniques), we are still “doomed” because there is a glaring theoretical or strategic “hole” that must be closed with some totally new, totally unheard-of or unexpected conceptual approach to alignment. As I described in the post, I think about this in terms of “the percentage of the complexity of human values covered” by a portfolio of approaches, and when this portfolio is bigger than 3-4 approaches, almost all the complexity of value is covered, and it’s not even that important which specific approaches are in the portfolio, as long as they are scientifically sound. That’s why I advocate marshalling most efforts behind existing approaches: it’s not that cooperative inverse RL, RLHF + language feedback, or Active Inference may turn out to be “totally useless”, and that working on maturing these approaches and turning them into practical process theories of alignment may therefore be wasted effort. These approaches will be useful if they are applied, and naturally, approaches that already have some momentum behind them are more likely to be applied.
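As a toy illustration of this intuition (under the admittedly crude assumption that each scientifically sound approach independently covers a fraction $p$ of the complexity of value):

$$\text{coverage}(n) \approx 1 - (1 - p)^n$$

With, say, $p = 0.6$, a portfolio of $n = 4$ approaches already covers about $97\%$, and a fifth approach adds little; the marginal value of genuinely novel paradigms drops off quickly once a few sound ones are in place.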
Dealing with unaligned competition
You mentioned one potential “strategic hole” in the picture that I outlined above: an unaligned AI may have a competitive advantage over an AI aligned with a “portfolio of approaches” (at least the presently existing approaches), so (I assume your argument goes) we need an HRAD-style invention of a “super capable, but also super corrigible” AI architecture. I’m personally doubtful that
alignment approaches actually limit agentic capability,
corrigibility is a useful, coherent concept (as I mentioned in this comment), and
provably-controllable highly reliable agent design is possible in principle: the law of requisite variety could demand that AI be “complex and messy” in order to be effective in the “complex and messy” world.
But I’m not confident about any of these three propositions, so I should leave about a 10% probability that all three are false and hence that your strategic concern is valid.
However, I think we should address this strategic concern by rewiring the economic and action landscapes (which also interacts with the “game-theoretic, mechanism-design” alignment paradigm mentioned above). The current (internet) infrastructure and economic systems are not prepared for the emergence of powerful adversarial agents at all:
There are no systems of trust and authenticity verification at the root of internet communication (see https://trustoverip.org/)
The storage of information is centralised enormously (primarily in the data centres of BigCos such as Google, Meta, etc.)
Money has no trace, so one may earn money in arbitrarily malicious or unlawful ways (i.e., gain instrumental power) and then use it to acquire resources from respectable places, e.g., paying for ML training compute at AWS or Azure and purchasing data from data providers. Formal regulations such as compute and data governance and human-based KYC procedures can only go so far and could probably be circumvented via social engineering by a superhuman imposter or persuader AI.
In essence, we want to design civilisational cooperation systems such that being aligned is a competitive advantage. Cf. “The Gaia Attractor” by Rafael Kaufmann.
This is a very ambitious program to rewire the entirety of the internet, other infrastructure, and the economy, but I believe this must be done anyway; just expecting a “miracle” HRAD invention to be sufficient, without fixing the infrastructure and system-design layers, doesn’t sound like a good strategy. By the way, such infrastructure and economy rewiring is the real “pivotal act”.
I think we agree on a lot more than I realized! In particular, I don’t disagree with your general claims about pathways to HRAD through Alignment MVPs (though I hold some credence that this might not work). Things I disagree with:
I generally disagree with the claim “alignment approaches don’t limit agentic capability.” This was one subject of my independent research before I started directing MATS. Hopefully, I can publish some high-level summaries soon, time permitting! In short, I think “aligning models” generally trades off bits of optimization pressure with “making models performance-competitive,” which makes building aligned models less training-competitive for a given degree of performance.
I generally disagree with the claim “corrigibility is not a useful, coherent concept.” I think there is a (narrow) attractor basin around “corrigibility” in cognition space. Happy to discuss more and possibly update.
I generally disagree with the claim “provably-controllable highly reliable agent design is impossible in principle.” I think it is possible to design recursively self-improving programs that are robust to adversarial inputs, even if this is vanishingly hard in practice (which informs my sense of alignment difficulty only insomuch as I hope we don’t hit that attractor well before CEV value-loading is accomplished). Happy to discuss and possibly update.
I generally disagree with the implicit claim “it’s useful to try aligning AI systems via mechanism design on civilization.” This feels like a vastly clumsier version of trying to shape AGIs via black-box gradient descent. I also don’t think that realistic pre-AGI efficient markets we can build are aligned with human-CEV by default.
I didn’t imply that mechanism/higher-level-system design and game theory are alone sufficient for successful alignment. But as a part of a portfolio, I think it’s indispensable.
Probably, the degree to which a person (let’s say, you or I) buys into the importance of mechanism/higher-level-system design for AI alignment corresponds to where we land on the cognitivism-enactivism spectrum. If you are a hardcore cognitivist, you may think that just designing and training AI “in the right way” would be 100% sufficient. If you are a radical enactivist, you probably endorse the opposite view: that designing incentives is necessary and sufficient, while designing and training AI “in the right way” is futile if the correct incentives are not in place.
I’m personally very uncertain about the cognitivism-enactivism spectrum (as is, apparently, the scientific community: there are completely opposite positions out there held by respectable cognitive scientists), so from the “meta” rationality perspective, I should hedge my bets and make sure to neglect neither the cognitive representations and AI architecture nor the incentives and the environment.
Pre-AGI, I think there won’t be enough collective willingness to upend the economy and institute the “right” structures, anyway. The most realistic path (albeit still with a lot of failure modes) that I see is “alignment MVP tells us (or helps scientists to develop the necessary science) how the society, markets, and governance should really be structured, along with the AI architectures and training procedures for the next-gen AGI, which convinces scientists and decision-makers around the world”.
The difference is that after an alignment MVP, first, the sceptical voice saying “Let’s keep things as they have always been; it’s all hype, there is no intelligence” should definitely cease completely, at least among intelligent people. Second, an alignment MVP should show that ending scarcity is a real possibility, and this should weaken the grip of status quo economic and political incentives. Again, a lot of things could go wrong around this time, but it seems to me this is the path that OpenAI, Conjecture, and probably other AGI labs are aiming for, because they perceive it as the least risky or the only feasible one.
I’m somewhere in the middle of the cognitivist/enactivist spectrum. I think that e.g. relaxed adversarial training is motivated by trying to make an AI robust to arbitrary inputs it will receive in the world before it leaves the box. I’m sympathetic to the belief that this is computationally intractable; however, it feels more achievable than altering the world in the way I imagine would be necessary without it.
I’m not an idealist here: I think that some civilizational inadequacies should be addressed (e.g., better cooperation and commitment mechanisms) concurrent with in-the-box alignment strategies. My main hope is that we can build an in-the-box corrigible AGI that allows in-deployment modification.
FWIW, my sense is that the projects people do are very often taken from Neel’s concrete open problems and are often meant for skilling up.
In traditional sciences, people often spend years learning to think better and use tools on problems that are “toy” and no serious researcher would work on, because this is how you improve and get to that point. I don’t think people should necessarily take years to get there, but it’s worth considering that upskilling is the primary motivation for much work.
If you think specific researchers aren’t testing, or working towards testing, predictions about systems, then it would be useful to discuss this with them and to give specific feedback.
The implication is that problems from Neel’s list are easier than mining interpretability observations that directly test predictions of some of the theories of DNNs or Transformers. I’m not sure if anyone has tried to look directly at this question (also, these are not disjoint lists).
MATS generally thinks that separating the alignment problem into “lowering the alignment tax” via technical research and “convincing others to pay the alignment tax” via governance interventions is a useful framing. There are worlds in which the two are not so cleanly separable, of course, but we believe that making progress toward at least one of these goals is probably useful (particularly if more governance-focused initiatives exist). We also support several mentors whose research crosses this boundary (e.g., Dan Hendrycks, Jesse Clifton, Daniel Kokotajlo).
I think the SERI MATS organisers likely defer to mentors on details of curriculum/research philosophy. It feels like this letter should be addressed to them. I think it’s hard to disagree with “Alignment researchers should think hard about their research methodology and strategy”, but I think you are probably failing to pass the ideological Turing test for various other positions, which will mean these arguments won’t connect.
In general, MATS gives our chosen mentors the final say on certain curriculum elements (e.g., research projects offered, mentoring style, mentor-specific training program elements) but offers a general curriculum (e.g., Alignment 201, seminars, research strategy/tools workshops) that scholars can opt into.
Just on this, I (not part of SERI MATS but working from their office) had a go at a basic ‘make ChatGPT interpret this neuron’ system for the interpretability hackathon over the weekend. (GitHub)
While it’s fun, and managed to find meaningful correlations for 1-2 neurons out of 50, the strongest takeaway for me was the inadequacy of the paradigm ‘what concept does neuron X correspond to’. It’s clear (no surprise, but I’d never had it shoved in my face) that we need a lot of improved theory before we can automate. Maybe AI will automate that theoretical progress, but it feels harder, and further from automation, than learning how to hand off solidly paradigmatic interpretability approaches to AI. ManualMechInterp combined with mathematical theory and toy examples seems like the right mix of strategies to me, tho ManualMechInterp shouldn’t be the largest component imo.
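For concreteness, here’s a minimal sketch of what such a pipeline might look like (an illustrative reconstruction, not the code from the linked repo; it assumes TransformerLens and GPT-2 small, and the layer, neuron index, and example texts are hypothetical placeholders):

```python
# Illustrative sketch of a "make an LLM interpret this neuron" loop.
# Assumes `pip install transformer_lens`; layer/neuron and texts are placeholders.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")

LAYER, NEURON = 5, 1234  # hypothetical MLP neuron to interpret

texts = [
    "The cat sat on the mat.",
    "Stock prices fell sharply on Tuesday.",
    "She whisked the eggs before adding the flour.",
]

# Record the neuron's activation on every token of every text.
records = []
for text in texts:
    tokens = model.to_tokens(text)
    _, cache = model.run_with_cache(tokens)
    acts = cache[f"blocks.{LAYER}.mlp.hook_post"][0, :, NEURON]  # shape: (seq_len,)
    for tok_str, act in zip(model.to_str_tokens(text), acts.tolist()):
        records.append((act, tok_str, text))

# Show the top-activating tokens (with context) to a chat model and ask it to guess the concept.
records.sort(key=lambda r: r[0], reverse=True)
top = records[:10]
prompt = (
    "These tokens strongly activate one MLP neuron in GPT-2 small:\n"
    + "\n".join(f"token={tok!r} (act={act:.2f}) in: {ctx}" for act, tok, ctx in top)
    + "\nWhat concept might this neuron correspond to?"
)
print(prompt)  # send this to ChatGPT (or another LLM) and score its guess on held-out texts
```

The weak point is exactly the framing baked into the final prompt: it presupposes “one neuron = one concept”, which is the assumption that better theory would need to replace.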
FWIW, I agree that learning the history/philosophy of science is a good source of models and healthy experimental thought patterns. I was recommended Hasok Chang’s books (Inventing Temperature, Is Water H2O?) by folks at Conjecture, and I’d heartily recommend them in turn.
I know the SERI MATS technical lead @Joe_Collman spends a lot of his time thinking about how they can improve feedback loops; he might be interested in a chat.
You also might be interested in Mike Webb’s project to set up programs that pass on quality decision-making from top researchers to students, which is being tested on SERI MATS people at the moment.
Yes, I agree that automated interpretability should be based on scientific theories of DNNs, of which there are many already, and which should be woven together with existing mech interp (proto-)theories and empirical observations.
Thanks for the pointers!
We did record this debate but have yet to publish the recording, as the speakers have not yet given approval. MATS believes that interpretability research might have many uses.
Among many other theories of change, MATS currently supports:
Evaluating models for dangerous capabilities, to aid safety standards, moratoriums, and alignment MVPs:
Owain Evans’ evaluations of situational awareness
Ethan Perez’s and Evan Hubinger’s demonstrations of deception and other undesirable traits
Daniel Kokotajlo’s dangerous capabilities evaluations
Dan Hendrycks’ deceptive capabilities benchmarking
Building non-agentic AI systems that can be used to extract useful alignment research:
Evan Hubinger’s conditioning predictive models
Janus’ simulators and Nicholas Kees Dupois’ cyborgism
Interpretability and model editing research that might build into better model editing, better auditing, better regularizers, better DL science, and many other things:
Neel Nanda’s and Conjecture’s mechanistic interpretability research
Alex Turner’s RL interpretability and model editing research
Collin Burns’ “gray-box ELK”
Research that aims to predict or steer architecture-agnostic or emergent systems:
Vivek Hebbar’s research on “sharp left turns,” etc.
John Wentworth’s selection theorems research
Alex Turner’s and Quintin Pope’s shard theory research
Technical alignment-adjacent governance and strategy research:
Jesse Clifton’s and Daniel Kokotajlo’s research on averting AI conflict and AI arms races
Dan Hendrycks’ research on aligning complex systems
If AGI labs truly bet on AI-assisted (or fully AI-automated) science across the domains of science (the second group in your list), then research done in the following three groups will be subsumed by that AI-assisted research.
It’s still important to do some research in these areas, for two reasons:
(1) Hedging bets against some unexpected turn of events, such as AIs failing to improve the speed and depth of generated scientific insight, at least in some areas (perhaps governance & strategy are iffier areas than pure maths or science, where it’s harder to become sure that strategies suggested by AIs are free of ‘deep deceptiveness’-style bias).
(2) When AIs presumably generate all that awesome science, humanity still needs people capable of understanding, evaluating, and finding weaknesses in it.
This, however, suggests a different focus in the latter three groups: growing excellent science evaluators rather than generators (GAN style). More Yudkowskys, able to shut down and poke holes in various plans. Less focus on producing a sheer volume of research and more focus on the ability to criticise others’ and one’s own research. There is overlap, of course, but there are also differences in how researchers should develop if we keep this in mind. Also, credit assignment systems and community authority-inferring mechanisms should recognise this focus.
MATS’ framing is that we are supporting a “diverse portfolio” of research agendas that might “pay off” in different worlds (i.e., your “hedging bets” analogy is accurate). We also think the listed research agendas have some synergy you might have missed. For example, interpretability research might build into better AI-assisted white-box auditing, white/gray-box steering (e.g., via ELK), or safe architecture design (e.g., “retargeting the search”).
The distinction between “evaluator” and “generator” seems fuzzier to me than you portray. For instance, two “generator” AIs might be able to red-team each other for the purposes of evaluating an alignment strategy.