Technical AI safety, in turn, here refers to the subset of AI safety research that takes the current technological paradigm as its chief object of study. Importantly, this does not exclude all theoretical approaches, but does prefer those theoretical approaches which have a strong foundation in experimentation.
I appreciate this clarification, but I think it’s not enough. As the most defensible counterexample, theoretical math is quintessentially technical, whether or not it relates to (non-mental) experimentation. A less defensible but more important counterexample is (careful, speculative, motivated, core) philosophy. An alternative name for what you mean here could be “prosaic”. See e.g. https://www.lesswrong.com/posts/YTq4X6inEudiHkHDF/prosaic-ai-alignment :
“prosaic” AGI, which doesn’t reveal any fundamentally new ideas about the nature of intelligence or turn up any “unknown unknowns.”
If “prosaic” sounds derogatory, another alternative would be “in-/on-paradigm”.
All young people and other newcomers should be made aware that on-paradigm AI safety/alignment—while being more tractable, feedbacked, well-resourced, and populated compared to theory—is also inevitably streetlighting https://en.wikipedia.org/wiki/Streetlight_effect.
All young people and other newcomers should be made aware that on-paradigm AI safety/alignment—while being more tractable, feedbacked, well-resourced, and populated compared to theory—is also inevitably streetlighting https://en.wikipedia.org/wiki/Streetlight_effect.
Half-agree. I think there’s scope within fields like interp to focus on things that are closer to the hard part of the problem or at least touch on robust bottlenecks for alignment agendas (eg: ontology identification). I do think there is a lot of diversity in people working in these more legible areas and that means there are now many people who haven’t engaged with or understood the alignment problem well enough to realise where we might be suffering from the street light effect.
Object level: ontology identification, in the sense that is studied empirically, is pretty useless. It streetlights on recognizable things, and AFAIK isn’t trying to avoid, for example, the Doppelgänger problem or to at all handle diasystemic novelty or the ex quo of a mind’s creativity. [ETA: actually ELK I think addresses the Doppelgänger problem in its problem statement, if not in any proposed solutions.]
Meta:
I think there’s scope within fields like interp to focus on things that are closer to the hard part of the problem or at least touch on robust bottlenecks for alignment agendas (eg: ontology identification).
You hedged your statement so much that it became true and also not very relevant. Here are the hedges:
“scope”: some research could be interpreted as trying to get to some other research, or as having a mission statement that includes some other research
“within field[s]”: some people / some research—or maybe no actual people or research, but possible research that would fit with the genre of the field
“closer to”: but maybe not close to, in an absolute sense
“or at least touch on”: if an academic philosopher wrote this about their work, you’d immediately recognize it as cope
“alignment agendas”: there aren’t any alignment agendas. There are alignment agendas in the sense that “we can start a colony around Proxima Centauri in the following way: 1. make a go-really-fast-er. 2. use the go-really-fast-er to go really fast towards Proxima Centauri” is an agenda to get to Proxima Centauri. If you make no mention of the part where you have to also slow down, and the part about steering, and the part where you have to shield from cosmic rays, and make a self-sustaining habitat on the ship, and the part about whether any of the planets around Proxima Centauri are remotely habitable… is this really an agenda?
Object level: ontology identification, in the sense that is studied empirically, is pretty useless. It streetlights on recognizable things, and AFAIK isn’t trying to avoid, for example, the Doppelgänger problem
I haven’t seen anyone do such interpretability research yet, but I see no particular reason to think this is the sort of thing that can’t be studied empirically, rather than merely the sort of thing that hasn’t been studied empirically. We have, for example, vision transformers and language transformers. I would be very surprised if there was a pure 1:1 mapping between the learned features in those two types of transformer models.
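(To gesture at how one might start poking at that empirically: here is a minimal sketch of measuring whether the features of a vision transformer and a language transformer line up, assuming you could collect activations from both models on the same set of paired concepts. The activation matrices below are synthetic placeholders, and linear CKA is just one of several similarity measures one could reach for; none of this is a claim about what the answer would be.)

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices with paired rows.

    X: (n_concepts, d_vision), Y: (n_concepts, d_language).
    Equals 1.0 when the two representations match up to rotation and scaling.
    """
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return cross / norm

# Synthetic placeholders standing in for "activations each model assigns to the
# same 500 concepts" (e.g. a picture of a dog vs. a sentence mentioning a dog).
rng = np.random.default_rng(0)
shared = rng.normal(size=(500, 32))  # hypothetical structure common to both modalities
vision_acts = np.hstack([shared, rng.normal(size=(500, 96))])
language_acts = np.hstack([shared @ rng.normal(size=(32, 32)), rng.normal(size=(500, 64))])

print(f"CKA(vision, language) = {linear_cka(vision_acts, language_acts):.3f}")
print(f"CKA(vision, vision)   = {linear_cka(vision_acts, vision_acts):.3f}")
```

A cross-modality score that stays well below the within-model score, even after trying learned translations between the two feature spaces, would be one concrete way to cash out the “no pure 1:1 mapping” expectation.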
Well, empirically, when people try to study it empirically, instead they do something else. Surely that’s empirical evidence that it can’t be studied empirically? (I’m a little bit trolling but also not.)
I’d say mechanistic interpretability is trending toward a field which cares about & researches the problems you mention. For example, the doppelganger problem is a fairly standard criticism of the sparse autoencoder work; diasystemic novelty seems the kind of thing you’d encounter when doing developmental interpretability, interp-through-time, or inductive biases research, especially with a focus on phase changes (a growing focus area); and though I’m having a hard time parsing your creativity post (an indictment of me, not of you, as I didn’t spend too long with it), it seems the kind of thing which would come from the study of in-context learning, a goal that mainstream MI I believe has, even if it doesn’t focus on it now (likely because it believes it’s unable to at this moment), and which I think it will care more about as the power of such in-context learning becomes more and more apparent.
ETA: An argument could be that though these problems will come up, ultimately the field will prioritize hacky fixes in order to deal with them, which only sweep the problems under the rug. I think many in MI will prioritize such limited fixes, but also that some won’t, and due to the benefits of such problems becoming empirical, such people will be able to prove the value of their theoretical work & methodology by convincing MI people with their practical applications, and money will get diverted to such theoretical work & methodology by DL-theory-traumatized grantmakers.
the doppelganger problem is a fairly standard criticism of the sparse autoencoder work,
And what’s the response to the criticism, or the hoped-for approach?
diasystemic novelty seems the kind of thing you’d encounter when doing developmental interpretability, interp-through-time
Yeah, this makes sense. And hey, maybe it will lead to good stuff. Any results so far, that I might consider approaching some core alignment difficulties?
it seems the kind of thing which would come from the study of in-context learning, a goal that mainstream MI I believe has, even if it doesn’t focus on it now (likely because it believes it’s unable to at this moment), and which I think it will care more about as the power of such in-context learning becomes more and more apparent.
Also makes some sense (though the ex quo, insofar as we even want to attribute this to current systems, is distributed across the training algorithms and the architecture sources, as well as inference-time stuff).
Generally what you’re bringing up sounds like “yes these are problems and MI would like to think about them… later”. Which is understandable, but yeah, that’s what streetlighting looks like.
Maybe an implicit justification of current work is like:
There’s these more important, more difficult problems. We want to deal with them, but they are too hard right now, so we will try in the future. Right now we’ll deal with simpler things. By dealing with simpler things, we’ll build up knowledge, skills, tools, and surrounding/supporting orientation (e.g. explaining weird phenomena that are actually due to already-understandable stuff, so that later we don’t get distracted). This will make it easier to deal with the hard stuff in the future.
This makes a lot of sense—it’s both empathizandable, and seems probably somewhat true. However:
Again, it still isn’t in fact currently addressing the hard parts. We want to keep straight the difference between [currently addressing] vs. [arguably might address in the future].
We gotta think about what sort of thing would possibly ever work. We gotta think about this now, as much as possible.
A core motivating intuition behind the MI program is (I think) “the stuff is all there, perfectly accessible programmatically, we just have to learn to read it”. This intuition is deeply flawed: Koan: divining alien datastructures from RAM activations
I don’t know of any clear progress on your interests yet. My argument was about the trajectory MI is on, which I think is largely pointed in the right direction. We can argue about the speed at which it gets to the hard problems, whether its fast enough, and how to make it faster though. So you seem to have understood me well.
A core motivating intuition behind the MI program is (I think) “the stuff is all there, perfectly accessible programmatically, we just have to learn to read it”. This intuition is deeply flawed: Koan: divining alien datastructures from RAM activations
I think I’m more agnostic than you are about this, and also about how “deeply” flawed MI’s intuitions are. If you’re right, once the field progresses to nontrivial dynamics, we should expect those operating at a higher level of analysis—conceptual MI—to discover more than those operating at a lower level, right?
If, hypothetically, we were doing MI on minds, then I would predict that MI will pick some low hanging fruit and then hit walls where their methods will stop working, and it will be more difficult to develop new methods that work. The new methods that work will look more and more like reflecting on one’s own thinking, discovering new ways of understanding one’s own thinking, and then going and looking for something like that in the in-vitro mind (IVM). IDK how far that could go. But then this will completely grind to a halt when the IVM is coming up with concepts and ways of thinking that are novel to humanity. Some other approach would be needed to learn new ideas from a mind via MI.
However, another dealbreaker problem with current and current-trajectory MI is that it isn’t studying minds.
I mean my impression is that there are something on the order of 100-1000 people in the world working on ML interpretability as their day job, and maybe 1k-10k people who dabble in their free time. No research in the field will get done unless one of that small number of people makes a specific decision to tackle that particular research question instead of one of the countless other ones they could choose to tackle.
Well, empirically, when people try to study it empirically, instead they do something else
I don’t know that we have any empirical data on what happens when people try to study that particular empirical question (the specific relationship between the features learned by two models of different modalities) because I don’t know that anyone has set out to study that particular question in any serious way.
In other words, I suspect it’s not “when someone starts to study this phenomenon, some mysterious process causes them to study something else instead”. I think it’s “the surface area of the field is large and there aren’t many people in it, so I doubt anyone has even gotten to the part where they start to study this phenomenon.”
Edit: to be even more explicit, what I’m trying to do in this thread is encourage thinking about ways one might collect empirical observations about non-”streetlit” topics. None of the topics are under the streetlight until someone builds the streetlight. “Build a streetlight” is sometimes an available action, but it only happens if someone makes a specific effort to do so.
Edit 2: I misunderstood what point you were making as “prosaic alignment is unlikely to be helpful, look at all of these empirical researchers who have not even answered these basic questions” (which is a perspective I disagree with pretty strongly) rather than “I think empirical research shouldn’t be the only game in town” (which I agree with) and “we should fund outsiders to go do stuff without much interaction with or feedback from the community to hopefully develop new ideas that are not contaminated with the current community biases” (I think this would be worth doing if resources were unlimited, not sure as things actually stand).
As a concrete note, I suspect work that demonstrates that philosophical or mathematical approaches can yield predictions about empirical questions is more likely to be funded. For example, in your post you say
In programming, adding a function definition would be endosystemic; refactoring the code into a functional style rather than an object-oriented style, or vice versa, in a way that reveals underlying structure, is diasystemic novelty.
Could that be operationalized as a prediction of the form
If you train a model on a bunch of simple tasks involving both functional and object-oriented code (e.g. “predict the next token of the codebase”, “predict missing token”, “identify syntax errors”) and then train it on a complex task on only object-oriented code (e.g. “write a document describing how to use this library”), it will fail to navigate that ontological shift and will be unable to document functional code.
(I expect that’s not a correct operationalization but something of that shape)
I think there’s scope within fields like interp to focus on things that are closer to the hard part of the problem or at least touch on robust bottlenecks for alignment agendas
BT:
Object level: ontology identification, in the sense that is studied empirically, is pretty useless.
sname:
I haven’t seen anyone do such interpretability research yet, but I see no particular reason to think this is the sort of thing that can’t be studied empirically, rather than merely the sort of thing that hasn’t been studied empirically.
BT:
Well, empirically, when people try to study it empirically, instead they do something else
sname:
I don’t know that we have any empirical data on what happens when people try to study that particular empirical question (the specific relationship between the features learned by two models of different modalities) because I don’t know that anyone has set out to study that particular question in any serious way.
BT:
ah, sname is talking about conceptual Doppelgängers specifically, as ze indicated in a previous comment that I now understand
When I said “when people try to study it empirically”, what I meant was “when people try to do interpretability research (presumably, that is relevant to the hard part of the problem?)”.
“prosaic alignment is unlikely to be helpful, look at all of these empirical researchers who have not even answered these basic questions”
Right, I’m not saying exactly this. But I am saying:
Prosaic alignment is unlikely to be helpful, look at how they are starting in an extremely streetlighty way(*) and then, empirically, not pushing out into the dark quickly—and furthermore, AFAIK, not very concerned with how they aren’t pushing out into the dark quickly enough, or successfully addressing this at the meta level, though plausibly they’re doing that and I’m just not aware.
(*): studying LLMs, which are not minds; trying to recognize [stuff we mostly conceptually understand] within systems rather than trying to come to conceptually understand [the stuff we’d need to be able to recognize/design in a mind, in order to determine the mind’s effects].
(I think this would be worth doing if resources were unlimited, not sure as things actually stand).
Well, you’ve agreed with a defanged version of my statements. The toothful version, which I do think: Insofar as this is even possible, we should allocate a lot more resources toward funding any high-caliber smart/creative/interesting/promising/motivated youngsters/newcomers who want to take a crack at independently approaching the core difficulties of AGI alignment, even if that means reallocating a lot of resources away from existing on-paradigm research.
Edit: to be even more explicit, what I’m trying to do in this thread is encourage thinking about ways one might collect empirical observations about non-”streetlit” topics. None of the topics are under the streetlight until someone builds the streetlight. “Build a streetlight” is sometimes an available action, but it only happens if someone makes a specific effort to do so.
This seems like a good thing to do. But there’s multiple ways that existing research is streetlit, and reality doesn’t owe it to you to make it be the case that there are nice (tractionful, feasible, interesting, empirical, familiar, non-weird-seeming, feedbacked, grounded, legible, consensusful) paths toward the important stuff. The absence of nice paths would really suck if it’s the case, and it’s hard to see how anyone could be justifiedly really confident that there are no nice paths. But yes, I’m saying that it looks like there aren’t nice paths, or at least there aren’t enough nice paths that we seem likely to find them by continuing to sample from the same distribution we’ve been sampling from; and I have some arguments and reasons supporting this belief, which seem true; and I would guess that a substantial fraction (though not most) of current alignment researchers would agree with a fairly strong version of “very few or no nice paths”.
Could that be operationalized as a prediction of the form
If you train a model on a bunch of simple tasks involving both functional and object-oriented code (e.g. “predict the next token of the codebase”, “predict missing token”, “identify syntax errors”) and then train it on a complex task on only object-oriented code (e.g. “write a document describing how to use this library”), it will fail to navigate that ontological shift and will be unable to document functional code.
I don’t think that’s a good operationalization, as you predict. I think it’s trying to be an operationalization related to my claim above:
ontology identification, in the sense that is studied empirically, is pretty useless. It [..] AFAIK isn’t trying to [...] at all handle diasystemic novelty [...].
But it sort of sounds like you’re trying to extract a prediction about capability generalization or something? Anyway, an interp-like study trying to handle diasystemic novelty might for example try to predict large-scale explicitization events before they happen—maybe in a way that’s robust to “drop out”. E.g. you have a mind that doesn’t explicitly understand Bayesian reasoning; but it is engaging in lots of activities that would naturally induce small-world probabilistic reasoning, e.g. gambling games or predicting-in-distribution simple physical systems; and then your interpreter’s job is to notice, maybe only given access to restricted parts (in time or space, say) of the mind’s internals, that Bayesian reasoning is (implicitly) on the rise in many places. (This is still easy mode if the interpreter gets to understand Bayesian reasoning explicitly beforehand.) I don’t necessarily recommend this sort of study, though; I favor theory.
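(Not that I recommend it, but for concreteness, here is a heavily simplified sketch of what the interpreter’s side of that could look like, under a pile of assumptions: that you can snapshot activations from a restricted slice of the mind at several points in time, that each snapshot is labeled by whether the task at hand rewards probabilistic updating, and that you get to fit a probe somewhere the signature is already explicit, i.e. the easy mode above. The activations are synthetic placeholders with the “rising signature” baked in by hand, so this only illustrates the shape of the measurement, not that it would work on a real mind.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64                                # width of the slice of internals we get to see
signature = np.ones(d) / np.sqrt(d)   # stand-in direction for "Bayesian-updating" structure

def slice_activations(time, n):
    """Synthetic placeholder for activations from a restricted slice of the mind.

    The implicit-updating signature is baked in to grow stronger with `time`,
    which is exactly what a real study would have to discover rather than assume.
    """
    labels = rng.integers(0, 2, size=n)  # 1 = task rewards small-world probabilistic reasoning
    acts = rng.normal(size=(n, d)) + (0.5 * time) * labels[:, None] * signature
    return acts, labels

# Easy mode: fit the probe where the signature is already strong/explicit...
ref_acts, ref_labels = slice_activations(time=4, n=1000)
probe = LogisticRegression(max_iter=1000).fit(ref_acts, ref_labels)

# ...then watch whether the same signature is (implicitly) on the rise earlier on.
for t in range(5):
    acts, labels = slice_activations(time=t, n=400)
    print(f"t={t}: probe accuracy on this slice = {probe.score(acts, labels):.2f}")
```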
When I said “when people try to study it empirically”, what I meant was “when people try to do interpretability research (presumably, that is relevant to the hard part of the problem?)”.
Is there a particular reason you expect there to be exactly one hard part of the problem, and for the part that ends up being hardest in the end to be the part that looks hardest to us now?
Prosaic alignment is unlikely to be helpful, look at how they are starting in an extremely streetlighty way(*) and then, empirically, not pushing out into the dark quickly—and furthermore, AFAIK, not very concerned with how they aren’t pushing out into the dark quickly enough, or successfully addressing this at the meta level, though plausibly they’re doing that and I’m just not aware.
If I were a prosaic alignment researcher, I probably would choose to prioritize which problems I worked on a bit differently than those currently in the field. However, I expect that the research that ends up being the most useful will not be that research which looked most promising before someone started doing it, but rather research that stemmed from someone trying something extremely simple and getting an unexpected result, and going “huh, that’s funny, I should investigate further”. I think that the process of looking at lots of things and trying to get feedback from reality as quickly as possible is promising, even if I don’t have a strong expectation that any one specific one of those things is promising to look at.
But there’s multiple ways that existing research is streetlit, and reality doesn’t owe it to you to make it be the case that there are nice (tractionful, feasible, interesting, empirical, familiar, non-weird-seeming, feedbacked, grounded, legible, consensusful) paths toward the important stuff
Certainly reality doesn’t owe us a path like that, but it would be pretty undignified if reality did in fact give us a path like that and we failed to find it because we didn’t even look.
Anyway, an interp-like study trying to handle diasystemic novelty might for example try to predict large-scale explicitization events before they happen—maybe in a way that’s robust to “drop out”. E.g. you have a mind that doesn’t explicitly understand Bayesian reasoning; but it is engaging in lots of activities that would naturally induce small-world probabilistic reasoning, e.g. gambling games or predicting-in-distribution simple physical systems; and then your interpreter’s job is to notice, maybe only given access to restricted parts (in time or space, say) of the mind’s internals, that Bayesian reasoning is (implicitly) on the rise in many places. (This is still easy mode if the interpreter gets to understand Bayesian reasoning explicitly beforehand.)
Interesting. I would be pretty interested to see research along these lines (although the scope of the above is probably still a bit large for a pilot project).
I don’t necessarily recommend this sort of study, though; I favor theory.
What is your preferred method for getting feedback from reality on whether your theory describes the world as it is?
Is there a particular reason you expect there to be exactly one hard part of the problem,
Have you stopped beating your wife? I say “the” here in the sense of like “the problem of climbing that mountain over there”. If you’re far away, it makes sense to talk about “the (thing over there)”, even if, when you’re up close, there’s multiple routes, multiple summits, multiple sorts of needed equipment, multiple sources of risk, etc.
and for the part that ends up being hardest in the end to be the part that looks hardest to us now?
We make an argument like “any solution would have to address X” or “anything with feature Y does not do Z” or “property W is impossible”, and then we can see what a given piece of research is and is not doing / how it is doomed to irrelevance. It’s not like pointing to a little ball in ideaspace and being like “the answer is somewhere in here”. Rather it’s like cutting out a halfspace and saying “everything on this side of this plane is doomed, we’d have to be somewhere in the other half”, or like pointing out a manifold that all research is on and saying “anything on this manifold is doomed, we’d have to figure out how to move somewhat orthogonalward”.
research that stemmed from someone trying something extremely simple and getting an unexpected result
I agree IF we are looking at the objects in question. If LLMs were minds, the research would be much more relevant. (I don’t care if you have an army of people who all agree on taking a stance that seems to imply that there’s not much relevant difference between LLMs and future AGI systems that might kill everyone.)
What is your preferred method for getting feedback from reality on whether your theory describes the world as it is?
I think you (and everyone else) don’t know how to ask this question properly. For example, “on whether your theory describes the world as it is” is a too-narrow idea of what our thoughts about minds are supposed to be. Sub-example: our thoughts about mind are supposed to also produce design ideas.
To answer your question: by looking at and thinking about minds. The only minds that currently exist are humans, and the best access you have to minds is introspection. (I don’t mean meditation, I mean thinking and also thinking about thinking/wanting/acting—aka some kinds of philosophy and math.)
Is there a particular reason you expect there to be exactly one hard part of the problem,
Have you stopped beating your wife? I say “the” here in the sense of like “the problem of climbing that mountain over there”. If you’re far away, it makes sense to talk about “the (thing over there)”, even if, when you’re up close, there’s multiple routes, multiple summits, multiple sorts of needed equipment, multiple sources of risk, etc.
I think the appropriate analogy is someone trying to strategize about “the hard part of climbing that mountain over there” before they have even reached base camp or seriously attempted to summit any other mountains. There are a bunch of parts that might end up being hard, and one can come up with some reasonable guesses as to what those parts might be, but the bits that look hard from a distance and the bits that end up being hard when you’re on the face of the mountain may be different parts.
We make an argument like “any solution would have to address X” or “anything with feature Y does not do Z” or “property W is impossible”, and then we can see what a given piece of research is and is not doing / how it is doomed to irrelevance
Doomed to irrelevance, or doomed to not being a complete solution in and of itself? The point of a lot of research is to look at a piece of the world and figure out how it ticks. Research to figure out how a piece of the world ticks won’t usually directly allow you to make it tock instead, but can be a useful stepping stone. Concrete example: dictionary learning vs Golden Gate Claude.
I agree IF we are looking at the objects in question. If LLMs were minds, the research would be much more relevant.
I think one significant crux is “to what extent are LLMs doing the same sort of thing that human brains do / the same sorts of things that future, more powerful AIs will do?” It sounds like you think the answer is “they’re completely different and you won’t learn much about one by studying the other”. Is that an accurate characterization?
To answer your question: by looking at and thinking about minds. The only minds that currently exist are humans
Agreed, though with quibbles
and the best access you have to minds is introspection.
In my experience, my brain is a dirty lying liar that lies to me at every opportunity—another crux might be how faithful one expects their memory of their thought processes to be to the actual reality of those thought processes.
Doomed to irrelevance, or doomed to not being a complete solution in and of itself?
Doomed to not be trying to go to and then climb the mountain.
my brain is a dirty lying liar that lies to me at every opportunity
So then it isn’t easy. But it’s feedback. Also there’s not that much distinction between making a philosophically rigorous argument and “doing introspection” in the sense I mean, so if you think the former is feasible, work from there.
Doomed to irrelevance, or doomed to not being a complete solution in and of itself?
Doomed to not be trying to go to and then climb the mountain.
If you think that current mech interp work is currently trying to directly climb the mountain, rather than trying to build and test a set of techniques that might be helpful on a summit attempt, I can see why you’d be frustrated and discouraged at the lack of progress.
Also there’s not that much distinction between making a philosophically rigorous argument and “doing introspection” in the sense I mean, so if you think the former is feasible, work from there.
I don’t have much hope in the former being feasible, though I do support having a nonzero number of people try it because sometimes things I don’t think are feasible end up working.
Look… Consider the hypothetically possible situation that in fact everyone is very far from being on the right track, and everything everyone is doing doesn’t help with the right track and isn’t on track to get on the right track or to help with the right track.
Ok, so I’m telling you that this hypothetically possible situation seems to me like the reality. And then you’re, I don’t know, trying to retreat to some sort of agreeable live-and-let-live stance, or something, where we all just agree that due to model uncertainty and the fact that people have vaguely plausible stories for how their thing might possibly be helpful, everyone should do their own thing and it’s not helpful to try to say that some big swath of research is doomed? If this is what’s happening, then I think that what you in particular are doing here is a bad thing to do.
Maybe we can have a phone call if you’d like to discuss further.
Maybe we can have a phone call if you’d like to discuss further.
I doubt it’s worth it—I’m not a major funder in this space and don’t expect to become one in the near future, and my impression is that there is no imminent danger of you shutting down research that looks promising to me and unpromising to you. As such, I think the discussion ended up getting into the weeds in a way that probably wasn’t a great use of either of our time, and I doubt spending more time on it would change that.
That said, I appreciated your clarity of thought, and in particular your restatement of how the conversation looked to you. I will probably be stealing that technique.
You are, of course, correct in your definitions of “technical” and “prosaic” AI safety. Our interview series did not exclude advocates of theoretical or non-prosaic approaches to AI safety. It was not the intent of this report to ignore talent needs in non-prosaic technical AI safety. We believe that this report summarises our best understanding of the dominant talent needs across all of technical AI safety, at least as expressed by current funders and org leaders.
MATS has supported several theoretical or non-prosaic approaches to improving AI safety, including Vanessa Kosoy’s learning theoretic agenda, Jesse Clifton’s and Caspar Oesterheld’s cooperative AI research, Vivek Hebbar’s empirical agent foundations research, John Wentworth’s selection theorems agenda, and more. We remain supportive of well-scoped agent foundations research, particularly that with tight empirical feedback loops. If you are an experienced agent foundations researcher who wants to mentor, please contact us; this sub-field seems particularly bottlenecked by high-quality mentorship right now.
I have amended our footnote to say:
Technical AI safety, in turn, here refers to the subset of AI safety research that takes current and future technological paradigms as its chief objects of study, rather than governance, policy, or ethics. Importantly, this does not exclude all theoretical approaches, but does in practice prefer those theoretical approaches which have a strong foundation in experimentation. Due to the dominant focus on prosaic AI safety within the current job and funding market, the main focus of this report, we believe there are few opportunities for those pursuing non-prosaic, theoretical AI safety research.
If you disagree with our assessment, please let us know! We would love to hear about more jobs or funding opportunities for non-prosaic AI safety research.
Thanks. Well, now the footnote seems better, but now it contradicts the title. The footnote says that “the main focus of this report” is “the current job and funding market”. This is conflating “the current job and funding market” with “technical AI safety”, given that the title is “Talent Needs in Technical AI Safety”.
Note: I don’t mean to single out you (Ryan) or MATS or this post; I greatly appreciate your work and think it’s good, and don’t think you’re doing something worse than others are doing with regard to framing the field for newcomers. What I’m trying to do here is fight a (rearguard, unfortunately) action against the sweep of [most of the resource allocation around here] conflating [what the people currently working on stuff called “AI safety/alignment” say they could use help with] with [what is needed in order to figure out AGI alignment].
One response to what I’m saying is: Yes, the people in the field will of course make lots of mistakes, but they’re still at the forefront, and so the aggregate of their guesses about what new talent should do represent our best guess.
My counterresponse: No, that doesn’t follow. There’s a separate parameter of “how much do we (the people in the field) actually know about how to turn effort into progress, as opposed to not knowing and therefore needing help in the form of new talent that tries new approaches to turning effort into progress”. At least as of a couple years ago, my sense was that nearly all experts working on AI safety/alignment would agree that 1. their plans for alignment won’t work, and 2. alignment is preparadigmatic. (I’m not confident that they would have said so, or would say so now.)
Depending on the value of that parameter, conflating “the current job and funding market” with “technical AI safety” makes more or less sense. Further, to the extent that people inappropriately conflate these two things, these two things become even more distinct. (Cf. Dangers of deference.)
I think there might be a simple miscommunication here: in our title and report we use “talent needs” to refer to “job and funding opportunities that could use talent.” Importantly, we generally make a descriptive, not a normative, claim about the current job and funding opportunities. We could have titled the report “Open and Impactful Job and Funding Opportunities in Technical AI Safety,” but this felt unwieldy. Detailing what job and funding opportunities should exist in the technical AI safety field is beyond the scope of this report.
Ok I think you’re right. I didn’t know (at least, not well enough) that “talent needs” quasi-idiomatically means “sorts of people that an organization wants to hire”, and interpreted it to mean literally “needs (by anyone / the world) for skills / knowledge”.
I don’t buy the unwieldiness excuse; you could say “Hiring needs in on-paradigm technical AI safety”, for example. But me criticizing minutiae of the framing in this post doesn’t seem helpful. The main thing I want to communicate is that
1. the main direct help we can give to AGI alignment would go via novel ideas that would be considered off-paradigm; and therefore
2. high-caliber newcomers to the field should be strongly encouraged to try to do that; and
3. there are strong emergent effects in the resource allocation (money, narrative attention, collaboration) of the field that strongly discourage newcomers from doing so and/or don’t attract newcomers who would do so.
Yes, there is more than unwieldiness at play here. If we retitled the post “Hiring needs in on-paradigm technical AI safety,” (which does seem unwieldy and introduces an unneeded concept, IMO) this seems like it would work at cross purposes to our (now explicit) claim, “there are few opportunities for those pursuing non-prosaic, theoretical AI safety research.” I think it benefits no-one to make false or misleading claims about the current job market for non-prosaic, theoretical AI safety research (not that I think you are doing this; I just want our report to be clear). If anyone doesn’t like this fact about the world, I encourage them to do something about it! (E.g., found organizations, support mentees, publish concrete agendas, petition funders to change priorities.)
As indicated by MATS’ portfolio over research agendas, our revealed preferences largely disagree with point 1 (we definitely want to continue supporting novel ideas too, constraints permitting, but we aren’t Refine). Among other objectives, this report aims to show a flaw in the plan for point 2: high-caliber newcomers have few mentorship, job, or funding opportunities to mature as non-prosaic, theoretical technical AI safety researchers and the lead time for impactful Connectors is long. We welcome discussion on how to improve paths-to-impact for the many aspiring Connectors and theoretical AI safety researchers.
I agree with Tsvi here (as I’m sure will shock you :)).
I’d make a few points:
“our revealed preferences largely disagree with point 1”—this isn’t clear at all. We know MATS’ [preferences, given the incentives and constraints under which MATS operates]. We don’t know what you’d do absent such incentives and constraints.
I note also that “but we aren’t Refine” has the form [but we’re not doing x], rather than [but we have good reasons not to do x]. (I don’t think MATS should be Refine, but “we’re not currently 20% Refine-on-ramp” is no argument that it wouldn’t be a good idea)
MATS is in a stronger position than most to exert influence on the funding landscape. Sure, others should make this case too, but MATS should be actively making a case for what seems most important (to you, that is), not only catering to the current market.
Granted, this is complicated by MATS’ own funding constraints—you have more to lose too (and I do think this is a serious factor, undesirable as it might be).
If you believe that the current direction of the field isn’t great, then “ensure that our program continues to meet the talent needs of safety teams” is simply the wrong goal.
Of course the right goal isn’t diametrically opposed to that—but still, not that.
There’s little reason to expect the current direction of the field to be close to ideal:
1. At best, the accuracy of the field’s collective direction will tend to correspond to its collective understanding—which is low.
2. There are huge commercial incentives exerting influence.
3. There’s no clarity on what constitutes (progress towards) genuine impact.
4. There are many incentives to work on what’s already not neglected (e.g. things with easily located “tight empirical feedback loops”). The desirable properties of the non-neglected directions are a large part of the reason they’re not neglected.
5. Similar arguments apply to [field-level self-correction mechanisms].
Given (4), there’s an inherent sampling bias in taking [needs of current field] as [what MATS should provide]. Of course there’s still an efficiency upside in catering to [needs of current field] to a large extent—but efficiently heading in a poor direction still sucks.
I think it’s instructive to consider extreme-field-composition thought experiments: suppose the field were composed of [10,000 researchers doing mech interp] and [10 researchers doing agent foundations].
Where would there be most jobs? Most funding? Most concrete ideas for further work? Does it follow that MATS would focus almost entirely on meeting the needs of all the mech interp orgs? (I expect that almost all the researchers in that scenario would claim mech interp is the most promising direction)
If you think that feedback loops along the lines of [[fast legible work on x] --> [x seems productive] --> [more people fund and work on x]] lead to desirable field dynamics in an AIS context, then it may make sense to cater to the current market. (personally, I expect this to give a systematically poor signal, but it’s not as though it’s easy to find good signals)
If you don’t expect such dynamics to end well, it’s worth considering to what extent MATS can be a field-level self-correction mechanism, rather than a contributor to predictably undesirable dynamics.
I’m not claiming this is easy!!
I’m claiming that it should be tried.
Detailing what job and funding opportunities should exist in the technical AI safety field is beyond the scope of this report.
Understandable, but do you know anyone who’s considering this? As the core of their job, I mean—not on a [something they occasionally think/talk about for a couple of hours] level. It’s non-obvious to me that anyone at OpenPhil has time for this.
It seems to me that the collective ‘decision’ we’ve made here is something like:
Any person/team doing this job would need:
Extremely good AIS understanding.
To be broadly respected.
To have a lot of time.
Nobody like this exists.
We’ll just hope things work out okay using a passive distributed approach.
To my eye this leads to a load of narrow optimization according to often-not-particularly-enlightened metrics—lots of common incentives, common metrics, and correlated failure.
Oh and I still think MATS is great :) - and that most of these issues are only solvable with appropriate downstream funding landscape alterations. That said, I remain hopeful that MATS can nudge things in a helpful direction.
I plan to respond regarding MATS’ future priorities when I’m able (I can’t speak on behalf of MATS alone here and we are currently examining priorities in the lead up to our Winter 2024-25 Program), but in the meantime I’ve added some requests for proposals to my Manifund Regrantor profile.
RFPs seem a good tool here for sure. Other coordination mechanisms too. (And perhaps RFPs for RFPs, where sketching out high-level desiderata is easier than specifying parameters for [type of concrete project you’d like to see])
Oh and I think the MATS Winter Retrospective seems great from the [measure a whole load of stuff] perspective. I think it’s non-obvious what conclusions to draw, but more data is a good starting point. It’s on my to-do-list to read it carefully and share some thoughts.
Ok I want to just lay out what I’m trying to do here, and why, because it could be based on false assumptions.
A main assumption I’m making, which totally could be false, is that your paragraph
Funders of independent researchers we’ve interviewed think that there are plenty of talented applicants, but would prefer more research proposals focused on relatively few existing promising research directions (e.g., Open Phil RFPs, MATS mentors’ agendas), rather than a profusion of speculative new agendas.
is generally representative of the entire landscape, with a few small-ish exceptions. In other words, I’m assuming that it’s pretty difficult for a young smart person to show up and say “hey, I want to spend 3 whole years thinking about this problem de novo, can I have one year’s salary and a reevaluation after 1 year for a renewal”.
A main assumption that motivates what I’m doing here, and that could be false, is:
Funders make decisions mostly by some combination of recommendations from people they trust. The trust might be personal, or might be based on accomplishments, or might be based on some arguments made by the trusted person to the funder—and, centrally, the trust is actually derived from a loose diffuse array of impressions coming from the community, broadly.
To make the assumption slightly more clear: The assumption says that it’s actually quite common, maybe even the single dominant way funders make decisions, for the causality of a decision to flow through literally thousands of little interactions, where the little interactions communicate “I think XYZ is Important/Unimportant”. And these aggregate up into a general sense of importance/unimportance, or something. And then funding decisions work with two filters:
The explicit reasoning about the details—is this person qualified, how much funding, what’s the feedback, who endorses it, etc etc.
The implicit filter of Un/Importance. This doesn’t get raised to attention usually. It’s just in the background.
And “fund a smart motivated youngster without a plan for 3 years with little evaluation” is “unimportant”. And this unimportance is implicitly but strongly reinforced by everyone talking about in-paradigm stuff. And the situation is self-reinforcing because youngsters mostly don’t try to do the thing, because there’s no narrative and no funding, and so it is actually true that there aren’t many smart motivated youngsters just waiting for some funding to do trailblazing.
If my assumptions are true, then IDK what to do about this but would say that at least
people should be aware of this situation, and
people should keep talking about this situation, especially in contexts where they are contributing to the loose diffuse array of impressions by contributing to framing about what AGI alignment needs.
An interesting note: I don’t necessarily want to start a debate about the merits of academia, but “fund a smart motivated youngster without a plan for 3 years with little evaluation” sounds a lot like “fund more exploratory AI safety PhDs” to me. If anyone wants to do an AI safety PhD (e.g., with these supervisors) and needs funding, I’m happy to evaluate these with my Manifund Regrantor hat on.
That would only work for people with the capacity to not give a fuck what anyone around them thinks, especially including the person funding and advising them. And that’s arguably unethical depending on context.
You’ll also have an unusual degree of autonomy: You’re basically guaranteed funding and a moderately supportive environment for 3-5 years, and if you have a hands-off advisor you can work on pretty much any research topic. This is enough time to try two or more ambitious and risky agendas.
Ex ante funding guarantees, like The Vitalik Buterin PhD Fellowship in AI Existential Safety or Manifund or other funders, mitigate my concerns around overly steering exploratory research. Also, if one is worried about culture/priority drift, there are several AI safety offices in Berkeley, Boston, London, etc. where one could complete their PhD while surrounded by AI safety professionals (which I believe was one of the main benefits of the late Lightcone office).
Moreover, the program guarantees at least some mentorship from your supervisor. Your advisor’s incentives are reasonably aligned with yours: they get judged by your success in general, so want to see you publish well-recognized first-author research, land a top research job after graduation and generally make a name for yourself (and by extension, them).
Doing a PhD also pushes you to learn how to communicate with the broader ML research community. The “publish or perish” imperative means you’ll get good at writing conference papers and defending your work.
These would be exactly the “anyone around them” about whose opinion they would have to not give a fuck.
I don’t know a good way to do this, but maybe a pointer would be: funders should explicitly state something to the effect of:
“The purpose of this PhD funding is to find new approaches to core problems in AGI alignment. Success in this goal can’t be judged by an existing academic structure (journals, conferences, peer-review, professors) because there does not exist such a structure aimed at the core problems in AGI alignment. You may if you wish make it a major goal of yours to produce output that is well-received by some group in academia, but be aware that this goal would be non-overlapping with the purpose of this PhD funding.”
The Vitalik fellowship says:
To be eligible, applicants should either be graduate students or be applying to PhD programs. Funding is conditional on being accepted to a PhD program, working on AI existential safety research, and having an advisor who can confirm to us that they will support the student’s work on AI existential safety research.
Despite being an extremely reasonable (even necessary) requirement, this is already a major problem according to me. The problem is that (IIUC—not sure) academics are incentivized to, basically, be dishonest, if it gets them funding for projects / students. Of the ~dozen professors here (https://futureoflife.org/about-us/our-people/ai-existential-safety-community/) who I’m at least a tiny bit familiar with, I think maybe 1.5ish are actually going to happily support actually-exploratory PhD students. I could be wrong about this though—curious for more data either way. And how many will successfully communicate to the sort of person who would take a real shot at exploratory conceptual research if given the opportunity to do such research that they would in fact support that? I don’t know. Zero? One? And how would someone sent to the FLI page know of the existence of that professor?
Fellows are expected to participate in annual workshops and other activities that will be organized to help them interact and network with other researchers in the field.
Continued funding is contingent on continued eligibility, demonstrated by submitting a brief (~1 page) progress report by July 1st of each year.
Again, reasonable, but… Needs more clarity on what is expected, and what is not expected.
a technical specification of the proposed research
What does this even mean? This webpage doesn’t get it. We’re trying to buy something that isn’t something someone can already write a technical specification of.
I want to sidestep critique of “more exploratory AI safety PhDs” for a moment and ask: why doesn’t MIRI sponsor high-calibre young researchers with a 1-3 year basic stipend and mentorship? And why did MIRI let Vivek’s team go?
I don’t speak for MIRI, but broadly I think MIRI thinks that roughly no existing research is hopeworthy, and that this isn’t likely to change soon. I think that, anyway.
In discussions like this one, I’m conditioning on something like “it’s worth it, these days, to directly try to solve AGI alignment”. That seems assumed in the post, seems assumed in lots of these discussions, seems assumed by lots of funders, and it’s why above I wrote “the main direct help we can give to AGI alignment” rather than something stronger like “the main help (simpliciter) we can give to AGI alignment” or “the main way we can decrease X-risk”.
I’m reading this as you saying something like “I’m trying to build a practical org that successfully onramps people into doing useful work. I can’t actually do that for arbitrary domains that people aren’t providing funding for. I’m trying to solve one particular part of the problem and that’s hard enough as it is.”
Yes to all this, but also I’ll go one level deeper. Even if I had tons more Manifund money to give out (and assuming all the talent needs discussed in the report are saturated with funding), it’s not immediately clear to me that “giving 1-3 year stipends to high-calibre young researchers, no questions asked” is the right play if they don’t have adequate mentorship, the ability to generate useful feedback loops, researcher support systems, access to frontier models if necessary, etc.
A few points here (all with respect to a target of “find new approaches to core problems in AGI alignment”):
It’s not clear to me what the upside of the PhD structure is supposed to be here (beyond respectability). If the aim is to avoid being influenced by most of the incentives and environment, that’s more easily achieved by not doing a PhD. (to the extent that development of research ‘taste’/skill acts to service a publish-or-perish constraint, that’s likely to be harmful)
This is not to say that there’s nothing useful about an academic context—only that the sensible approach seems to be [create environments with some of the same upsides, but fewer downsides].
I can see a more persuasive upside where the PhD environment gives:
Access to deep expertise in some relevant field.
The freedom to explore openly (without any “publish or perish” constraint).
This seems likely to be both rare, and more likely for professors not doing ML. I note here that ML professors are currently not solving fundamental alignment problems—we’re not in a [Newtonian physics looking for Einstein] situation; more [Aristotelian physics looking for Einstein]. I can more easily imagine a mathematics PhD environment being useful than an ML one (though I’d expect this to be rare too).
This is also not to say that a PhD environment might not be useful in various other ways. For example, I think David Krueger’s lab has done and is doing a bunch of useful stuff—but it’s highly unlikely to uncover new approaches to core problems.
For example, of the 213 concrete problems posed here, how many would lead us to think [it’s plausible that a good answer to this question leads to meaningful progress on core AGI alignment problems]? 5? 10? (many more can be a bit helpful for short-term safety)
There are a few where sufficiently general answers would be useful, but I don’t expect such generality—both since it’s hard, and because incentives constantly push towards [publish something on this local pattern], rather than [don’t waste time running and writing up experiments on this local pattern, but instead investigate underlying structure].
I note that David’s probably at the top of my list for [would be a good supervisor for this kind of thing, conditional on having agreed the exploratory aims at the outset], but the environment still seems likely to be not-close-to-optimal, since you’d be surrounded by people not doing such exploratory work.
I broadly agree with this. (And David was like .7 out of the 1.5 profs on the list who I guessed might genuinely want to grant the needed freedom.)
I do think that people might do good related work in math (specifically, probability/information theory, logic, etc.--stuff about formalized reasoning), philosophy (of mind), and possibly in other places such as theoretical linguistics. But this would require that the academic context is conducive to good novel work in the field (a lower bar, but one that is probably far from universally met), and would require the researcher to have good taste. And this is “related” in the sense of “might write a paper which leads to another paper which would be cited by [the alignment textbook from the future] for proofs/analogies/evidence about minds”.
Have you looked through the FLI faculty listed there? How many seem useful supervisors for this kind of thing? Why?
If we’re sticking to the [generate new approaches to core problems] aim, I can see three or four I’d be happy to recommend, conditional on their agreeing upfront to the exploratory goals, and that publication would not be necessary (or a very low concrete number agreed upon).
There are about ten more that seem not-obviously-a-terrible-idea, but probably not great (e.g. those who I expect have a decent understanding of the core problems, but basically aren’t working on them).
The majority don’t write anything that suggests they know what the core problems are.
For almost all of these supervisors, doing a PhD would seem to provide quite a few constraints, undesirable incentives, and an environment that’s poor. From an individual’s point of view this can still make sense, if it’s one of the only ways to get stable medium-term funding. From a funder’s point of view, it seems nuts. (again, less nuts if the goal were [incremental progress on prosaic approaches, and generation of a respectable publication record])
Yeah that looks good, except that it takes an order of magnitude longer to get going on conceptual alignment directions. I’ll message Adam to hear what happened with that.
For reference there’s this: What I learned running Refine. When I talked to Adam about this (over 12 months ago), he didn’t think there was much to say beyond what’s in that post. Perhaps he’s updated since.
My sense is that I view it as more of a success than Adam does. In particular, I think it’s a bit harsh to solely apply the [genuinely new directions discovered] metric. Even when doing everything right, I expect the hit rate to be very low there, with [variation on current framing/approach] being the most common type of success.
Agreed that Refine’s timescale is clearly too short. However, a much longer program would set a high bar for whoever’s running it. Personally, I’d only be comfortable doing so if the setup were flexible enough that it didn’t seem likely to limit the potential of participants (by being less productive-in-the-sense-desired than counterfactual environments).
In particular, I think it’s a bit harsh to solely apply the [genuinely new directions discovered] metric. Even when doing everything right, I expect the hit rate to be very low there, with [variation on current framing/approach] being the most common type of success.
Mhm. In fact I’d want to apply a bar that’s even lower, or at least different: [the extent to which the participants (as judged by more established alignment thinkers) seem to be well on the way to developing new promising directions—e.g. being relentlessly resourceful including at the meta-level; having both appropriate Babble and appropriate Prune; not shying away from the hard parts].
the setup were flexible enough that it didn’t seem likely to limit the potential of participants (by being less productive-in-the-sense-desired than counterfactual environments).
Agree that this is an issue, but I think it can be addressed—certainly at least well enough that there’d be worthwhile value-of-info in running such a thing.
I’d be happy to contribute a bit of effort, if someone else is taking the lead. I think most of my efforts will be directed elsewhere, but for example I’d be happy to think through what such a program should look like; help write justificatory parts of grant applications; and maybe mentor / similar.
I think there might be a simple miscommunication here: in our title and report we use “talent needs” to refer to “job and funding opportunities that could use talent.” Importantly, we generally make a descriptive, not a normative, claim about the current job and funding opportunities.
I think the title of this post is actively misleading if that’s what you’re trying to convey. “Defining” a term to mean something specific thing, which does not match how lots of readers will interpret it (especially in the title!), will in general make your writing not communicate what your “definition” claims to be trying to communicate.
If the post is about job openings and grant opportunities, then it should say that at the top, rather than “talent needs”.
I can understand if some people are confused by the title, but we do say “the talent needs of safety teams” in the first sentence. Granted, this doesn’t explicitly reference “funding opportunities” too, but it does make it clear that it is the (unfulfilled) needs of existent safety teams that we are principally referring to.
I appreciate this clarification, but I think it’s not enough. As the most defensible counterexample, theoretical math is quintessentially technical, whether or not it relates to (non-mental) experimentation. A less defensible but more important counterexample is (careful, speculative, motivated, core) philosophy. An alternative name for what you mean here could be “prosaic”. See e.g. https://www.lesswrong.com/posts/YTq4X6inEudiHkHDF/prosaic-ai-alignment :
If “prosaic” sounds derogatory, another alternative would be “in-/on-paradigm”.
All young people and other newcomers should be made aware that on-paradigm AI safety/alignment—while being more tractable, feedbacked, well-resourced, and populated compared to theory—is also inevitably streetlighting https://en.wikipedia.org/wiki/Streetlight_effect.
Half-agree. I think there’s scope within field like interp to focus on things that are closer to the hard part of the problem or at least touch on robust bottlenecks for alignment agendas (eg: ontology identification). I do think there is a lot of diversity in people working in these more legible areas and that means there are now many people who haven’t engaged with or understood the alignment problem well enough to realise where we might be suffering from the street light effect.
Object level: ontology identification, in the sense that is studied empirically, is pretty useless. It streetlights on recognizable things, and AFAIK isn’t trying to avoid, for example, the Doppelgänger problem or to at all handle diasystemic novelty or the ex quo of a mind’s creativity. [ETA: actually ELK I think addresses the Doppelgänger problem in its problem statement, if not in any proposed solutions.]
Meta:
You hedged your statement so much that it became true and also not very relevant. Here are the hedges:
“scope”: some research could be interpreted as trying to get to some other research, or as having a mission statement that includes some other research
“within field[s]”: some people / some research—or maybe no actual people or research, but possible research that would fit with the genre of the field
“closer to”: but maybe not close to, in an absolute sense
“or at least touch on”: if an academic philosopher wrote this about their work, you’d immediately recognize it as cope
“alignment agendas”: there aren’t any alignment agendas. There are alignment agendas in the sense that “we can start a colony around Proxima Centauri in the following way: 1. make a go-really-fast-er. 2. use the go-really-fast-er to go really fast towards Proxima Centauri” is an agenda to get to Proxima Centauri. If you make no mention of the part where you have to also slow down, and the part about steering, and the part where you have to shield from cosmic rays, and make a self-sustaining habitat on the ship, and the part about are any of the planets around Proxima Centauri remotely habitable… is this really an agenda?
I haven’t seen anyone do such interpretability research yet, but I see no particular reason to think this is the sort of thing that can’t be studied empirically, rather than simply the sort of thing that hasn’t been studied empirically yet. We have, for example, vision transformers and language transformers. I would be very surprised if there was a pure 1:1 mapping between the learned features in those two types of transformer models.
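(Purely as an illustrative sketch of how one might begin poking at that question, and not anything anyone in this thread has actually proposed or run: collect activations from a vision model and a language model on paired inputs, e.g. images and their captions, and compare the representational geometry with linear CKA. The dimensions and random arrays below are placeholders standing in for real cached activations.)

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear centered kernel alignment between two activation matrices of
    shape (n_samples, n_features); higher means more similar geometry."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(num / den)

rng = np.random.default_rng(0)
n_pairs = 512  # e.g. image-caption pairs fed to each model
vision_acts = rng.normal(size=(n_pairs, 768))   # placeholder for ViT activations
text_acts = rng.normal(size=(n_pairs, 1024))    # placeholder for LM activations
print(f"cross-modal linear CKA: {linear_cka(vision_acts, text_acts):.3f}")
```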
Well, empirically, when people try to study it empirically, instead they do something else. Surely that’s empirical evidence that it can’t be studied empirically? (I’m a little bit trolling but also not.)
I’d say mechanistic interpretability is trending toward a field which cares about & researches the problems you mention. For example, the Doppelgänger problem is a fairly standard criticism of the sparse autoencoder work. Diasystemic novelty seems like the kind of thing you’d encounter when doing developmental interpretability, interp-through-time, or inductive-biases research, especially with a focus on phase changes (a growing focus area). And though I’m having a hard time parsing your creativity post (an indictment of me, not of you, as I didn’t spend too long with it), it seems like the kind of thing which would come from the study of in-context learning, a goal that I believe mainstream MI has, even if it doesn’t focus on it now (likely because it believes it’s unable to at this moment), and which I think it will care more about as the power of such in-context learning becomes more and more apparent.
ETA: An argument could be that though these problems will come up, ultimately the field will prioritize hacky fixes in order to deal with them, which only sweep the problems under the rug. I think many in MI will prioritize such limited fixes, but also that some won’t, and due to the benefits of such problems becoming empirical, such people will be able to prove the value of their theoretical work & methodology by convincing MI people with their practical applications, and money will get diverted to such theoretical work & methodology by DL-theory-traumatized grantmakers.
And what’s the response to the criticism, or a/the hoped approach?
Yeah, this makes sense. And hey, maybe it will lead to good stuff. Any results so far, that I might consider approaching some core alignment difficulties?
Also makes some sense (though the ex quo, insofar as we even want to attribute this to current systems, is distributed across the training algorithms and the architecture sources, as well as inference-time stuff).
Generally what you’re bringing up sounds like “yes these are problems and MI would like to think about them… later”. Which is understandable, but yeah, that’s what streetlighting looks like.
Maybe an implicit justification of current work is like:
This makes a lot of sense—it’s both empathizandable, and seems probably somewhat true. However:
Again, it still isn’t in fact currently addressing the hard parts. We want to keep straight the difference between [currently addressing] vs. [arguably might address in the future].
We gotta think about what sort of thing would possibly ever work. We gotta think about this now, as much as possible.
A core motivating intuition behind the MI program is (I think) “the stuff is all there, perfectly accessible programmatically, we just have to learn to read it”. This intuition is deeply flawed: Koan: divining alien datastructures from RAM activations
I don’t know of any clear progress on your interests yet. My argument was about the trajectory MI is on, which I think is largely pointed in the right direction. We can argue about the speed at which it gets to the hard problems, whether it’s fast enough, and how to make it faster, though. So you seem to have understood me well.
I think I’m more agnostic than you are about this, and also about how “deeply” flawed MI’s intuitions are. If you’re right, once the field progresses to nontrivial dynamics, we should expect those operating at a higher level of analysis—conceptual MI—to discover more than those operating at a lower level, right?
If, hypothetically, we were doing MI on minds, then I would predict that MI will pick some low hanging fruit and then hit walls where their methods will stop working, and it will be more difficult to develop new methods that work. The new methods that work will look more and more like reflecting on one’s own thinking, discovering new ways of understanding one’s own thinking, and then going and looking for something like that in the in-vitro mind. IDK how far that could go. But then this will completely grind to a halt when the IVM is coming up with concepts and ways of thinking that are novel to humanity. Some other approach would be needed to learn new ideas from a mind via MI.
However, another dealbreaker problem with current and current-trajectory MI is that it isn’t studying minds.
I mean my impression is that there are something on the order of 100-1000 people in the world working on ML interpretability as their day job, and maybe 1k-10k people who dabble in their free time. No research in the field will get done unless one of that small number of people makes a specific decision to tackle that particular research question instead of one of the countless other ones they could choose to tackle.
I don’t know what you’re trying to do in this thread (e.g. what question you’re trying to answer).
To be explicit, that was a response to
I don’t know that we have any empirical data on what happens when people try to study that particular empirical question (the specific relationship between the features learned by two models of different modalities) because I don’t know that anyone has set out to study that particular question in any serious way.
In other words, I suspect it’s not “when someone starts to study this phenomenon, some mysterious process causes them to study something else instead”. I think it’s “the surface area of the field is large and there aren’t many people in it, so I doubt anyone has even gotten to the part where they start to study this phenomenon.”
Edit: to be even more explicit, what I’m trying to do in this thread is encourage thinking about ways one might collect empirical observations about non-”streetlit” topics. None of the topics are under the streetlight until someone builds the streetlight. “Build a streetlight” is sometimes an available action, but it only happens if someone makes a specific effort to do so.
Edit 2: I misunderstood what point you were making as “prosaic alignment is unlikely to be helpful, look at all of these empirical researchers who have not even answered these basic questions” (which is a perspective I disagree with pretty strongly) rather than “I think empirical research shouldn’t be the only game in town” (which I agree with) and “we should fund outsiders to go do stuff without much interaction with or feedback from the community to hopefully develop new ideas that are not contaminated with the current community biases” (I think this would be worth doing if resources were unlimited, not sure as things actually stand).
As a concrete note, I suspect work that demonstrates that philosophical or mathematical approaches can yield predictions about empirical questions is more likely to be funded. For example, in your post you say
Could that be operationalized as a prediction of the form
(I expect that’s not a correct operationalization but something of that shape)
Here’s the convo according to me:
Bloom:
BT:
sname:
BT:
sname:
BT:
When I said “when people try to study it empirically”, what I meant was “when people try to do interpretability research (presumably, that is relevant to the hard part of the problem?)”.
Right, I’m not saying exactly this. But I am saying:
(*): studying LLMs, which are not minds; trying to recognize [stuff we mostly conceptually understand] within systems rather than trying to come to conceptually understand [the stuff we’d need to be able to recognize/design in a mind, in order to determine the mind’s effects].
Well, you’ve agreed with a defanged version of my statements. The toothful version, which I do think: Insofar as this is even possible, we should allocate a lot more resources toward funding any high-caliber smart/creative/interesting/promising/motivated youngsters/newcomers who want to take a crack at independently approaching the core difficulties of AGI alignment, even if that means reallocating a lot of resources away from existing on-paradigm research.
This seems like a good thing to do. But there’s multiple ways that existing research is streetlit, and reality doesn’t owe it to you to make it be the case that there are nice (tractionful, feasible, interesting, empirical, familiar, non-weird-seeming, feedbacked, grounded, legible, consensusful) paths toward the important stuff. The absence of nice paths would really suck if it’s the case, and it’s hard to see how anyone could be justifiedly really confident that there are no nice paths. But yes, I’m saying that it looks like there aren’t nice paths, or at least there aren’t enough nice paths that we seem likely to find them by continuing to sample from the same distribution we’ve been sampling from; and I have some arguments and reasons supporting this belief, which seem true; and I would guess that a substantial fraction (though not most) of current alignment researchers would agree with a fairly strong version of “very few or no nice paths”.
I don’t think that’s a good operationalization, as you predict. I think it’s trying to be an operationalization related to my claim above:
But it sort of sounds like you’re trying to extract a prediction about capability generalization or something? Anyway, an interp-like study trying to handle diasystemic novelty might for example try to predict large scale explicitization events before they happen—maybe in a way that’s robust to “drop out”. E.g. you have a mind that doesn’t explicitly understand Bayesian reasoning; but it is engaging in lots of activities that would naturally induce small-world probabilistic reasoning, e.g. gambling games or predicting-in-distribution simple physical systems; and then your interpreter’s job is to notice, maybe only given access to restricted parts (in time or space, say) of the mind’s internals, that Bayesian reasoning is (implicitly) on the rise in many places. (This is still easy mode if the interpreter gets to understand Bayesian reasoning explicitly beforehand.) I don’t necessarily recommend this sort of study, though; I favor theory.
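(A toy, purely behavioral illustration of the “notice that Bayesian reasoning is (implicitly) on the rise” idea, not a proposed experiment, and strictly easier than the internals-based version described above: simulate a system whose coin-flip predictions drift toward the Bayesian posterior as training proceeds, and have a detector track the shrinking gap. All names and numbers are made-up placeholders.)

```python
import numpy as np

def posterior_mean(heads: int, n: int) -> float:
    # Beta(1,1) posterior mean for P(heads) after n flips with `heads` heads.
    return (heads + 1) / (n + 2)

def mind_prediction(heads: int, n: int, training_progress: float) -> float:
    # Placeholder "mind": interpolates between a stubborn 0.5 guess and the
    # Bayesian posterior as training_progress goes from 0 to 1.
    return (1 - training_progress) * 0.5 + training_progress * posterior_mean(heads, n)

def mean_gap_to_bayes(training_progress: float, n_flips: int = 200, seed: int = 0) -> float:
    # "Detector": average distance between the mind's predictions and the
    # Bayesian posterior over a stream of coin flips.
    rng = np.random.default_rng(seed)
    flips = rng.integers(0, 2, size=n_flips)
    gaps = [abs(mind_prediction(int(flips[:t].sum()), t, training_progress)
                - posterior_mean(int(flips[:t].sum()), t))
            for t in range(1, n_flips)]
    return float(np.mean(gaps))

for progress in (0.0, 0.5, 1.0):
    print(f"training progress {progress:.1f}: "
          f"mean gap to Bayesian posterior = {mean_gap_to_bayes(progress):.4f}")
```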
Seems about right
Is there a particular reason you expect there to be exactly one hard part of the problem, and for the part that ends up being hardest in the end to be the part that looks hardest to us now?
If I were a prosaic alignment researcher, I probably would choose to prioritize which problems I worked on a bit differently than those currently in the field. However, I expect that the research that ends up being the most useful will not be that research which looked most promising before someone started doing it, but rather research that stemmed from someone trying something extremely simple and getting an unexpected result, and going “huh, that’s funny, I should investigate further”. I think that the process of looking at lots of things and trying to get feedback from reality as quickly as possible is promising, even if I don’t have a strong expectation that any one specific one of those things is promising to look at.
Certainly reality doesn’t owe us a path like that, but it would be pretty undignified if reality did in fact give us a path like that and we failed to find it because we didn’t even look.
Interesting. I would be pretty interested to see research along these lines (although the scope of the above is probably still a bit large for a pilot project).
What is your preferred method for getting feedback from reality on whether your theory describes the world as it is?
Have you stopped beating your wife? I say “the” here in the sense of like “the problem of climbing that mountain over there”. If you’re far away, it makes sense to talk about “the (thing over there)”, even if, when you’re up close, there’s multiple routes, multiple summits, multiple sorts of needed equipment, multiple sources of risk, etc.
We make an argument like “any solution would have to address X” or “anything with feature Y does not do Z” or “property W is impossible”, and then we can see what a given piece of research is and is not doing / how it is doomed to irrelevance. It’s not like pointing to a little ball in ideaspace and being like “the answer is somewhere in here”. Rather it’s like cutting out a halfspace and saying “everything on this side of this plane is doomed, we’d have to be somewhere in the other half”, or like pointing out a manifold that all research is on and saying “anything on this manifold is doomed, we’d have to figure out how to move somewhat orthogonalward”.
I agree IF we are looking at the objects in question. If LLMs were minds, the research would be much more relevant. (I don’t care if you have an army of people who all agree on taking a stance that seems to imply that there’s not much relevant difference between LLMs and future AGI systems that might kill everyone.)
I think you (and everyone else) don’t know how to ask this question properly. For example, “on whether your theory describes the world as it is” is a too-narrow idea of what our thoughts about minds are supposed to be. Sub-example: our thoughts about mind are supposed to also produce design ideas.
To answer your question: by looking at and thinking about minds. The only minds that currently exist are humans, and the best access you have to minds is introspection. (I don’t mean meditation, I mean thinking and also thinking about thinking/wanting/acting—aka some kinds of philosophy and math.)
I think the appropriate analogy is someone trying to strategize about “the hard part of climbing that mountain over there” before they have even reached base camp or seriously attempted to summit any other mountains. There are a bunch of parts that might end up being hard, and one can come up with some reasonable guesses as to what those parts might be, but the bits that look hard from a distance and the bits that end up being hard when you’re on the face of the mountain may be different parts.
Doomed to irrelevance, or doomed to not being a complete solution in and of itself? The point of a lot of research is to look at a piece of the world and figure out how it ticks. Research to figure out how a piece of the world ticks won’t usually directly allow you to make it tock instead, but can be a useful stepping stone. Concrete example: dictionary learning vs Golden Gate Claude.
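(For readers who haven’t followed that line of work: “dictionary learning” here refers to the sparse-autoencoder style of interpretability research mentioned earlier in the thread. A minimal sketch of the technique, with arbitrary placeholder dimensions, data, and hyperparameters rather than any real setup, looks roughly like this.)

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Learn an overcomplete dictionary of features from model activations,
    with an L1 penalty encouraging sparse codes."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        codes = torch.relu(self.encoder(x))  # sparse feature activations
        recon = self.decoder(codes)          # reconstruction of the input
        return recon, codes

sae = SparseAutoencoder(d_model=512, d_dict=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
acts = torch.randn(8192, 512)  # placeholder for cached residual-stream activations

for step in range(100):
    batch = acts[torch.randint(0, len(acts), (256,))]
    recon, codes = sae(batch)
    loss = ((recon - batch) ** 2).mean() + 1e-3 * codes.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```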
I think one significant crux is “to what extent are LLMs doing the same sort of thing that human brains do / the same sorts of things that future, more powerful AIs will do?” It sounds like you think the answer is “they’re completely different and you won’t learn much about one by studying the other”. Is that an accurate characterization?
Agreed, though with quibbles
In my experience, my brain is a dirty lying liar that lies to me at every opportunity—another crux might be how faithful one expects their memory of their thought processes to be to the actual reality of those thought processes.
Doomed to not be trying to go to and then climb the mountain.
So then it isn’t easy. But it’s feedback. Also there’s not that much distinction between making a philosophically rigorous argument and “doing introspection” in the sense I mean, so if you think the former is feasible, work from there.
If you think that current mech interp work is currently trying to directly climb the mountain, rather than trying to build and test a set of techniques that might be helpful on a summit attempt, I can see why you’d be frustrated and discouraged at the lack of progress.
> Also there’s not that much distinction between making a philosophically rigorous argument and “doing introspection” in the sense I mean, so if you think the former is feasible, work from there.
I don’t have much hope in the former being feasible, though I do support having a nonzero number of people try it because sometimes things I don’t think are feasible end up working.
I mean if we’re going with memes I could equally say
though realistically I think the most common problem in this kind of discussion is
Look… Consider the hypothetically possible situation that in fact everyone is very far from being on the right track, and everything everyone is doing doesn’t help with the right track and isn’t on track to get on the right track or to help with the right track.
Ok, so I’m telling you that this hypothetically possible situation seems to me like the reality. And then you’re, I don’t know, trying to retreat to some sort of agreeable live-and-let-live stance, or something, where we all just agree that due to model uncertainty and the fact that people have vaguely plausible stories for how their thing might possibly be helpful, everyone should do their own thing and it’s not helpful to try to say that some big swath of research is doomed? If this is what’s happening, then I think that what you in particular are doing here is a bad thing to do here.
Maybe we can have a phone call if you’d like to discuss further.
I doubt it’s worth it—I’m not a major funder in this space and don’t expect to become one in the near future, and my impression is that there is no imminent danger of you shutting down research that looks promising to me and unpromising to you. As such, I think the discussion ended up getting into the weeds in a way that probably wasn’t a great use of either of our time, and I doubt spending more time on it would change that.
That said, I appreciated your clarity of thought, and in particular your restatement of how the conversation looked to you. I will probably be stealing that technique.
You are, of course, correct in your definitions of “technical” and “prosaic” AI safety. Our interview series did not exclude advocates of theoretical or non-prosaic approaches to AI safety. It was not the intent of this report to ignore talent needs in non-prosaic technical AI safety. We believe that this report summarises our best understanding of the dominant talent needs across all of technical AI safety, at least as expressed by current funders and org leaders.
MATS has supported several theoretical or non-prosaic approaches to improving AI safety, including Vanessa Kosoy’s learning theoretic agenda, Jesse Clifton’s and Caspar Oesterheld’s cooperative AI research, Vivek Hebbar’s empirical agent foundations research, John Wentworth’s selection theorems agenda, and more. We remain supportive of well-scoped agent foundations research, particularly that with tight empirical feedback loops. If you are an experienced agent foundations researcher who wants to mentor, please contact us; this sub-field seems particularly bottlenecked by high-quality mentorship right now.
I have amended our footnote to say:
If you disagree with our assessment, please let us know! We would love to hear about more jobs or funding opportunities for non-prosaic AI safety research.
Thanks. Well, now the footnote seems better, but now it contradicts the title. The footnote says that “the main focus of this report” is “the current job and funding market”. This is conflating “the current job and funding market” with “technical AI safety”, given that the title is “Talent Needs in Technical AI Safety”.
Note: I don’t mean to single out you (Ryan) or MATS or this post; I greatly appreciate your work and think it’s good, and don’t think you’re doing something worse than others are doing with regard to framing the field for newcomers. What I’m trying to do here is fight a (rearguard, unfortunately) action against the sweep of [most of the resource allocation around here] conflating [what the people currently working on stuff called “AI safety/alignment” say they could use help with] with [what is needed in order to figure out AGI alignment].
One response to what I’m saying is: Yes, the people in the field will of course make lots of mistakes, but they’re still at the forefront, and so the aggregate of their guesses about what new talent should do represent our best guess.
My counterresponse: No, that doesn’t follow. There’s a separate parameter of “how much do we (the people in the field) actually know about how to turn effort into progress, as opposed to not knowing and therefore needing help in the form of new talent that tries new approaches to turning effort into progress”. At least as of a couple years ago, my sense was that nearly all experts working on AI safety/alignment would agree that 1. their plans for alignment won’t work, and 2. alignment is preparadigmatic. (I’m not confident that they would have said so, or would say so now.)
Depending on the value of that parameter, conflating “the current job and funding market” with “technical AI safety” makes more or less sense. Further, to the extent that people inappropriately conflate these two things, these two things become even more distinct. (Cf. Dangers of deference.)
I think there might be a simple miscommunication here: in our title and report we use “talent needs” to refer to “job and funding opportunities that could use talent.” Importantly, we generally make a descriptive, not a normative, claim about the current job and funding opportunities. We could have titled the report “Open and Impactful Job and Funding Opportunities in Technical AI Safety,” but this felt unwieldy. Detailing what job and funding opportunities should exist in the technical AI safety field is beyond the scope of this report.
Also, your feedback is definitely appreciated!
Ok I think you’re right. I didn’t know (at least, not well enough) that “talent needs” quasi-idiomatically means “sorts of people that an organization wants to hire”, and interpreted it to mean literally “needs (by anyone / the world) for skills / knowledge”.
I don’t buy the unwieldiness excuse; you could say “Hiring needs in on-paradigm technical AI safety”, for example. But me criticizing minutiae of the framing in this post doesn’t seem helpful. The main thing I want to communicate is that
1. the main direct help we can give to AGI alignment would go via novel ideas that would be considered off-paradigm; and therefore
2. high-caliber newcomers to the field should be strongly encouraged to try to do that; and
3. there are strong emergent effects in the resource allocation (money, narrative attention, collaboration) of the field that strongly discourage newcomers from doing so and/or don’t attract newcomers who would do so.
Yes, there is more than unwieldiness at play here. If we retitled the post “Hiring needs in on-paradigm technical AI safety,” (which does seem unwieldy and introduces an unneeded concept, IMO) this seems like it would work at cross purposes to our (now explicit) claim, “there are few opportunities for those pursuing non-prosaic, theoretical AI safety research.” I think it benefits no-one to make false or misleading claims about the current job market for non-prosaic, theoretical AI safety research (not that I think you are doing this; I just want our report to be clear). If anyone doesn’t like this fact about the world, I encourage them to do something about it! (E.g., found organizations, support mentees, publish concrete agendas, petition funders to change priorities.)
As indicated by MATS’ portfolio over research agendas, our revealed preferences largely disagree with point 1 (we definitely want to continue supporting novel ideas too, constraints permitting, but we aren’t Refine). Among other objectives, this report aims to show a flaw in the plan for point 2: high-caliber newcomers have few mentorship, job, or funding opportunities to mature as non-prosaic, theoretical technical AI safety researchers and the lead time for impactful Connectors is long. We welcome discussion on how to improve paths-to-impact for the many aspiring Connectors and theoretical AI safety researchers.
I agree with Tsvi here (as I’m sure will shock you :)).
I’d make a few points:
“our revealed preferences largely disagree with point 1”—this isn’t clear at all. We know MATS’ [preferences, given the incentives and constraints under which MATS operates]. We don’t know what you’d do absent such incentives and constraints.
I note also that “but we aren’t Refine” has the form [but we’re not doing x], rather than [but we have good reasons not to do x]. (I don’t think MATS should be Refine, but “we’re not currently 20% Refine-on-ramp” is no argument that it wouldn’t be a good idea)
MATS is in a stronger position than most to exert influence on the funding landscape. Sure, others should make this case too, but MATS should be actively making a case for what seems most important (to you, that is), not only catering to the current market.
Granted, this is complicated by MATS’ own funding constraints—you have more to lose too (and I do think this is a serious factor, undesirable as it might be).
If you believe that the current direction of the field isn’t great, then “ensure that our program continues to meet the talent needs of safety teams” is simply the wrong goal.
Of course the right goal isn’t diametrically opposed to that—but still, not that.
There’s little reason to expect the current direction of the field to be close to ideal:
1. At best, the accuracy of the field’s collective direction will tend to correspond to its collective understanding—which is low.
2. There are huge commercial incentives exerting influence.
3. There’s no clarity on what constitutes (progress towards) genuine impact.
4. There are many incentives to work on what’s already not neglected (e.g. things with easily located “tight empirical feedback loops”). The desirable properties of the non-neglected directions are a large part of the reason they’re not neglected.
Similar arguments apply to [field-level self-correction mechanisms].
Given (4), there’s an inherent sampling bias in taking [needs of current field] as [what MATS should provide]. Of course there’s still an efficiency upside in catering to [needs of current field] to a large extent—but efficiently heading in a poor direction still sucks.
I think it’s instructive to consider extreme-field-composition thought experiments: suppose the field were composed of [10,000 researchers doing mech interp] and [10 researchers doing agent foundations].
Where would there be most jobs? Most funding? Most concrete ideas for further work? Does it follow that MATS would focus almost entirely on meeting the needs of all the mech interp orgs? (I expect that almost all the researchers in that scenario would claim mech interp is the most promising direction)
If you think that feedback loops along the lines of [[fast legible work on x] --> [x seems productive] --> [more people fund and work on x]] lead to desirable field dynamics in an AIS context, then it may make sense to cater to the current market. (personally, I expect this to give a systematically poor signal, but it’s not as though it’s easy to find good signals)
If you don’t expect such dynamics to end well, it’s worth considering to what extent MATS can be a field-level self-correction mechanism, rather than a contributor to predictably undesirable dynamics.
I’m not claiming this is easy!!
I’m claiming that it should be tried.
Understandable, but do you know anyone who’s considering this? As the core of their job, I mean—not on a [something they occasionally think/talk about for a couple of hours] level. It’s non-obvious to me that anyone at OpenPhil has time for this.
It seems to me that the collective ‘decision’ we’ve made here is something like:
Any person/team doing this job would need:
Extremely good AIS understanding.
Broad respect.
A lot of time.
Nobody like this exists.
We’ll just hope things work out okay using a passive distributed approach.
To my eye this leads to a load of narrow optimization according to often-not-particularly-enlightened metrics—lots of common incentives, common metrics, and correlated failure.
Oh and I still think MATS is great :) - and that most of these issues are only solvable with appropriate downstream funding landscape alterations. That said, I remain hopeful that MATS can nudge things in a helpful direction.
I plan to respond regarding MATS’ future priorities when I’m able (I can’t speak on behalf of MATS alone here and we are currently examining priorities in the lead up to our Winter 2024-25 Program), but in the meantime I’ve added some requests for proposals to my Manifund Regrantor profile.
RFPs seem a good tool here for sure. Other coordination mechanisms too.
(And perhaps RFPs for RFPs, where sketching out high-level desiderata is easier than specifying parameters for [type of concrete project you’d like to see])
Oh and I think the MATS Winter Retrospective seems great from the [measure a whole load of stuff] perspective. I think it’s non-obvious what conclusions to draw, but more data is a good starting point. It’s on my to-do-list to read it carefully and share some thoughts.
Ok I want to just lay out what I’m trying to do here, and why, because it could be based on false assumptions.
A main assumption I’m making, which totally could be false, is that your paragraph
is generally representative of the entire landscape, with a few small-ish exceptions. In other words, I’m assuming that it’s pretty difficult for a young smart person to show up and say “hey, I want to spend 3 whole years thinking about this problem de novo, can I have one year’s salary and a reevaluation after 1 year for a renewal”.
A main assumption that motivates what I’m doing here, and that could be false, is:
To make the assumption slightly more clear: The assumption says that it’s actually quite common, maybe even the single dominant way funders make decisions, for the causality of a decision to flow through literally thousands of little interactions, where the little interactions communicate “I think XYZ is Important/Unimportant”. And these aggregate up into a general sense of importance/unimportance, or something. And then funding decisions work with two filters:
The explicit reasoning about the details—is this person qualified, how much funding, what’s the feedback, who endorses it, etc etc.
The implicit filter of Un/Importance. This doesn’t get raised to attention usually. It’s just in the background.
And “fund a smart motivated youngster without a plan for 3 years with little evaluation” is “unimportant”. And this unimportance is implicitly but strongly reinforced by everyone talking about in-paradigm stuff. And the situation is self-reinforcing because youngsters mostly don’t try to do the thing, because there’s no narrative and no funding, and so it is actually true that there aren’t many smart motivated youngsters just waiting for some funding to do trailblazing.
If my assumptions are true, then IDK what to do about this but would say that at least
people should be aware of this situation, and
people should keep talking about this situation, especially in contexts where they are contributing to the loose diffuse array of impressions by contributing to framing about what AGI alignment needs.
An interesting note: I don’t necessarily want to start a debate about the merits of academia, but “fund a smart motivated youngster without a plan for 3 years with little evaluation” sounds a lot like “fund more exploratory AI safety PhDs” to me. If anyone wants to do an AI safety PhD (e.g., with these supervisors) and needs funding, I’m happy to evaluate these with my Manifund Regrantor hat on.
That would only work for people with the capacity to not give a fuck what anyone around them thinks, especially including the person funding and advising them. And that’s arguably unethical depending on context.
I like Adam’s description of an exploratory AI safety PhD:
Ex ante funding guarantees, like The Vitalik Buterin PhD Fellowship in AI Existential Safety or Manifund or other funders, mitigate my concerns around overly steering exploratory research. Also, if one is worried about culture/priority drift, there are several AI safety offices in Berkeley, Boston, London, etc. where one could complete their PhD while surrounded by AI safety professionals (which I believe was one of the main benefits of the late Lightcone office).
From the section you linked:
These would be exactly the “anyone around them” about whose opinion they would have to not give a fuck.
I don’t know a good way to do this, but maybe a pointer would be: funders should explicitly state something to the effect of:
“The purpose of this PhD funding is to find new approaches to core problems in AGI alignment. Success in this goal can’t be judged by an existing academic structure (journals, conferences, peer-review, professors) because there does not exist such a structure aimed at the core problems in AGI alignment. You may if you wish make it a major goal of yours to produce output that is well-received by some group in academia, but be aware that this goal would be non-overlapping with the purpose of this PhD funding.”
The Vitalik fellowship says:
Despite being an extremely reasonable (even necessary) requirement, this is already a major problem according to me. The problem is that (IIUC—not sure) academics are incentivized to, basically, be dishonest, if it gets them funding for projects / students. Of the ~dozen professors here (https://futureoflife.org/about-us/our-people/ai-existential-safety-community/) who I’m at least a tiny bit familiar with, I think maybe 1.5ish are actually going to happily support actually-exploratory PhD students. I could be wrong about this though—curious for more data either way. And how many will successfully communicate, to the sort of person who would take a real shot at exploratory conceptual research if given the opportunity, that they would in fact support such research? I don’t know. Zero? One? And how would someone sent to the FLI page know of the existence of that professor?
Again, reasonable, but… Needs more clarity on what is expected, and what is not expected.
What does this even mean? This webpage doesn’t get it. We’re trying to buy something that isn’t something someone can already write a technical specification of.
I want to sidestep critique of “more exploratory AI safety PhDs” for a moment and ask: why doesn’t MIRI sponsor high-calibre young researchers with a 1-3 year basic stipend and mentorship? And why did MIRI let Vivek’s team go?
I don’t speak for MIRI, but broadly I think MIRI thinks that roughly no existing research is hopeworthy, and that this isn’t likely to change soon. I think that, anyway.
In discussions like this one, I’m conditioning on something like “it’s worth it, these days, to directly try to solve AGI alignment”. That seems assumed in the post, seems assumed in lots of these discussions, seems assumed by lots of funders, and it’s why above I wrote “the main direct help we can give to AGI alignment” rather than something stronger like “the main help (simpliciter) we can give to AGI alignment” or “the main way we can decrease X-risk”.
I’m reading this as you saying something like “I’m trying to build a practical org that successfully onramps people into doing useful work. I can’t actually do that for arbitrary domains that people aren’t providing funding for. I’m trying to solve one particular part of the problem and that’s hard enough as it is.”
Is that roughly right?
Fwiw I appreciate your Manifund regrantor Request for Proposals announcement.
I’ll probably have more thoughts later.
Yes to all this, but also I’ll go one level deeper. Even if I had tons more Manifund money to give out (and assuming all the talent needs discussed in the report are saturated with funding), it’s not immediately clear to me that “giving 1-3 year stipends to high-calibre young researchers, no questions asked” is the right play if they don’t have adequate mentorship, the ability to generate useful feedback loops, researcher support systems, access to frontier models if necessary, etc.
A few points here (all with respect to a target of “find new approaches to core problems in AGI alignment”):
It’s not clear to me what the upside of the PhD structure is supposed to be here (beyond respectability). If the aim is to avoid being influenced by most of the incentives and environment, that’s more easily achieved by not doing a PhD. (to the extent that development of research ‘taste’/skill acts to service a publish-or-perish constraint, that’s likely to be harmful)
This is not to say that there’s nothing useful about an academic context—only that the sensible approach seems to be [create environments with some of the same upsides, but fewer downsides].
I can see a more persuasive upside where the PhD environment gives:
Access to deep expertise in some relevant field.
The freedom to explore openly (without any “publish or perish” constraint).
This seems likely to be both rare, and more likely for professors not doing ML. I note here that ML professors are currently not solving fundamental alignment problems—we’re not in a [Newtonian physics looking for Einstein] situation; more [Aristotelian physics looking for Einstein]. I can more easily imagine a mathematics PhD environment being useful than an ML one (though I’d expect this to be rare too).
This is also not to say that a PhD environment might not be useful in various other ways. For example, I think David Krueger’s lab has done and is doing a bunch of useful stuff—but it’s highly unlikely to uncover new approaches to core problems.
For example, of the 213 concrete problems posed here, how many would lead us to think [it’s plausible that a good answer to this question leads to meaningful progress on core AGI alignment problems]? 5? 10? (many more can be a bit helpful for short-term safety)
There are a few where sufficiently general answers would be useful, but I don’t expect such generality—both since it’s hard, and because incentives constantly push towards [publish something on this local pattern], rather than [don’t waste time running and writing up experiments on this local pattern, but instead investigate underlying structure].
I note that David’s probably at the top of my list for [would be a good supervisor for this kind of thing, conditional on having agreed the exploratory aims at the outset], but the environment still seems likely to be not-close-to-optimal, since you’d be surrounded by people not doing such exploratory work.
I do think category theory professors or similar would be reasonable advisors for certain types of MIRI research.
I broadly agree with this. (And David was like .7 out of the 1.5 profs on the list who I guessed might genuinely want to grant the needed freedom.)
I do think that people might do good related work in math (specifically, probability/information theory, logic, etc.--stuff about formalized reasoning), philosophy (of mind), and possibly in other places such as theoretical linguistics. But this would require that the academic context is conducive to good novel work in the field, which lower bar is probably far from universally met; and would require the researcher to have good taste. And this is “related” in the sense of “might write a paper which leads to another paper which would be cited by [the alignment textbook from the future] for proofs/analogies/evidence about minds”.
Have you looked through the FLI faculty listed there?
How many seem useful supervisors for this kind of thing? Why?
If we’re sticking to the [generate new approaches to core problems] aim, I can see three or four I’d be happy to recommend, conditional on their agreeing upfront to the exploratory goals, and that publication would not be necessary (or a very low concrete number agreed upon).
There are about ten more that seem not-obviously-a-terrible-idea, but probably not great (e.g. those who I expect have a decent understanding of the core problems, but basically aren’t working on them).
The majority don’t write anything that suggests they know what the core problems are.
For almost all of these supervisors, doing a PhD would seem to provide quite a few constraints, undesirable incentives, and an environment that’s poor.
From an individual’s point of view this can still make sense, if it’s one of the only ways to get stable medium-term funding.
From a funder’s point of view, it seems nuts.
(again, less nuts if the goal were [incremental progress on prosaic approaches, and generation of a respectable publication record])
As a concrete proposal, if anyone wants to reboot Refine or similar, I’d be interested to consider that while wearing my Manifund Regrantor hat.
Yeah that looks good, except that it takes an order of magnitude longer to get going on conceptual alignment directions. I’ll message Adam to hear what happened with that.
For reference there’s this: What I learned running Refine
When I talked to Adam about this (over 12 months ago), he didn’t think there was much to say beyond what’s in that post. Perhaps he’s updated since.
My sense is that I view it as more of a success than Adam does. In particular, I think it’s a bit harsh to solely apply the [genuinely new directions discovered] metric. Even when doing everything right, I expect the hit rate to be very low there, with [variation on current framing/approach] being the most common type of success.
Agreed that Refine’s timescale is clearly too short.
However, a much longer program would set a high bar for whoever’s running it.
Personally, I’d only be comfortable doing so if the setup were flexible enough that it didn’t seem likely to limit the potential of participants (by being less productive-in-the-sense-desired than counterfactual environments).
Ah thanks!
Mhm. In fact I’d want to apply a bar that’s even lower, or at least different: [the extent to which the participants (as judged by more established alignment thinkers) seem to be well on the way to developing new promising directions—e.g. being relentlessly resourceful including at the meta-level; having both appropriate Babble and appropriate Prune; not shying away from the hard parts].
Agree that this is an issue, but I think it can be addressed—certainly at least well enough that there’d be worthwhile value-of-info in running such a thing.
I’d be happy to contribute a bit of effort, if someone else is taking the lead. I think most of my efforts will be directed elsewhere, but for example I’d be happy to think through what such a program should look like; help write justificatory parts of grant applications; and maybe mentor / similar.
Report back if you get details, I’m curious.
See Joe’s sibling comment
https://www.lesswrong.com/posts/QzQQvGJYDeaDE4Cfg/talent-needs-in-technical-ai-safety#JP5LA9cNgqxgdAz8Z
I have, and I also remember seeing Adam’s original retrospective, but I always found it unsatisfying. Thanks anyway!
I think the title of this post is actively misleading if that’s what you’re trying to convey. “Defining” a term to mean some specific thing, which does not match how lots of readers will interpret it (especially in the title!), will in general make your writing not communicate what your “definition” claims to be trying to communicate.
If the post is about job openings and grant opportunities, then it should say that at the top, rather than “talent needs”.
I can understand if some people are confused by the title, but we do say “the talent needs of safety teams” in the first sentence. Granted, this doesn’t explicitly reference “funding opportunities” too, but it does make it clear that it is the (unfulfilled) needs of existent safety teams that we are principally referring to.
We changed the title. I don’t think keeping the previous title was aiding understanding at this point.