Thanks for the latter point, glad you got that impression!
These are super valid concerns, and it’s true that there’s lots of information we won’t have for a while. That said, we also have positive evidence from the first iteration of ARENA (which is about a year old now). There were only 5 full-time participants, and they’ve all gone on to do stuff I’m excited about, including the following (note that some of these 5 have done more than one of the things on this list):
internships at CHAI,
working with Owain Evans (including some recent papers),
building a community around open-source interpretability tooling,
employment by EleutherAI for interp,
participating in SERI MATS streams,
work trialling at LEAP labs,
being funded to work on independent research on ELK.
I’d also point to programs like MLAB which have similar goals and (as far as I’m aware) an even higher success rate of getting people into alignment work. Not saying that nobody from these programs goes on to do capabilities (I imagine at least a few do), but I’d be very surprised if this outweighs the positive effect from people going on to do alignment work.
One last point here: a big part of the benefit from programs like MLAB / ARENA is the connections made with people in alignment and the sense of motivation & community, not just the skilling up (anecdotally, I quit my job and started working in alignment full-time after doing MLAB2, despite not then being at a point where I could apply for full-time jobs). I also get this impression from conversations w/ people who participated in & ran MLABs in the past. It’s not as simple as “go into upskilling programs, become super competent at either alignment or capabilities work, then choose one or the other”; there’s a myriad of factors which I expect to update people towards work in alignment after they go through programs like these.
If check-ins with ARENA 1.0 or 2.0 participants (or indeed MLAB participants) a year from now reveal that a nontrivial fraction of them are working in capabilities, then I’d certainly update my position here, but I’ll preregister that this doesn’t seem at all likely to me. It’s true that alignment can be a messy field with limited opportunities and few clear paths, but this is becoming less of a problem as the years go on.