Reflections on my first year of AI safety research
Last year, I wrote a post about my upskilling in AI alignment. To this day, I still occasionally get people reaching out to me because of that article to ask questions about getting into the field themselves. I’ve also had several occasions to link the article to people who asked me about getting into the field through other channels, like my local AI Safety group.
In short, people clearly found the post useful (credit to the EA Forum for keeping the article findable by those who need it, a year after publication!), so they would likely find a sequel useful too. This post is that sequel, though reading the first post isn’t necessary to follow this one.
The major lesson of this post is this: I made a ton of mistakes, but those mistakes taught me things. By staying open to that feedback and keeping my eye on the ball, I eventually found work in the field that suited me. Just like the previous post, I’m happy to answer more questions via PM or in the comments.
It’s worth noting that this isn’t a bold story of me getting a ton of stuff done. Most of the story, by word count, is me flailing around unsure of what to do and making a lot of mistakes along the way. I don’t think you’ll learn a lot about how to be a good researcher from this post, but I hope you might pick up some tips on how to avoid being a bad one.
Summary
I was a software engineer for 3-4 years with little to no ML experience before I was accepted for my initial upskilling grant. (More details are in my initial post.)
I attended SERI MATS, working on aligning language models under Owain Evans. Due to a combination of factors, some my fault and some not, I don’t feel I got a great deal done.
I decided to pivot away from evals towards mechanistic interpretability since I didn’t see a good theory of change for evals—this was two weeks before GPT-4 came out and the whole world sat up and took notice. Doh!
After upskilling in mechanistic interpretability, I struggled quite a bit with the research. I eventually concluded that it wasn’t for me, but was already funded to work on it. Fortunately I had a collaborator, and eventually I wound up using my engineering skills to accelerate his research instead of trying to contribute to the analysis directly.
After noticing my theory of change for evals had changed now that governments and labs were committing to red-teaming, I applied for some jobs in the space. I received an offer to work in the UK’s task force, which I accepted.
List of Lessons
It’s important to keep two things in mind: your theory of change for how your work helps reduce existential risk, and your comparative advantage within the field. These two things determined what I should work on, and keeping them updated was crucial to finding a good path in the end.
Poor productivity is more likely to be situational than you might think, especially if you’re finding yourself having unusual difficulty compared to past projects or jobs. It’s worth considering how your situation might be tweaked before blaming yourself.
Trying out different subfields is useful, but don’t be afraid to admit when one isn’t working out as well as you’d like. See the first lesson.
If you’re going to go to a program like SERI MATS, do so because you have a good idea of what you want, not just because it’s the thing to do or it seems generically helpful. I’m not saying you can’t do such a program for those reasons, but it’s worth thinking twice about.
It is entirely possible to make mistakes, even several of them, and still wind up finding work in the field. There is no proper roadmap, everyone needs to figure things out as they go. While it’s worth having a point at which you’re willing to give up, the first obstacle or failure probably isn’t the right point for that.
SERI MATS
SERI MATS was a great place to be, but I definitely feel I miscalculated in several ways here. Let’s start with the good. SERI MATS ran a great many useful events that helped me learn more and get to know people in alignment. (In fact, my next project would not have happened had I not met my collaborator in the Lightcone offices, which SERI MATS encouraged people to visit.) I’d definitely recommend it to people who have a clear idea what they’d like to do in alignment and are ready to move to the next level.
Unfortunately, I was not one of those people. My goal was to “get into SERI MATS”, when my goal should have been something like “do good research in this particular subfield”, with getting into SERI MATS as a solid step towards it. Because my goal was vaguely just getting into SERI MATS, I applied to a wide variety of mentors: those I expected might take me and whose work seemed at least somewhat relevant to my skills and/or interests. I don’t think this is the best approach, and in most cases I’d recommend a more focused plan.
I ended up being accepted by Owain Evans and Beth Barnes. At the time, my reasoning went like this:
Beth is working on evaluations.
The theory of change behind evaluations seems poor to me. You could make a scary demo of a system doing something bad, but who will pay attention? Governments don’t seem to care, and I’m not sure labs will care enough to actually act. Without that, any number of scary demos won’t help.
Owain is working on language model alignment. This doesn’t seem to be focused on the hard parts of the alignment problem, so I don’t know if it will help me solve the alignment problem, but it could be useful in worlds where LLMs scale up towards AGI.
Thus, I’ll go with Owain.
I didn’t know at the time that ARC Evals (now METR, where Beth works) was working on GPT-4 evaluations, and I definitely didn’t realise how GPT-4 would shift the Overton window. I can forgive myself the second, but if I’d done some more research (i.e., reached out to Beth and Owain to discuss things) I might have gotten information that would have changed my mind about the first. Had I known that at least one lab was taking evaluations very seriously, that could have changed my calculus considerably.
At SERI MATS, I feel like I had conflicting ideas in my head. I wanted to “solve the alignment problem”. This framing seemed encouraged by the people around me there: SERI MATS is full of very ambitious people, after all, and I like that there are people trying to solve the entire problem, including what they see as the hard bits. However, what I hadn’t yet accepted (or rather, I had realised it, but didn’t want to seriously grapple with the implications) was that ambitious worst-case alignment theorising was simply not my comparative advantage. I was spending a lot of time trying to improve my mathematics skills, for instance. I’m a software engineer, not a mathematician, and I’m never going to reach the level of the people who have pure maths degrees and competed internationally in the Maths Olympiad. If it turns out that those are the only skills that matter for solving alignment, I’m still not the right person to apply them. I should be focused on my strengths, so long as I can see a realistic path for those strengths to help at all.
It also took us a while to get projects assigned. Owain was a lot more hands-on than I expected, which in the fullness of time I recognise as a good thing. He had a series of projects he wanted done, and we would set ourselves up into teams to work on them under his supervision, but they unfortunately weren’t quite shovel-ready when we arrived. While we were waiting on projects, another scholar and I had an idea: testing whether chain-of-thought prompting improved GPT-3’s ability to be deceptive. We found that it did, and wrote it up in the course of 1-2 days. I found this extremely exciting: a very fast feedback loop that produced something interesting very quickly. I imagine this must be what it’s like to work at a startup, which I haven’t actually done. Owain asked us not to publish this for fear of capabilities improvements. (It’s now safe to share the idea a year later, because Apollo Research created an improved version to demo at the AI Safety Summit.)
While I disagreed about the danger here, Owain was our mentor and I agreed to this decision. It did have the unfortunate side-effect of curbing my agency at SERI MATS a bit. In fairness to Owain, I should have been more resilient here: ideas don’t always work out, and if the idea simply hadn’t improved the model’s deceptive ability, the result would have been the same.
Thus, due to a combination of not having a project early, not wanting to push things independently, and spending some of my time on counterproductive things (trying to fix my weakness in maths rather than build on my strengths in engineering), I didn’t get a lot done at SERI MATS. Because of my earlier decision to apply to whichever mentors might take me, I also wasn’t super excited about the work itself. During my time there I became interested in mechanistic interpretability. It was the only agenda that seemed both tractable for me to get into and robustly useful even if alignment turned out to be relatively hard. I decided to get into it upon returning home after SERI MATS.
While this didn’t end up going that well in practice, I do endorse the decision-making process I went through back then, with the information I actually had at the time.
Mechanistic Interpretability
Upon my return, I began to follow Neel’s excellent resources to get into mechanistic interpretability. Things went well for a while, as a lot of the skills required were engineering-based, and the linear algebra I had been learning was actually useful! I had been studying Linear Algebra Done Right. In retrospect this was far more rigorous and time-consuming than I actually needed, but to its credit it did help me thoroughly understand concepts that matter in mech interp, like what a basis really is. There were probably more efficient ways to gain this knowledge, but I did learn it well.
I decided to apply for SERI MATS again, this time only under Neel. I did check with him about applying a second time, and he said that he would teach pretty different things and I likely hadn’t saturated the value of mentorship yet. The research sprint (two weeks of research with a teammate) is where things started to go south.
The teammate I ended up with was much younger than me, quite brilliant, and already had a project idea in mind, which I went along with. During the first week, I struggled quite a lot to keep up with him. He’d been thinking about the project for longer, and because of our timezones, our day would start with him working, then us meeting in the middle, then me working until the end. That meant he was always a day ahead of me.
After the first week, it was obvious that he was doing most of the work while I plodded along behind. He decided we should each take the work that had been done and build on it independently, each submitting that as our final submission. To his credit, when he told Neel this, he did so anonymously, so that in theory Neel wouldn’t know which of us had been doing most of the work during the first week.
Without this partner, however, I had no clue what I was doing. After a couple of days of struggling, I gave up on the sprint. In retrospect, this team split was the best thing that could have happened to me. My teammate did end up getting accepted into SERI MATS. If he had carried the project and gotten it accepted as a team effort, I would have gone with him. I would then have spent two months dealing with the same problem, likely making me unproductive and miserable. The problem, I realised, was not with the project. The problem was that I wasn’t really suited for mechanistic interpretability research.
Unfortunately, I’d already accepted funding to work on a mech interp project for the next six months. Oops.
Project
The project I was working on (which eventually led to this post with Joseph Bloom) was about interpreting Decision Transformers: transformers trained on sequences from a reinforcement learning agent that had learned to succeed at a task, so that the transformer, as a predictor, learns to predict the correct moves as well. The Decision Transformer was also trained with a varying reward parameter so that it would learn to lose when asked. The plan was to use transformer-based interpretability techniques to look for RL-flavoured things like “goals”, which were more likely to appear in a model like this than in GPT-2.
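For readers unfamiliar with the setup, here’s a rough sketch of the kind of sequence a decision transformer is trained on, following the standard (return-to-go, state, action) format. This is illustrative only, not our project’s code: the environment, token names, and reward scheme here are made up.

```python
# Illustrative only (not our project's code): the kind of training sequence a
# decision transformer sees. Each timestep contributes (return-to-go, state, action)
# tokens, and the model is trained to predict the action given everything before it.
# At sampling time, conditioning on a low initial return-to-go is how a model
# trained on varied rewards can be "asked to lose".

trajectory = [
    # (reward, state, action) collected from an RL agent in some toy environment
    (0.0, "s0", "right"),
    (0.0, "s1", "down"),
    (1.0, "s2", "done"),  # episode ends with reward 1 (a win)
]

def to_training_sequence(trajectory):
    """Interleave return-to-go, state, and action tokens, Decision Transformer style."""
    rewards = [r for r, _, _ in trajectory]
    sequence = []
    for t, (_, state, action) in enumerate(trajectory):
        return_to_go = sum(rewards[t:])  # total reward from this step to the end
        sequence += [("rtg", return_to_go), ("state", state), ("action", action)]
    return sequence

print(to_training_sequence(trajectory))
# [('rtg', 1.0), ('state', 's0'), ('action', 'right'), ('rtg', 1.0), ('state', 's1'), ...]
```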
Originally, Joseph and I were going to work together in Australia—him in Melbourne, myself in Brisbane. He did end up moving to the UK instead, which complicated things. After Neel’s SERI MATS training ended, I tried working with Joseph on this project, and we had some difficulties. On my end I had a few problems:
I had a LOT of environment problems at first, before I eventually managed to solve them. Joseph had made a basic Docker container, but didn’t use it himself—to Joseph’s credit, it still worked without a ton of tweaking. This is rare for environments that aren’t used often!
I struggled to put hours in, which I eventually realised was because I didn’t know how to get started on the task. At the time I assumed this was a generic productivity problem, not a situational one.
Sometimes I needed more information to work on a task, but I tended to assume this was my fault. If I were smarter, I would be able to do it, so I didn’t want to bother Joseph for more information. I now realise this is silly—whether the fault is mine or not, if I need more context to solve a problem, I need more context, and it helps nobody to delay asking about this too much.
And, as before, I just wasn’t very well suited to MI research.
It took a fair bit of time, but eventually Joseph and I hit upon a decent formula for my helping out. Joseph would come up with a bunch of engineering tasks as he worked that he would like to see done, but were on the back burner. I would then do those things, preventing him from having to do them later and thus speeding up his research.
Essentially, this is the concept of comparative advantage from economics. Joseph is a far better mech interp researcher than I am. I suspect he’s even a better software engineer than I am, at least on this particular project, where he had more context. But if Joseph can produce 5x the research units that I can and only 2x the engineering units, then by spending my time producing engineering units I free Joseph to spend more time producing research units, which is better for everyone. Even though I’m worse than him at both tasks, I can still contribute meaningfully by letting him focus on his greatest relative strength, which is exactly what I did. By taking on smaller tasks in greater numbers, I kept the overhead of him explaining them to me low, and so I was able to be net useful to the project. Our final post reflects this: the analysis is basically entirely Joseph’s, but many of the figures have been either produced or improved by me at some point.
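To make that arithmetic concrete, here’s a toy calculation. The numbers are made up purely for illustration and don’t reflect anyone’s actual productivity.

```python
# Toy comparative-advantage numbers (made up): even if Joseph is better at both
# research and engineering, the project gets more done when I take the engineering
# and he focuses on research.

HOURS = 10                # hours each of us has in a week
ENGINEERING_NEEDED = 10   # engineering units the project needs either way

ME = {"research": 1, "engineering": 1}      # units produced per hour
JOSEPH = {"research": 5, "engineering": 2}  # better than me at both

# Option A: Joseph covers the engineering himself; I do research.
joseph_eng_hours = ENGINEERING_NEEDED / JOSEPH["engineering"]   # 5 hours
research_a = (HOURS - joseph_eng_hours) * JOSEPH["research"] \
             + HOURS * ME["research"]                           # 25 + 10 = 35

# Option B: I cover the engineering; Joseph does research full-time.
my_eng_hours = ENGINEERING_NEEDED / ME["engineering"]           # all 10 of my hours
research_b = HOURS * JOSEPH["research"]                         # 50

print(research_a, research_b)  # 35.0 50 -> same engineering done, more research in option B
```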
This was a valuable lesson: rather than trying to force myself to do something I’m bad at, I should try to contribute where I’m strongest. Of course, there are limits to this. I still need to make sure the thing I’m contributing is actually useful. But I suspect this isn’t a large problem for the particular combination of software engineering and AI alignment: there’s plenty of useful work to be done here with this skillset.
Applications and Future Plans
As the year neared its end, I needed to decide what to do next. I knew mech interp wasn’t the field I wanted to be in. But as I re-evaluated my year, I thought about how I had left evals because I lacked a solid theory of change for how they might help. Since then, GPT-4 had come out, and the world had begun to take notice. AI labs committed to red-teaming, the UK government organised the AI Safety Summit, and it even committed £100m to a task force devoted to AI safety. Suddenly, the idea of a scary demo being noticed by the broader world, or of evaluation results causing the main labs to take notice and slow down, seemed much more reasonable.
So I decided to leverage my experience with LLMs and my software engineering skills to apply for some evals positions. I got pretty deep into Apollo Research’s process, much further than with other jobs I’d applied to earlier in the year. While I didn’t get that position, I was eventually hired by the UK’s task force. They weren’t on my radar at first, but once information started coming in that they seemed to be working on the right problems for the right reasons, I sent in an application. Those initial impressions were reinforced as I went through the process, and I now think I can do some good work there, with a theory of change for why it matters. I wouldn’t say I’ve “succeeded in the field” yet, but I have put myself in a position to have a very good shot at doing so in 2024.
So, what were the lessons I learned over the year? The main one is this: looking back, there were a lot of mistakes, some avoidable, some not. The thing that saved me was being willing to look into the abyss: to ask whether my plans were wrong, or whether my previous decisions no longer applied. This let me notice when my reasoning for leaving evals had become obsolete, and let me make the move back. I think this skill will be useful when I actually go to London, too. I’m giving up my apartment, getting rid of most of my stuff, and leaving my current life behind; there’s a temptation to rationalise that choice even if it turns out to be the wrong one.
I do need to keep my theory of change in mind, and ask myself whether this job offers a reasonable route to actually helping reduce existential risk. The number one way this could fail, in my opinion, follows from the fact that the task force’s main goal is to provide information for use in policy. If policymakers ignore that information, the task force’s impact on x-risk becomes much, much weaker. The main path to impact here is providing information that helps improve UK policy, which hopefully serves as a blueprint for other countries to follow. That can’t happen if the task force’s recommendations are ignored. I’ll need to do my best to ensure this doesn’t happen, and to notice if it does, so I can pursue the best path available. I’ve begun distinguishing my “job” from my “mission”: the job is what I was hired to do, and the mission is what I actually want to achieve. The two are correlated (if not, why do the job at all?), but they aren’t quite the same thing.
In the end, what matters is reducing existential risk, not getting a job in the field. In 2024, I hope I’m finally in a position to start doing just that.
I appreciate you writing this! It really helped me get a more concrete sense of what it is like for new alignment researchers (like me) to be aiming to make good things happen.
Note that “capabilities improvements” can mean a lot of things here. The first thing that comes to mind is that publicizing this differentially increases the damage API users could do with access to SOTA LLMs, which makes sense to me. It also makes sense to me that Owain would consider publishing the idea not worth the downside, simply because, off the top of my head, there isn’t much benefit to publicizing it for either alignment researchers or capabilities researchers. OpenAI capabilities people have probably already tried such experiments internally and know about this, and alignment researchers probably wouldn’t be able to build on top of the finding (here I mostly have interpretability researchers in mind).
Oh yeah, I have had this issue many (but not all) times with mentors in the past. I suggest not simply trying to rationalize that emotion away, though, and perhaps trying to actually debug it. “Whether the fault is mine or not”, sure, but if my brain is tracking whether I am an asset or a liability to the project, then it is giving me important information in the form of my emotions.
Anyway, I’m glad you now have a job in the alignment field!
Thanks for the post!
Sorry if I’m prodding too deep, and feel no need to respond. I always feel a bit curious about claims such as this.
I guess I have a few questions (which you don’t need to answer):
Do you have a hypothesis about the underlying reason you’re unsuited to this type of research? E.g. do you think you might be insufficiently interested/motivated, have insufficient conscientiousness or intelligence, etc.?
How confident are you that you just “aren’t suited” to this type of work? To operationalize, maybe given e.g. two more years of serious effort, at what odds would you bet that you still wouldn’t be very competitive at mechanistic interpretability research?
What sort of external feedback are you getting vis-à-vis your suitability for this type of work? E.g. have you received feedback from Neel in this vein? (I understand that people are probably averse to giving this type of feedback, so there might be many false negatives.)
Concrete feedback signals I’ve received:
I don’t find myself excited about the work. I’ve never been properly nerd-sniped by a mechanistic interpretability problem, and I find the day-to-day work to be more drudgery than exciting, even though the overall goal of the field seems like a good one.
When left to do largely independent work, after doing the obvious first thing or two (“obvious” at the level of “These techniques are in Neel’s demos”) I find it hard to figure out what to do next, and hard to motivate myself to do more things if I do think of them because of the above drudgery.
I find myself having difficulty backchaining from the larger goal to the smaller one. I think this is a combination of a motivational issue and having less grasp on the concepts.
By contrast, in evaluations, none of this is true. I solve problems more effectively, I find myself actively interested in the problems (both the ones I’m working on and the ones I’m not), and I’m better able to reason about how they matter for the bigger picture.
I’m not sure how much of each is a contributor, but I suspect that if I was sufficiently excited about the day-to-day work, all the other problems would be much more fixable. There’s a sense of reluctance, a sense of burden, that saps a lot of energy when it comes to doing this kind of work.
As for #2, I should clarify what I mean, since there are two ways you could read “not suited”:
I will never be able to become good enough at this for my funding to be net-positive. There are fundamental limitations to my ability to succeed in this field.
I should not be in this field. The amount of resources required to make me competitive in this field is significantly larger than it would be for other people who would do equally good work, and this is not true of other subfields in alignment.
I view my use of “I’m not suited” as more like 2 than 1. I think there’s a reasonable chance that, given enough time with proper effort and mentorship in a proper organisational setting (being in a setting like this is important for me to reliably complete work that doesn’t excite me), I could eventually do okay in this field. But I also think there are other people who would do better, faster, and be a better use of an organisation’s money than me.
This doesn’t feel like the case in evals. I feel like I can meaningfully contribute immediately, and I’m sufficiently motivated and knowledgeable that I can understand the difference between my job and my mission (making AI go well) and feel confident that I can take actions to succeed in both.
If Omega came down from the sky and said “Mechanistic interpretability is the only way you will have any impact on AI alignment—it’s this or nothing” I might try anyway. But I’m not in that position, and I’m actually very glad I’m not.