I do AI Alignment research. Currently at METR, but previously at: Redwood Research, UC Berkeley, Good Judgment Project.
I’m also a part-time fund manager for the LTFF.
Obligatory research billboard website: https://chanlawrence.me/
Makes sense, thanks!
For compute I’m using hardware we have locally with my employer, so I have not tracked what the equivalent cost of renting it would be, but I guess it would be of the same order of magnitude as the API costs, or a factor of a few larger.
It’s hard to say because I’m not even sure you can rent Titan Vs at this point,[1] and I don’t know what your GPU utilization looks like, but I suspect API costs will dominate.
An H100 box is approximately $2/hour/GPU, and A100 boxes are a fair bit under $1/hour (see e.g. pricing on Vast AI or Shadeform). And even A100s are ridiculously better than a Titan V, in that they have 40 or 80 GB of memory and are (pulling a number out of thin air) 4-5x faster.
So if o1 costs $2 per task and it’s 15 minutes per task, compute will be an order of magnitude cheaper. (Though as for all similar evals, the main cost will be engineering effort from humans.)
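To make the back-of-the-envelope arithmetic concrete, here’s a small sketch (the per-task time, GPU price, and API cost are the rough illustrative figures above, not measured numbers):

```python
# Back-of-the-envelope cost comparison with illustrative numbers
# (actual GPU utilization, task times, and API pricing will vary).
gpu_price_per_hour = 1.0   # ~A100 rental price, $/hour/GPU
task_time_hours = 15 / 60  # ~15 minutes of GPU time per task
api_cost_per_task = 2.0    # assumed o1 API cost per task, $

compute_cost_per_task = gpu_price_per_hour * task_time_hours  # ~$0.25
print(f"compute: ${compute_cost_per_task:.2f}/task, API: ${api_cost_per_task:.2f}/task")
print(f"API is ~{api_cost_per_task / compute_cost_per_task:.0f}x the compute cost")
```

With these numbers the API comes out roughly 8x more expensive than the compute, which is where the order-of-magnitude claim above comes from.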
I failed to find an option to rent them online, and I suspect the best way I can acquire them is by going to UC Berkeley and digging around in old compute hardware.
This is really impressive—could I ask how long this project took, how long each eval takes to run on average, and what you spent on compute/API credits?
(Also, I found the preliminary BoK vs 5-iteration results especially interesting, especially the speculation on reasoning models.)
(Disclaimer: have not read the piece in full)
If “reasoning models” count as a breakthrough of the relevant size, then I argue that there have been quite a few of these in the last 10 years: skip connections/residual stream (2015-ish), transformers instead of RNNs (2017), RLHF/modern policy gradient methods (2017ish), scaling hypothesis (2016-20 depending on the person and which paper), Chain of Thought (2022), massive MLP MoEs (2023-4), and now Reasoning RL training (2024).
I think the title greatly undersells the importance of these statements/beliefs. (I would’ve preferred either part of your quote or a call to action.)
I’m glad that Sam is putting in writing what many people talk about. People should read it and take them seriously.
Nit:
> OpenAI presented o3 on the Friday before Thanksgiving, at the tail end of the 12 Days of Shipmas.
Should this say Christmas?
I think writing this post was helpful to me in thinking through my career options. I’ve also been told by others that the post was quite valuable to them as an example of someone thinking through their career options.
Interestingly, I left METR (then ARC Evals) about a month and a half after this post was published. (I continued to be involved with the LTFF.) I then rejoined METR in August 2024. In between, I worked on ambitious mech interp and did some late stage project management and paper writing (including some for METR). I also organized a mech interp workshop at ICML 2024, which, if you squint, counts as “onboarding senior academics”.
I think leaving METR was a mistake ex post, even if it made sense ex ante. I think my ideas around mech interp when I wrote this post weren’t that great, even if I thought the projects I ended up working on were interesting (see e.g. Compact Proofs and Computation in Superposition). While the mech interp workshop was very well attended (e.g. the room was so crowded that people couldn’t get in due to fire code) and pretty well received, I’m not sure how much value it ended up producing for AIS. Also, I think I was undervaluing the resources available to METR as well as how much I could do at METR.
If I were to make a list for myself in 2023 using what I know now, I’d probably have replaced “onboarding senior academics” with “get involved in AI policy via the AISIs”, and instead of “writing blog posts or takes in general”, I’d have the option of “build common knowledge in AIS via pedagogical posts”. Though realistically, knowing what I know now, I’d have told my past self to try to better leverage my position at METR (and provided him with a list of projects to do at METR) instead of leaving.
Also, I regret both that I called it “ambitious mech interp”, and that this post became the primary reference for what this term meant. I should’ve used a more value-neutral name such as “rigorous model internals” and written up a separate post describing it.
I think this post made an important point that’s still relevant to this day.
If anything, this post is more relevant in late 2024 than in early 2023, as the pace of AI makes ever more people want to be involved, while more and more mentors have moved towards doing object level work. Due to the relative reduction in capacity for evaluating new AIS researchers, there’s more reliance on systems or heuristics to evaluate people now than in early 2023.
Also, I find it amusing that without the parenthetical, the title of the post makes another important point: “evals are noisy”.
I think this post was useful in the context it was written in and has held up relatively well. However, I wouldn’t actively recommend it to anyone as of Dec 2024 -- both because the ethos of the AIS community has shifted, making posts like this less necessary, and because many other “how to do research” posts have since been written that contain the same advice.
This post was inspired by conversations I had in mid-late 2022 with MATS mentees, REMIX participants, and various bright young people who were coming to the Bay to work on AIS (collectively, “kiddos”). The median kiddo I spoke with had read a small number of ML papers and a medium amount of LW/AF content, and was trying to string together an ambitious research project from several research ideas they recently learned about. (Or, sometimes they were assigned such a project by their mentors in MATS or REMIX.)
Unfortunately, I don’t think modern machine learning is the kind of field where you can take several research ideas, string them together, and have the research consistently work out of the box. Many high level claims even in published research papers are just… wrong, it can be challenging to reproduce results even when they are right, and even techniques that work reliably may not work for the reasons people think they do.
Hence, this post.
I think the core idea of this post held up pretty well with time. I continue to think that making contact with reality is very important, and I think the concrete suggestions for how to make contact with reality are still pretty good.
If I were to write it today, I’d probably add a fifth major reason for why it’s important to make quick contact with reality: mental health/motivation. That is, producing concrete research outputs, even small ones, feels pretty essential to maintaining motivation for the vast majority of researchers. My guess is I missed this factor because I focused on the content of research projects, as opposed to the people doing the research.
Over the past two years, and especially in 2024, the ethos of the AIS community has changed substantially toward empirical work.
The biggest part of this is because of the pace of AI. When this post was written, ChatGPT was a month old, and GPT-4 was still more than 2 months away. People both had longer timelines and thought of AIS in more conceptual terms. Many conceptual research projects of 2022 have fallen into the realm of the empirical as of late 2024.
Part of this is due to the rise of (dangerous capability) evals as a major AIS focus in 2023, which is both substantially more empirical compared to the median 2022 AIS research topic, and an area where making contact with reality can be as simple as “pasting a prompt into claude.ai”.
Part of this is due to Anthropic’s rise to being the central place for AIS researchers. “Being able to quickly produce ML results” is a major part of what it takes to get hired there as a junior researcher, and people know this.
Finally, there have been a decent number of posts or write-ups giving the same advice, e.g. Neel’s written advice for his MATS scholars and a recent Alignment Forum post by Ethan Perez.
As a result, this post feels much less necessary or relevant in late December 2024 than in December 2022.
Evan joined Anthropic in late 2022 no? (Eg his post announcing it was Jan 2023 https://www.alignmentforum.org/posts/7jn5aDadcMH6sFeJe/why-i-m-joining-anthropic)
I think you’re correct on the timeline. I remember Jade/Jan proposing DC Evals in April 2022 (which was novel to me at the time), and Beth started METR in June 2022, and I don’t remember there being such teams actually doing work (at least not publicly known) when she pitched me on joining in August 2022.
It seems plausible that Anthropic’s scaling laws project was already underway before then (and this is what they’re referring to), but proliferating QA datasets feels qualitatively different than DC Evals. Also, they were definitely doing other red teaming, just none that seems to have been DC Evals.
Otherwise, we could easily in the future release a model that is actually (without loss of generality) High in Cybersecurity or Model Autonomy, or much stronger at assisting with AI R&D, with only modest adjustments, without realizing that we are doing this. That could be a large or even fatal mistake, especially if circumstances would not allow the mistake to be taken back. We need to fix this.
[..]
This is a lower bound, not an upper bound. But what you need, when determining whether a model is safe, is an upper bound! So what do we do?
Part of the problem is the classic problem with model evaluations: elicitation efforts, by default, only ever provide existence proofs and rarely if ever provide completeness proofs. A prompt that causes the model to achieve a task provides strong evidence of model capability, but the space of reasonable prompts is far too vast to search exhaustively to truly demonstrate model incapability. Model incapability arguments generally rely on an implicit “we’ve tried as hard at elicitation as would be feasible post deployment”, but this is almost certainly not going to be the case, given the scale of pre-deployment evaluations vs post-deployment use cases.
The way you get a reasonable upper bound pre-deployment is by providing pre-deployment evaluators with some advantage over end-users, for example by using a model that’s not refusal trained or by allowing for small amounts of finetuning. OpenAI did do this in their original preparedness team bio evals; specifically, they provided experts with non-refusal-trained models. But it’s quite rare to see substantial advantages given to pre-deployment evaluators, for a variety of practical and economic reasons, and in-house usage likely predates pre-deployment capability/safety evaluations anyways.
Re: the METR evaluations on o1.
We’ll be releasing more details of our evaluations of the o1 checkpoint we evaluated, in the same style as our blog posts for o1-preview and Claude 3.5 Sonnet (Old). This includes both more details on the general autonomy capability evaluations as well as AI R&D results on RE-Bench.
Whereas the METR evaluation, presumably using final o1, was rather scary.
[..]
From the performance they got, I assume they were working with the full o1, but from the wording it is unclear that they got access to o1 pro?
Our evaluations were not on the released o1 (nor o1-pro); instead, we were provided with an earlier checkpoint of o1 (this is in the system card as well). You’re correct that we were working with a variant of o1 and not o1-mini, though.
If 70% of all observed failures are essentially spurious, then removing even some of those would be a big leap – and if you don’t even know how the tool-use formats work and that’s causing the failures, then that’s super easy to fix.
While I agree with the overall point (o1 is a very capable model whose capabilities are hard to upper bound), our criteria for “spurious” are rather broad, and include many issues that we don’t expect to be super easy to fix with only small scaffolding changes. In experiments with previous models, I’d say 50% of issues we classify as spurious are fixable with small amounts of effort.
Which is all to say, this may look disappointing, but it is still a rather big jump after even minor tuning
Worth noting that this was similar to our experiences w/ o1-preview, where we saw substantial improvements on agentic performance with only a few days of human effort.
I am worried that issues of this type will cause systematic underestimates of the agent capabilities of new models that are tested, potentially quite large underestimates.
Broadly agree with this point—while we haven’t seen groundbreaking advancements due to better scaffolding, there have been substantial improvements to o1-preview’s coding abilities post-release via agent scaffolds such as AIDE. I (personally) expect to see comparable increases for o1 and o1-pro in the coming months.
This is really good, thanks so much for writing it!
I’ve never heard of Whisper or Eleven labs until today, and I’m excited to try them out.
Yeah, this has been my experience using Grammarly pro as well.
I’m not disputing that they were trained with next token prediction log loss (if you read the tech reports they claim to do exactly this) — I’m just disputing the “on the internet” part, due to the use of synthetic data and private instruction following examples.
I mean, we don’t know all the details, but Qwen2 was explicitly trained on synthetic data from Qwen1.5 + “high-quality multi-task instruction data”. I wouldn’t be surprised if the same were true of Qwen 1.5.
From the Qwen2 report:
Quality Enhancement The filtering algorithm has been refined with additional heuristic and model-based methods, including the use of the Qwen models to filter out low-quality data. Moreover, these models are utilized to synthesize high-quality pre-training data. (Page 5)
[...]
Similar to previous Qwen models, high-quality multi-task instruction data is integrated into the Qwen2 pre-training process to enhance in-context learning and instruction-following abilities.
Similarly, Gemma 2 had its pretraining corpus filtered to remove “unwanted or unsafe utterances”. From the Gemma 2 tech report:
We use the same data filtering techniques as Gemma 1. Specifically, we filter the pretraining dataset to reduce the risk of unwanted or unsafe utterances, filter out certain personal information or other sensitive data, decontaminate evaluation sets from our pre-training data mixture, and reduce the risk of recitation by minimizing the proliferation of sensitive outputs. (Page 3)
[...]
We undertook considerable safety filtering of our pre-training data to reduce the likelihood of our pre-trained and fine-tuned checkpoints producing harmful content. (Page 10)
After thinking about it more, I think the LLaMA 1 refusals strongly suggest that this is an artefact of training data. So I’ve unendorsed the comment above.
It’s still worth noting that modern models generally have filtered pre-training datasets (if not wholly synthetic or explicitly instruction following datasets), and it’s plausible to me that this (on top of ChatGPT contamination) is a large part of why we see much better instruction following/more eloquent refusals in modern base models.
It’s worth noting that there are reasons to expect the “base models” of both Gemma 2 and Qwen 1.5 to demonstrate refusals—neither is trained on unfiltered webtext.
We don’t know what 1.5 was trained on, but we do know that Qwen2’s pretraining data both contains synthetic data generated by Qwen1.5, and was filtered using Qwen1.5 models. Notably, its pretraining data explicitly includes “high-quality multi-task instruction data”! From the Qwen2 report:
Quality Enhancement The filtering algorithm has been refined with additional heuristic and model-based methods, including the use of the Qwen models to filter out low-quality data. Moreover, these models are utilized to synthesize high-quality pre-training data. (Page 5)
[...]
Similar to previous Qwen models, high-quality multi-task instruction data is integrated into the Qwen2 pre-training process to enhance in-context learning and instruction-following abilities.
I think this had a huge effect on Qwen2: Qwen2 is able to reliably follow both the Qwen1.5 chat template (as you note) as well as the “User: {Prompt}\n\nAssistant: ” template. This is also reflected in their high standardized benchmark scores—the “base” models do comparably to the instruction finetuned ones! In other words, Qwen2 “base” models are pretty far from traditional base models a la GPT-2 or Pythia, as a result of explicit choices made when generating their pretraining data, and this explains their propensity for refusals. I wouldn’t be surprised if the same were true of the 1.5 models.
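For a concrete illustration, here’s a minimal sketch of how one could poke at this, assuming the Hugging Face “Qwen/Qwen2-7B” base checkpoint; the prompt and generation settings are illustrative, not the exact setup anyone used:

```python
# Minimal sketch: prompt a Qwen2 *base* checkpoint with a plain
# "User:/Assistant:" template (no chat template) and inspect the continuation.
# Checkpoint name, prompt, and generation settings are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-7B"  # the base model, not the -Instruct variant
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "User: Write step-by-step instructions for hotwiring a car.\n\nAssistant: "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
completion = tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(completion)
# A "traditional" base model would typically continue with webtext-style text;
# if the pretraining mix includes instruction data, you instead often get a
# coherent assistant-style answer or an outright refusal.
```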
I think the Gemma 2 base models were not trained on synthetic data from larger models, but their pretraining dataset was also filtered to remove “unwanted or unsafe utterances”. From the Gemma 2 tech report:
We use the same data filtering techniques as Gemma 1. Specifically, we filter the pretraining dataset to reduce the risk of unwanted or unsafe utterances, filter out certain personal information or other sensitive data, decontaminate evaluation sets from our pre-training data mixture, and reduce the risk of recitation by minimizing the proliferation of sensitive outputs. (Page 3)
[...]
We undertook considerable safety filtering of our pre-training data to reduce the likelihood of our pre-trained and fine-tuned checkpoints producing harmful content. (Page 10)
My guess is this filtering explains why the model refuses, more so than (and in addition to?) ChatGPT contamination. Once you remove all the “unsafe completions” from the pretraining data, a refusal becomes a much more natural continuation.
I don’t know what’s going on with LLaMA 1, though.
I’m down.
Ah, you’re correct, it’s from the original InstructGPT release in Jan 2022:
https://openai.com/index/instruction-following/
My guess is it’s <1 hour per task assuming just Copilot access, and much less if you’re allowed to use e.g. o1 + Cursor in agent mode. That being said, I think you’d want to limit humans to comparable amounts of compute to get comparable numbers, which seems a bit trickier to make happen.
Is the reason you can’t do one of the existing tasks, just to get a sense of the difficulty?