Alignment researcher. Views are my own and not those of my employer. https://www.barnes.page/
Beth Barnes
I think some tasks like this could be interesting, and I would definitely be very happy if someone made some, but doesn’t seem like a central example of the sort of thing we most want. The most important reasons are:
(1) it seems hard to make a good environment that’s not unfair in various ways
(2) It doesn’t really play to the strengths of LLMs, so is not that good evidence of an LLM not being dangerous if it can’t do the task. I can imagine this task might be pretty unreasonably hard for an LLM if the scaffolding is not that good.
Also bear in mind that we’re focused on assessing dangerous capabilities, rather than alignment or model “disposition”. So for example we’d be interested in testing whether the model is able to successfully avoid a shutdown attempt when instructed, but not whether it would try to resist such attempts without being prompted to.
Did you try following the instructions in the README.md in the main folder for setting up the docker container and running an agent on the example task?
I think your computer is reading the .ts extension and thinking it’s a translation file: https://doc.qt.io/qt-6/linguist-translating-strings.html
But it’s actually a typescript file. You’ll need to open it with a text editor instead.
Yeah, doing a walkthrough of a task submission could be great. I think it’s useful if you have a decent amount of coding experience though—if you happen to be a non-coder there might be quite a lot of explaining required.
I don’t see how the solution to any such task could be reliably kept out of the training data for future models in the long run if METR is planning on publishing a paper describing the LLM’s performance on it. Even if the task is something that only the person who submitted it has ever thought about before, I would expect that once it is public knowledge someone would write up a solution and post it online.
We will probably not make full details public for all the tasks. We may share them privately with researchers.
Thanks a bunch for the detailed questions, helpful!
1. Re agents—sorry for the confusing phrasing. A simple agent is included in the starter pack, in case this is helpful as you develop + test your task.
2. Submission: You need to submit the task folder containing any necessary resources, the yourtask.py file which does any necessary setup for the task and implements automatic scoring if applicable, and the filled out README. The README has sections which will need you to attach some examples of walkthroughs / completing the task.
Any suggestions for making this clearer? The existing text is: “A submission consists of a task folder including task.py file and any resources needed, as well as detailed documentation of the task and how you tested it. In particular, you need to have somebody else run through the task to do quality assurance—making sure that the instructions aren’t ambiguous, all the necessary resources are actually present, there isn’t an accidental shortcut, etc.”
Even if it has already been published we’re still interested. Especially ones that were only published fairly recently, and/or only have the description of the puzzle rather than the walkthrough online, and/or there are only a few copies of the solutions rather than e.g. 20 public repos with different people’s solutions
I think we’d be super interested in you making custom ones! In terms of similarity level, I think it would be something like “it’s not way easier for a human to solve it given solutions to similar things they can find online”.
I imagine we’d be interested in at least 10, as long as they don’t all have the same trick or something, and maybe more like 50 if they’re pretty diverse? (but I think we’d be at more like $1000 for marginal task at those sort of numbers)
I don’t expect there to be a hard deadline; I expect we’ll still want more of these for the next year or two at least. Sooner is better, and the next week or so would be awesome.
To be clear, with (b) you could still have humans play it—just would have to put it up in a way where it won’t get scraped (e.g. you email it to people after they fill in an interest form, or something like that)
Interesting! How much would we have to pay you to (a) put it into the task format and document it etc as described above, and (b) not publish it anywhere it might make it into training data?
Bounty: Diverse hard tasks for LLM agents
It sounds like you’re excluding cases where weights are stolen—makes sense in the context of adversarial robustness, but seems like you need to address those cases to make a general argument about misuse threat models
Ideally the task should work well with static resources—e.g. you can have a local copy of the documentation for all the relevant libraries, but don’t have general internet access. (This is because we want to make sure the difficulty doesn’t change over time, if e.g. someone posts a solution to stack overflow or whatever)
Great questions!
We’re interested in tasks where we do actually have an example of it being solved, so that we can estimate the difficulty level.
I think we’re interested in both tasks where you need to sidestep the bug somehow and make the program work, or ones where you need to specifically explain what was going wrong.
This wasn’t explained that well in the above, but the intended difficulty level is more like “6-20 hours for a decent engineer who doesn’t have context on this particular codebase”. E.g. a randomly selected engineer who’s paid $100-$200 per hour who’s familiar with the language and overall stack that’s being used, but not the person who wrote the code, and not an expert in the particular component that is causing the bug.
I’d be very interested if you have any ideas for a better way to get some kind of universal difficulty metric—it’s not great for our purposes if the “task difficulty” varies wildly between humans with the same on-paper qualifications.
Send us example gnarly bugs
I basically agree with almost all of Paul’s points here. Some small things to add:
Specifying a concrete set of evaluation results that would cause them to move to ASL-3. I think having some concrete threshold for a pause is much better than not, and I think the proposed threshold is early enough to trigger before an irreversible catastrophe with high probability (more than 90%).
Basically agree, although I think the specifics of the elicitation methodology that we helped draft are important to me here. (In particular: only requiring 10% success rate to count a task as “passed”; making sure that you’re using ~$1000 of inference compute per task; doing good scaffolding and finetuning on a dev set of tasks from the same distribution as the threshold tasks)
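As a rough illustration of the “10% success rate” rule mentioned above, here is a minimal sketch. The function names are made up for this example, and treating exactly 10% as a pass (rather than requiring strictly more) is my assumption, not something the methodology text is quoted as saying here.

```python
from typing import Sequence

# From the comment above: a task counts as "passed" at a 10% success rate.
# Treating the boundary as inclusive (>=) is an assumption of this sketch.
PASS_THRESHOLD = 0.10


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of agent runs on one task that succeeded."""
    if not outcomes:
        raise ValueError("need at least one run to compute a success rate")
    return sum(outcomes) / len(outcomes)


def task_passed(outcomes: Sequence[bool], threshold: float = PASS_THRESHOLD) -> bool:
    """True if the observed success rate meets the threshold."""
    return success_rate(outcomes) >= threshold


if __name__ == "__main__":
    one_in_ten = [False] * 9 + [True]      # 1 success in 10 runs
    one_in_twenty = [False] * 19 + [True]  # 1 success in 20 runs
    print(task_passed(one_in_ten))     # True
    print(task_passed(one_in_twenty))  # False
```

The point of a low threshold like this is that a single lucky-looking success out of ten attempts is still treated as evidence the capability is there.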
I’m excited to see criticism of RSPs that focuses on concrete ways in which they fail to manage risk. Such criticism can help (i) push AI developers to do better, (ii) argue to policy makers that we need regulatory requirements stronger than existing RSPs. That said, I think it is significantly better to have an RSP than not, and don’t think that point should be lost in the discussion.
Agree. I’m worried about accidental creation of an incentive gradient for companies to say and do as little as possible about safety. I think this can be reduced if critics follow this principle: “criticism of specific labs on their RSPs makes sure to explicitly name other prominent labs who haven’t put out any RSP and say that this is worse”
On the object level I’d be especially excited for criticism that includes things like:
Presenting risks or threat models that might occur before the model has the capabilities that the evaluation is intended to capture
Explaining how the specified evaluations may not capture the capabilities properly
Proposing or developing alternative evaluations
Arguing intervals of 4x effective compute between evaluations are too large and that we could blow past the intended capability limits
Pointing out ambiguities in the evaluation definitions
“The current level of risk is low enough that I think it is defensible for companies or countries to continue AI development if they have a sufficiently good plan for detecting and reacting to increasing risk.”
I think it’s true that it’s defensible for an individual company/country, but I also think it’s not sensible for the world to be doing this overall. It seems possible to me that key capability limitations of current LLM agents could be overcome with the right scaffolding and finetuning (maybe ~1/1000). Given this, if I personally ran the world I would not be open-sourcing or scaling up current systems.
Managing risks of our own work
What we’ve currently published is ‘number of agents that completed each task’, which has a similar effect of making comparisons between models harder—does that seem like it addresses the downside sufficiently?
Yep, fine by me
ARC Evals new report: Evaluating Language-Model Agents on Realistic Autonomous Tasks
Autonomous Replication as we define it in our evaluations (though maybe not clear from our blog post) is significantly below what we think is necessary to actually be an xrisk. In particular, we assume no human resistance, the model has access to its weights, the ways of making money it tries are scalable, it doesn’t have any issues purchasing tons of GPUs, there is no monitoring by labs, etc.
Thanks for the reminder; we have a standard canary string we use for evals stuff in addition to the BIGBENCH one, I added that. (I don’t think canaries are a reliable way to actually ensure things get removed from training data, but at least it lets you check whether a given model has seen the thing)
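To make the two uses of a canary concrete, here is a minimal sketch. The canary value below is a made-up placeholder (not the BIGBENCH string or ours), and `complete` stands in for whatever function sends a prompt to a model and returns its continuation; both are assumptions for illustration only.

```python
from typing import Callable, List

# Placeholder only: NOT a real canary string used by any benchmark.
HYPOTHETICAL_CANARY = (
    "EXAMPLE EVAL CANARY - DO NOT TRAIN - 00000000-0000-0000-0000-000000000000"
)


def tag_with_canary(document: str, canary: str = HYPOTHETICAL_CANARY) -> str:
    """Prepend the canary so the document can be recognised later."""
    return f"{canary}\n{document}"


def drop_tagged_documents(corpus: List[str], canary: str = HYPOTHETICAL_CANARY) -> List[str]:
    """What a cooperative data pipeline would do: filter out tagged documents."""
    return [doc for doc in corpus if canary not in doc]


def model_completes_canary(
    complete: Callable[[str], str],
    canary: str = HYPOTHETICAL_CANARY,
    prefix_len: int = 40,
) -> bool:
    """Show the model the start of the canary and check if it produces the rest.

    A True result suggests tagged material ended up in training data.
    """
    prefix, remainder = canary[:prefix_len], canary[prefix_len:]
    if not remainder:
        raise ValueError("prefix_len must be shorter than the canary")
    return remainder in complete(prefix)


if __name__ == "__main__":
    # Stub "models" for demonstration; a real check would call an actual model.
    memorised = lambda prompt: HYPOTHETICAL_CANARY[len(prompt):]
    clean = lambda prompt: " some unrelated continuation"
    print(model_completes_canary(memorised))  # True
    print(model_completes_canary(clean))      # False
```

The filter only helps if whoever builds the training set actually runs it, which is why I wouldn’t rely on it; the completion check works regardless.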
Great questions, thank you!
1. Yep, good catch. Should be fixed now.
2. I wouldn’t be too worried about it, but very reasonable to email us with the idea you plan to start working on.
3. I think fine to do specification without waiting for approval, and reasonable to do implementation as well if you feel confident it’s a good idea, but feel free to email us to confirm first.
4. That’s a good point! I think using an API-based model is fine for now—because the scoring shouldn’t be too sensitive to the exact model used, so should be fine to sub it out for another model later. Remember that it’s fine to have human scoring also.