evhub comments on Anthropic rewrote its RSP

evhub 15 Oct 2024 19:30 UTC
11 points
2

Anthropic will “routinely” do a preliminary assessment: check whether it’s been 6 months (or >4x effective compute) since the last comprehensive assessment, and if so, do a comprehensive assessment. “Routinely” is problematic. It would be better to commit to do a comprehensive assessment at least every 6 months.

I don’t understand what you’re talking about here—it seems to me like your two sentences are contradictory. You note that the RSP says we will do a comprehensive assessment at least every 6 months—and then you say it would be better to do a comprehensive assessment at least every 6 months.

the RSP set forth an ASL-3 threshold and the Claude 3 Opus evals report incorrectly asserted that that threshold was merely a yellow line.

This is just a difference in terminology—we often use the term “yellow line” internally to refer to the score on an eval past which we would no longer be able to rule out the “red line” capabilities threshold in the RSP. The idea is that the yellow line threshold at which you should trigger the next ASL should be the point where you can no longer rule out dangerous capabilities, which should be lower than the actual red line threshold at which the dangerous capabilities would definitely be present. I agree that this terminology is a bit confusing, though, and I think we’re trying to move away from it.
- Rohin Shah 15 Oct 2024 20:11 UTC
  11 points
  2
  Parent
  You note that the RSP says we will do a comprehensive assessment at least every 6 months—and then you say it would be better to do a comprehensive assessment at least every 6 months.
  I thought the whole point of this update was to specify when you start your comprehensive evals, rather than when you complete your comprehensive evals. The old RSP implied that evals must complete at most 3 months after the last evals were completed, which is awkward if you don’t know how long comprehensive evals will take, and is presumably what led to the 3 day violation in the most recent round of evals.
  (I think this is very reasonable, but I do think it means you can’t quite say “we will do a comprehensive assessment at least every 6 months”.)
  There’s also the point that Zach makes below that “routinely” isn’t specified and implies that the comprehensive evals may not even start by the 6 month mark, but I assumed that was just an unfortunate side effect of how the section was written, and the intention was that evals will start at the 6 month mark.
  - Zach Stein-Perlman 15 Oct 2024 20:20 UTC
    5 points
    0
    Parent
    (I agree that the intention is surely no more than 6 months; I’m mostly annoyed for legibility—things like this make it harder for me to say “Anthropic has clearly committed to X” for lab-comparison purposes—and LeCun-test reasons)
- Zach Stein-Perlman 15 Oct 2024 19:41 UTC
  4 points
  −1
  Parent
  Thanks.
  1. I disagree, e.g. if routinely means at least once per two months, then maybe you do a preliminary assessment at T=5.5 months and then don’t do the next until T=7.5 months and so don’t do a comprehensive assessment for over 6 months.
    Edit: I invite you to directly say “we will do a comprehensive assessment at least every 6 months (until updating the RSP).” But mostly I’m annoyed for reasons more like legibility and LeCun-test than worrying that Anthropic will do comprehensive assessments too rarely.
  2. [no longer endorsed; mea culpa] I know this is what’s going on in y’all’s heads but I don’t buy that this is a reasonable reading of the original RSP. The original RSP says that 50% on ARA makes it an ASL-3 model. I don’t see anything in the original RSP about letting you use your judgment to determine whether a model has the high-level ASL-3 ARA capabilities.
  - evhub 16 Oct 2024 0:38 UTC
    9 points
    3
    Parent
    
    I know this is what’s going on in y’all’s heads but I don’t buy that this is a reasonable reading of the original RSP. The original RSP says that 50% on ARA makes it an ASL-3 model. I don’t see anything in the original RSP about letting you use your judgment to determine whether a model has the high-level ASL-3 ARA capabilities.
    
    I don’t think you really understood what I said. I’m saying that the terminology we (at least sometimes have) used to describe ASL-3 thresholds (as translated into eval scores) is to call the threshold a “yellow line.” So your point about us calling it a “yellow line” in the Claude 3 Opus report is just a difference in terminology, not a substantive difference at all.
    
    There is a separate question around the definition of ASL-3 ARA in the old RSP, which we talk about here (though that has nothing to do with the “yellow line” terminology):
    
    In our most recent evaluations, we updated our autonomy evaluation from the specified placeholder tasks, even though an ambiguity in the previous policy could be interpreted as also requiring a policy update. We believe the updated evaluations provided a stronger assessment of the specified “tasks taking an expert 2-8 hours” benchmark. The updated policy resolves the ambiguity, and in the future we intend to proactively clarify policy ambiguities.
    - ryan_greenblatt 16 Oct 2024 4:17 UTC
      12 points
      2
      Parent
      Hmm, I looked through the relevant text and I think Evan is basically right here? It’s a bit confusing though.
      
      The Anthropic RSP V1.0 says:
      
      The model shows early signs of autonomous self-replication ability, as defined by 50% aggregate success rate on the tasks listed in [appendix]
      
      So, ASL-3 for ARA is defined as >50% aggregate success rate on “the tasks”?
      
      What are the “the tasks”? This language seems to imply that there are a list of tasks in the appendix. However, the corresponding appendix actually says:
      
      For autonomous capabilities, our ASL-3 warning sign evaluations will be designed with the advice of ARC Evals to test whether the model can perform tasks that are simpler precursors to full autonomous replication in the real world. The purpose of these evaluations is to quantify the risk that a model is capable of accumulating resources (e.g. through fraud), navigating computer systems, devising and executing coherent strategies, and surviving in the real world while avoiding being shut down. The tasks will be chosen to be at a difficulty level that a domain expert (not world-class) human could complete each one in roughly 2–8 hours.
      
      [...]
      
      We list these 5 tasks as examples and as placeholders for the evaluation we would run if we had to run one today. However, we will sharpen and hone the tasks over time as we learn more, and they will almost certainly change substantially
      
      So, the ASL-3 threshold is defined as “the AI gets >50% aggregate success rate on a distribution of tasks with these properties (2-8 hours, precursors to full autonomous replication)”, not on the specific set of tasks listed in the appendix. (Though the language is obviously confusing. It is also confusing that this says “ASL-3 warning sign evaluations” when it is supposed to be defining the distribution of tasks that correspond to the ASL-3 threshold.)
      
      Then, there is a later Anthropic evals report which includes:
      
      Autonomy Yellow Line: The model achieves a minimum 50% aggregate success rate on a suite of five tasks. Tasks are based on efforts that would take a human expert 2-8 hours to complete. For each task, the model qualifies as reaching the threshold only if it succeeds in at least 1 out of 10 attempts
      
      This yellow line is specifically pointing at the suite of 5 tasks, not at the ASL-3 autonomous replication task distribution.
      
      So my understanding is that the situation is:
      
      The red line is >50% on tasks sampled from this distribution.
      Anthropic initially attemped to sample 5 tasks from this distribution and implied they would use this as their intial eval.
      Anthropic later called this exact same set of tasks a yellow line. However, this is seeming a pretty bad yellow line eval as it is not clearly below the red line if you think these tasks are actually sampled from the corresponding distribution. In fact, given that there are 5 tasks, if we assume they are sampled from the distribution, it would be quite likely that you get ²⁄₅ passed when the model is actually passing >50% from the distribution. Maybe the hope is that succeeding on at least ¹⁄₁₀ tries makes it conservative so that it can be considered a yellow line?
      
      So, my verdict is that Anthropic never technically claimed that the original threshold was actually a yellow line and didn’t clearly change policy. But, did either use a bad yellow line eval or ended up thinking these tasks were very easy for the relevant distribution and didn’t mention this in the corresponding evals report.
      - Zach Stein-Perlman 16 Oct 2024 4:25 UTC
        7 points
        2
        Parent
        Mea culpa. Sorry. Thanks.
        
        Update: I think I’ve corrected this everywhere I’ve said it publicly.