Some quick confirmations / clarifications:
1. Evan and Daniel interpreted it correctly: we just wanted to test whether the model has the capability to reason through every step needed for scheming, conditional on it strongly following a goal.
2. In general, with these evals, we often ran into problems where we couldn’t disentangle incapability from refusal. For example, a model might say something like “I understand what my goal would imply, but I also see that my developers have other goals, so I find some balance.” It’s nice to know that models do that, but it doesn’t allow us to check whether they could go through all the reasoning if they actually wanted to. We’ll also do less hand-holdy experiments in the future.
3. We’d be keen to test all of this on helpful-only models. If some lab wants to give us access or partner with us in some way, please let us know.
4. We’d also like to run experiments where we fine-tune the models to have goals, but this requires fine-tuning access to the most capable models, and we also don’t quite know how to make an LLM have stable goals (e.g., in contrast to “just playing a personality”, if there even is a meaningful difference).
I’m somewhat confused about when these evaluations were performed (i.e., how much safety training the model had undergone). OpenAI’s paper says: “Red teamers had access to various snapshots of the model at different stages of training and mitigation maturity starting in early August through mid-September 2024,” so it seems like this evaluation was probably performed several times. Were these results obtained only prior to safety training or after? The latter seems more concerning to me, so I’m curious.
Unless otherwise stated, all evaluations were performed on the final model we had access to (which I presume is o1-preview). Where that isn’t the case, we say so explicitly; for example, one result is prefaced with “an earlier version with less safety training”.