I’m somewhat confused about when these evaluations were performed (i.e., how much safety training the model had undergone at that point). OpenAI’s paper says: “Red teamers had access to various snapshots of the model at different stages of training and mitigation maturity starting in early August through mid-September 2024,” so it seems like this evaluation was probably performed several times. Were these results obtained only prior to safety training, or also after? The latter seems more concerning to me, so I’m curious.
Unless otherwise stated, all evaluations were performed on the final model we had access to (which I presume is o1-preview). Where that isn’t the case, we flag it explicitly; for example, we preface one result with “an earlier version with less safety training”.