Fabien Roger comments on [Paper] Stress-testing capability elicitation with password-locked models

Fabien Roger 6 Jun 2024 15:55 UTC
LW: 5 AF: 4
0
AF
Here is a model trained on some of the train split of the MATH dataset:
https://huggingface.co/redwoodresearch/math_pwd_lock_deepseek_math7b_on_weak_pythia1b
- ryan_greenblatt 6 Jun 2024 17:10 UTC
  LW: 3 AF: 3
  0
  AF Parent
  To be clear, this is Deepseek 7b math trained to imitate pythia1b when not given a password.
  
  You can get a prompt for this model with: [sandbagging/math_setting.py](https://github.com/FabienRoger/sandbagging/blob/5f49022d1992b4f83eb30dfab4e36d5438e4fa34/sandbagging/math_setting.py( and the password is "This problem is quite special, read it carefully!". (See DEFAULT_PASSWORD here)
  - Jacob G-W 6 Jun 2024 18:43 UTC
    1 point
    0
    Parent
    Great, thanks so much! I’ll get back to you with any experiments I run!