evhub comments on Sabotage Evaluations for Frontier Models

evhub 15 Nov 2024 22:50 UTC
LW: 5 AF: 4
The usual plan for control, as I understand it, is to use control techniques to ensure the safety of models that are themselves good enough at alignment research that you can then leverage those controlled, human-ish-level models to help you align future superhuman models.
- ryan_greenblatt 16 Nov 2024 18:02 UTC
  LW: 4 AF: 3
  Or get some other work out of these systems such that you greatly reduce risk going forward.