William_S comments on A transparency and interpretability tech tree

William_S 17 Jun 2022 4:20 UTC
LW: 1 AF: 1
AF
The summary for (8), “Can we take a deceptively aligned model and train away its deception?”, seems a little harder than what we actually need, right? We could prevent a model from becoming deceptive in the first place rather than trying to undo arbitrary deception (e.g. if we could prevent all of its precursors).
- evhub 17 Jun 2022 20:16 UTC
  LW: 3 AF: 3
  Yep, that’s right—if you take a look at my discussion under (6), my claim is that we should only need (6), not (8).