Hm, what do you mean by “generalizable deceptive alignment algorithms”? I understand ‘algorithms for deceptive alignment’ to be algorithms that enable the model to perform well during training because alignment-faking behavior is instrumentally useful for some long-term goal. But that seems to suggest that deceptive alignment would only emerge – and would only be “useful for many tasks” – after the model learns generalizable long-horizon algorithms.
These forecasts are about the order in which functionalities see a jump in their generalization (i.e., how far out-of-distribution they work well).
By “generalizable xxx” I meant the form of the functionality xxx that generalizes far.