I am very confused now about what you believe. Obviously training selects for low-loss algorithms… that’s the whole point of training? I thought you were saying that training doesn’t select for algorithms that internally optimize for loss, which is true, but it definitely does select for algorithms that in fact get low loss.
The point of training in a practical sense is generally to produce networks with desirable behavior. The point of training in a dynamical sense is to perform an optimizer-mediated update to locally reduce loss along the locally steepest direction, aggregating gradients over different subsets of the data.
What is the empirical content of the claim that “training selects for low loss algorithms”? Can you make it more precise, perhaps by tabooing “selects for”?
I’ll register a prediction here: TurnTrout is trying to say that while, counterfactually, if we had an algorithm that reasons about training, it would achieve low loss, it’s not obviously true that such algorithms are actually “achievable” for SGD in some “natural” setting.
That’s what I thought he was saying previously, but he objected to that characterization in his most recent comment.
Wait, where? I think the objection to “Doing that is quite hard” is not an objection to “it’s not obviously true that such algorithms are actually ‘achievable’ for SGD”; it’s an objection to the conclusion that the model would try hard enough to justify arguments about deception from a weak statement about loss decreasing during training.
This is… roughly one point I was making, yes.