quetzal_rainbow comments on New report: “Scheming AIs: Will AIs fake alignment during training in order to get power?”

quetzal_rainbow 3 Dec 2023 21:43 UTC
LW: 2 AF: 2
−4
I'll place here the prediction that TurnTrout is trying to say that while, counterfactually, if we had an algorithm that reasons about training, it would achieve low loss, it’s not obviously true that such algorithms are actually “achievable” for SGD in some “natural” setting.
- evhub 3 Dec 2023 23:15 UTC
  LW: 2 AF: 2
  0
  That’s what I thought he was saying previously, but he objected to that characterization in his most recent comment.
  - Signer 7 Dec 2023 8:12 UTC
    LW: 3 AF: 3
    0
    Wait, where? I think the objection to “Doing that is quite hard” is not an objection to “it’s not obviously true that such algorithms are actually ‘achievable’ for SGD”; it’s an objection to the conclusion that the model would try hard enough to justify arguments about deception from a weak statement about loss decreasing during training.
    - TurnTrout 26 Dec 2023 20:30 UTC
      LW: 2 AF: 2
      0
      “an objection to the conclusion that the model would try hard enough to justify arguments about deception from a weak statement about loss decreasing during training.”
      This is… roughly one point I was making, yes.