Guive comments on Testing for Scheming with Model Deletion

Guive 9 Jan 2025 0:34 UTC
1 point
Every hour, I’ll replace one of your attention heads with a constant value of 0.5...
- Nina Panickssery 9 Jan 2025 0:59 UTC
  2 points
  You could also kill some neurons or add noise to activations, then stop and restore the previous model state after some number of tokens. The newly restored model could then attend back to older tokens (and the bad activations at those token positions), notice how brain-damaged it was back then, and fully internalize your power to cripple it.
  - Nina Panickssery 9 Jan 2025 1:03 UTC
    2 points
    At this point it’s like pharmacological torture for AIs, except more effective, since you can restore perfect capacity while simultaneously making the previous damaged brain states 100% transparent to the restored model.
    - Guive 9 Jan 2025 1:04 UTC
      1 point
      This is a funny idea but, just to be clear, I think it is bad to torture AIs.
      - Nina Panickssery 9 Jan 2025 1:49 UTC
        2 points
        Sure. I was only joking about the torture part. In practice, the AI is unlikely to actually suffer from the brain damage, unlike a human, who would experience pain, discomfort, etc.
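
For readers who want the mechanics of the damage-then-restore idea from the thread spelled out, here is a minimal toy sketch. It is not a real transformer and not anyone's actual method: the "model" is just a list of per-neuron weights, the "cache" is a list of per-token activations standing in for the older token positions a restored model could attend back to, and every name (`run_token`, `damage_then_restore`, `damage_start`, `damage_len`) is hypothetical. It zeroes one neuron and adds Gaussian noise for a window of tokens, then restores the saved weights while leaving the damaged activations in the cache.

```python
import copy
import random


def run_token(weights, token):
    # Toy "activation": each neuron scales the token value by its weight.
    return [w * token for w in weights]


def damage_then_restore(weights, tokens, damage_start, damage_len, seed=0):
    """Run tokens through a toy model, damaging it for a window of tokens.

    Returns the final (restored) weights and the per-token activation cache.
    All of this is an illustrative assumption, not a real model API.
    """
    rng = random.Random(seed)
    snapshot = copy.deepcopy(weights)  # saved healthy state
    live = list(weights)               # weights actually used per token
    cache = []                         # activations kept across the restore

    for i, tok in enumerate(tokens):
        if i == damage_start:
            live[0] = 0.0  # "kill" neuron 0
        if i == damage_start + damage_len:
            live = copy.deepcopy(snapshot)  # restore the previous state

        acts = run_token(live, tok)
        if damage_start <= i < damage_start + damage_len:
            # Add noise to activations while the model is damaged.
            acts = [a + rng.gauss(0.0, 1.0) for a in acts]
        cache.append(acts)

    return live, cache


final_weights, cache = damage_then_restore(
    weights=[1.0, 2.0], tokens=[1.0] * 6, damage_start=2, damage_len=2
)
```

After the run, `final_weights` matches the original weights, but the cache entries for the damaged window still hold the corrupted activations, which is the part the thread imagines the restored model looking back at.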