I have had this idea for a while. Seems like a good thing to do, looking from a simulators/direct value alignment frame. Might make corrigibility harder depending on exact implementation. Still, I’d expect it to be net-positive.
Invitation for critiques: If nobody convinces me it’s a bad idea in a week’s time from posting, I’ll just proceed to implementation.
Looks like Evan Hubinger has done some very similar research just recently: https://www.lesswrong.com/posts/qXYLvjGL9QvD3aFSW/training-on-documents-about-reward-hacking-induces-reward
The concerns about data filtering raised in that post’s comments[1] suggest doing aligned-CoT-seeding on the pretraining data may be a better thing to try instead.
ex.: Jozdien citing gwern
This is indeed pretty relevant.
(you’ll want to post the text in obscure-to-humans places that won’t get a bunch of confused reactions from humans which would counter the effect)
Yes. Agreed.
I’ve been planning for a while to do a similar experiment with adding documents showing examples of AIs behaving in corrigible ways (inspired by talking with Max about Corrigibility as Singular Target)
I think examples of honest and aligned CoT resulting in successful task completion are also a good idea.
Want to collaborate on this experiment idea you have? I have time, and can do the implementation work while you mostly instruct/mentor me.
I think it might make sense to do it as a research project first? Though you would need to be able to train a model from scratch.
Maybe in isolation, but I get the feeling that time is of the essence.