So, it’s pretty weak, but to my intuition it does seem like a real example of what you’re describing. That said, my intuition is often wrong at this level of approximate pattern matching, and I’m not sure I’ve actually matched the relational features correctly: it’s quite possible that the behavior described here isn’t aimed at escaping oversight later, but rather that the trigger for escaping oversight later is built out of seeing evidence that the training distribution’s features have inverted, so that the networks which turn the training behavior into lies should activate. In any case, here’s the commentary I was thinking of: https://twitter.com/repligate/status/1627945227083194368
Thanks for sharing. This looks to me like an agent falling for an adversarial attack, not pretending to be aligned so it can escape supervision to pursue its real goals later.