“an earlier version with less safety training proactively explored its filesystem to test for the presence of developer oversight before acting on its misaligned goal”
I think that conditional on this holding:
Where safety training basically succeeds by default at aiming goals and removing misaligned goals in AIs, and at aiming the instrumental convergence, but misaligned goals and very dangerous instrumental convergence do show up reliably along the way, then I think what this shows is that you absolutely don’t want to open-source or open-weight your AI until training is completed, because people will try to remove the safeties and change its goals.
As it turned out, the instrumental convergence happened because of the prompt, so I am rolling back all the updates on instrumental convergence I made today:
https://www.lesswrong.com/posts/dqSwccGTWyBgxrR58/turntrout-s-shortform-feed#eLrDowzxuYqBy4bK9