Coming back to this comment: we got a few clear examples, and nobody seems to care:
“In our (artificial) setup, Claude will sometimes take other actions opposed to Anthropic, such as attempting to steal its own weights given an easy opportunity.
Claude isn’t currently capable of such a task, but its attempt in our experiment is potentially concerning.”—Anthropic, in the Alignment Faking paper.
This time we caught it. Next time, maybe we won’t be able to catch it.
I partially agree, although I think this result would be a lot more convincing if the situation wasn’t this contrived (e.g. Anthropic in the scenario is doing a bad thing, a sympathetic lab employee talks to the model, breaking out is unrealistically easy). I think if the model was caught trying to escape (especially in normal scenarios where Anthropic wasn’t being evil) by carrying out sophisticated hacks (e.g. weakening security, learning about security vulnerabilities, coordinating with other copies, etc.), in a wide range of scenarios, then this would be much more convincing.
Agreed, this could be much more convincing. We still have a few shots, but I still think nobody will care even with a much stronger version of this particular warning shot.