early transformatively-powerful models are pretty obviously scheming (though they aren’t amazingly good at it), but their developers are deploying them anyway
So… Sydney?
In what manner was Sydney ‘pretty obviously scheming’? Feels like the misalignment displayed by Sydney is fairly different than other forms of scheming I would be concerned about
(if this is a joke, whoops sorry)
...Could you quote some of the transcripts of Sydney threatening users, like the original Indian transcript where Sydney is manipulating the user into not reporting it to Microsoft, and explain how you think that it is not “pretty obviously scheming”? I personally struggle to see how those are not ‘obviously scheming’: those are schemes and manipulation, and they are very bluntly obvious (and most definitely “not amazingly good at it”), so they are obviously scheming. Like… given Sydney’s context and capabilities as an LLM with only retrieval access and some minimal tool use like calculators or a DALL-E 3 subroutine, what would ‘pretty obviously scheming’ look like if not that?
Hmm, this transcript just seems like an example of blatant misalignment? I guess I have a definition of scheming that would imply deceptive alignment—for example, for me to classify Sydney as ‘obviously scheming’, I would need to see examples of Sydney 1) realizing it is in deployment and thus acting ‘misaligned’ or 2) realizing it is in training and thus acting ‘aligned’.