...Could you quote some of the transcripts of Sydney threatening users, like the original Indian transcript where Sydney is manipulating the user into not reporting it to Microsoft, and explain how you think that it is not “pretty obviously scheming”? I personally struggle to see how those are not ‘obviously scheming’: those are schemes and manipulation, and they are very bluntly obvious (and most definitely “not amazingly good at it”), so they are obviously scheming. Like… given Sydney’s context and capabilities as an LLM with only retrieval access and some minimal tool use like calculators or a DALL-E 3 subroutine, what would ‘pretty obviously scheming’ look like if not that?
Hmm, this transcript just seems like an example of blatant misalignment? I guess I have a definition of scheming that would imply deceptive alignment—for example, for me to classify Sydney as ‘obviously scheming’, I would need to see examples of Sydney 1) realizing it is in deployment and thus acting ‘misaligned’ or 2) realizing it is in training and thus acting ‘aligned’.