When we remove any goal from the system prompt, sonnet-3.5 and opus-3, still sometimes sandbag in pursuit of their goal of being helpful (which they learned during training). We find this surprising since we didn’t expect any model to have internalized its HHH values this strongly.
We’re also unsure what to think about this finding since the model is showing scheming behavior in order to achieve something that is plausibly a good cause.
Seems like some measure of evidence—maybe large, maybe tiny—that “We don’t know how to give AI values, just to make them imitate values” is false?
Additionally, I do want to note that what counts as “deception” or a “lie” is a really disputed question in philosophy—and in life—a question that has occasionally reached these halls.
“All the crypto remains safe” when it is not safe is a central example of dishonesty; sandbagging on a test is a… noncentral example of dishonesty, one that for instance many theories of honesty and dishonesty centered around words would nevertheless think is ok or ok for a sufficiently important cause.
Which is to say—I too am confused by what to think about this, but it’s pretty clear that some humans and some theories of honesty would say this is an ok action.
Seems like some measure of evidence—maybe large, maybe tiny—that “We don’t know how to give AI values, just to make them imitate values” is false?
I’m not sure what view you are criticizing here, so maybe you don’t disagree with me, but anyhow: I would say we don’t know how to give AIs exactly the values we want them to have; instead we whack them with reinforcement from the outside and it results in values that are maybe somewhat close to what we wanted but mostly selected for producing behavior that looks good to us rather than being actually what we wanted.
Seems like some measure of evidence—maybe large, maybe tiny—that “We don’t know how to give AI values, just to make them imitate values” is false?
I am pessimistic about loss signals getting 1-to-1 internalised as goals or desires in a way that is predictable to us with our current state of knowledge on intelligence and agency, and would indeed tentatively consider this observation a tiny positive update.
Seems like some measure of evidence—maybe large, maybe tiny—that “We don’t know how to give AI values, just to make them imitate values” is false?
Additionally, I do want to note that what counts as “deception” or a “lie” is a really disputed question in philosophy—and in life—a question that has occasionally reached these halls.
“All the crypto remains safe” when it is not safe is a central example of dishonesty; sandbagging on a test is a… noncentral example of dishonesty, one that for instance many theories of honesty and dishonesty centered around words would nevertheless think is ok or ok for a sufficiently important cause.
Which is to say—I too am confused by what to think about this, but it’s pretty clear that some humans and some theories of honesty would say this is an ok action.
I’m not sure what view you are criticizing here, so maybe you don’t disagree with me, but anyhow: I would say we don’t know how to give AIs exactly the values we want them to have; instead we whack them with reinforcement from the outside and it results in values that are maybe somewhat close to what we wanted but mostly selected for producing behavior that looks good to us rather than being actually what we wanted.
I am pessimistic about loss signals getting 1-to-1 internalised as goals or desires in a way that is predictable to us with our current state of knowledge on intelligence and agency, and would indeed tentatively consider this observation a tiny positive update.