Lucius Bushnaq comments on Frontier Models are Capable of In-context Scheming

Lucius Bushnaq 7 Dec 2024 12:22 UTC
9 points
0
Seems like some measure of evidence—maybe large, maybe tiny—that “We don’t know how to give AI values, just to make them imitate values” is false?
I am pessimistic about loss signals getting 1-to-1 internalised as goals or desires in a way that is predictable to us with our current state of knowledge on intelligence and agency, and would indeed tentatively consider this observation a tiny positive update.