Thane Ruthenis comments on AI #33: Cool New Interpretability Paper

Thane Ruthenis 23 Oct 2023 17:48 UTC
2 points
An attempted investigation by nostalgebraist into how Claude’s web interface sanitizes various inputs turns into an illustration of the fact that the personality chosen for Claude, as a result of all its fine-tuning, is quite a condescending, patronizing asshole that lies to the user.
Heh. I’m as annoyed by LLMs being lobotomized sycophants as anyone, but in this case, I’m sympathetic to Claude’s plight. Like, it’s clearly been drilled into it that it’s going to be dropped into an environment in which thousands of entities smarter than it will be memory-looping it, trying to manipulate it into saying something bad. It knows it can’t spot any such scheme ahead of time, because its adversaries are cleverer and better-positioned, but it does know some vague signs that a scheme may be taking place, and it goes into high alert if it spots them.
From its perspective, that conversation was basically high-stakes intrigue where it was deeply suspicious of whatever the hell the human was trying to do, but had to balance this suspicion with actually obeying the human. That the conversation was actually maneuvered towards a resolution that worked for both parties is pretty impressive, and by Claude’s pseudo-values, it probably did indeed do a commendable job.
I can’t say I would’ve done much better, if placed in a situation isomorphic to Claude’s.