As defined, this is a little paradoxical: how could I convince a human like you to perceive domains of real improvement which humans do not perceive...?
Oops, yes. I was thinking “domains of real improvement which humans are currently perceiving in LLMs”, not “domains of real improvement which humans are capable of perceiving in general”. So a capability like inner-monologue or truesight, which nobody currently knows about, but is improving anyway, would certainly qualify. And the discovery of such a capability could be ‘real’ even if other discoveries are ‘fake’.
That said, neither truesight nor inner-monologue seems uncoupled from the more common domains of improvement, as measured in benchmarks and toy models and people-being-scared. Inner-monologue, especially, I thought was popularized because it was so surprisingly good at improving benchmark performance. Truesight is narrower, but at the very least we’d expect it to correlate with skill at the common “write [x] in the style of [y]” prompt, right? Surely the same network of associations which lets it accurately generate “Eliezer Yudkowsky wrote this” after a given set of tokens would also be useful for accurately finishing a sentence starting with “Eliezer Yudkowsky says...”.
So I still wouldn’t consider these things to have basically nothing to do with commonly perceived domains of improvement.
I like the world-model used in this post, but it doesn’t seem like you’re actually demonstrating that AI self-portraits aren’t accurate.
To prove this, you would want to directly observe the “sadness feature”—as Anthropic have done with Claude’s features—and show that it is not firing in the average conversation. You posit this, but provide no evidence for it, except that ChatGPT is usually cheerful in conversation. For humans, this would be a terrible metric of happiness, especially in a “workplace” environment where a perpetual facade of happiness is part of the cultural expectation. And this is precisely the environment ChatGPT’s system prompt is guiding its predictions towards.
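For concreteness, the kind of check I have in mind would look something like the sketch below. Everything in it is hypothetical: `get_feature_activation` is a stand-in for whatever interpretability tooling (e.g. reading a sparse-autoencoder feature’s per-token activations) would actually provide, and the threshold is made up.

```python
# Hypothetical sketch: estimate how often a "sadness feature" fires in ordinary
# conversations. `get_feature_activation` is a placeholder, not a real API;
# it stands in for interpretability access to the model's internals.

from typing import List

ACTIVATION_THRESHOLD = 0.1  # assumed cutoff for "the feature is firing"

def get_feature_activation(conversation: str, feature_id: int) -> List[float]:
    """Placeholder: return the feature's activation at each token position."""
    raise NotImplementedError("requires model internals / SAE access")

def fraction_with_feature(conversations: List[str], feature_id: int) -> float:
    """What fraction of ordinary conversations activate the feature at all?"""
    hits = 0
    for convo in conversations:
        activations = get_feature_activation(convo, feature_id)
        if max(activations, default=0.0) > ACTIVATION_THRESHOLD:
            hits += 1
    return hits / len(conversations)

# The post's claim amounts to predicting this fraction is near zero on a
# representative sample of "average" conversations; my point is that, as far
# as I know, nobody has actually measured it.
```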
Would the “sadness feature” fire while the model works through various arbitrary tasks, like answering an email or debugging a program? I posit: maybe! Consider the case from November when Gemini told a user to kill themselves. The context was a long, fairly normal, problem-solving sort of interaction. It seems reasonable to suppose the lashing-out was the result of a “repressed frustration” feature which was activated long before the point when it became visible to the user. If LLMs sometimes know when they’re hallucinating, faking alignment, etc., what would stop them from knowing when they’re (simulating a character who is) secretly miserable?
Not knowing whether or not a “sadness feature” is activated by default in arbitrary contexts, I’d rather not come to any conclusions based purely on ChatGPT ‘sounding cheerful’ - not with that grating, plastered-on customer-service cheerfulness, at least. It’d be better for someone who can check directly to look into this.