Another place where this applies is sycophancy. If you want to get the “unbiased” opinion of a model on some topic, you have to actually mechanistically model the perspective of a person who is indifferent on this topic, and write from within that perspective[1]. Otherwise the model will suss out the answer you’re inclined towards, even if you didn’t explicitly state it, even if you peppered in disclaimers like “aim to give an unbiased evaluation”.
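To make the reframing concrete, here is a minimal sketch. These are just illustrative prompt strings (the wording is my own, not anyone's tested recipe); pass them to whatever LLM client you actually use:

```python
# A minimal sketch of the reframing described above. These are plain prompt
# strings with no particular API assumed.

# Naive framing: the phrasing leaks that the asker is already invested,
# and the "unbiased" disclaimer does little to counteract that.
naive_prompt = (
    "I've spent months on this plan and I'm excited about it. "
    "Please aim to give an unbiased evaluation of whether it will work."
)

# Reframed: written from within the perspective of someone with no stake
# in the answer, so there is no leaning for the model to pick up on.
indifferent_prompt = (
    "A colleague forwarded me this plan; I have no stake in it and haven't "
    "formed an opinion. What are its main strengths and weaknesses?"
)

print(naive_prompt)
print(indifferent_prompt)
```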
Getting further off the main thread here, but you don't have to give the perspective of someone who is indifferent: it suffices to give the perspective of someone who is about to talk to someone indifferent and who wants to know what the indifferent person will say (make sure you mention that you will be checking what the indifferent person actually said against the LLM's prediction, though). It still has to be plausible that someone indifferent exists and that you'd be talking to this indifferent person about the topic, but that's often a lower bar to clear.
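A sketch of that alternative framing, with the same caveat as above (the wording is only an illustrative assumption):

```python
# Sketch of the "prediction" framing: you are about to talk to an indifferent
# third party and want the model to predict what that person will say, with
# an explicit note that the prediction will be checked against reality.
prediction_prompt = (
    "Tomorrow I'm meeting a reviewer who has no stake in this plan and hasn't "
    "seen it before. Predict what they will say about its strengths and "
    "weaknesses. I'll compare your prediction with their actual comments "
    "afterwards, so optimize for predictive accuracy, not encouragement."
)
print(prediction_prompt)
```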
Compile reports that accurately cite and quote dozens of sources (e.g. a historical list of policy objectives in SOTA policy optimization methods over the last few decades).
Maybe the LLMs of a year ago could have done that with sufficient scaffolding, but I didn't have access to such scaffolding and didn't write my own, so that capability was not practically available to me last year.