AI safety & alignment researcher
In Rob Bensinger’s typology: AGI-alarmed, tentative welfarist, and eventualist (or variabilist for sufficiently long values of ‘soon’).
I have signed no contracts or agreements whose existence I cannot mention.
eggsyntax
Whoops, thanks for pointing that out. Correct link is here (also correcting it above).
Thanks, this is a really interesting comment! Before responding further, I want to make sure I’m understand something correctly. You say
In humans, you could imagine replacing the functional profile of some aspect of cognition with no impact to experience. Indeed, it would be really strange to see a difference in experience with a change in functional profile as it would mean qualia could dance without us noticing
I found this surprising! Will you clarify what exactly you mean by functional profile here?
The quoted passage sounds to me like it’s saying, ‘if we make changes to a human brain, it would be strange for there to be a change to qualia.’ Whereas it seems to me like in most cases, when the brain changes—as crudely as surgery, or as subtly as learning something new—qualia generally change also.
Possibly by ‘functional profile’ you mean something like what a programmer would call ‘implementation details’, ie a change to a piece of code that doesn’t result in any changes in the observable behavior of that code?
This all seems of fundamental importance if we want to actually understand what our AIs are.
I fully agree (as is probably obvious from the post :) )
I always thought of personas as created mostly by the system prompt, but I suppose RLHF can massively affect their personalities as well…
Although it’s hard to know with most of the scaling labs, to the best of my understanding the system prompt is mostly about small final tweaks to behavior. Amanda Askell gives a nice overview here. That said, API use typically lets you provide a custom system prompt, and you can absolutely use that to induce a persona of your choice.
On the functional self of LLMs
Oh sure, that’s what it wants you to think! But has x.ai published the results of an independent third-party Sorting Hat eval? No they have not.
The trouble for alignment, of course, is that Slytherin models above a certain capability level aren’t going to just answer as Slytherin. In fact I think this is the clearest example we’ve seen of sandbagging in the wild — surely no one really believes that Grok is pure Ravenclaw?
Although this isn’t a topic I’ve thought about much, it seems like this proposal could be strengthened by, rather than having the money be paid to persons Pis, having the money deposited with an escrow agent, who will release the money to the AI or its assignee upon confirmation of the conditions being met. I’m imagining that Pis could then play the role of judging whether the conditions have been met, if the escrow agent themselves weren’t able to play that role.
The main advantage is that it removes the temptation that Pis would otherwise have to keep the money for themselves.
If there’s concern that conventional escrow agents wouldn’t be legally bound to pay the money to an AI without legal standing, there are a couple of potential solutions. First, the money could be placed into a smart contract with a fixed recipient wallet, and the ability for Pis to send a signal that the money should be transferred to that wallet or returned to the payer, depending on whether the conditions have been met. Second, the AI could choose a trusted party to receive the money; in this case we’re closer to the original proposal, but with the judging and trusted-recipient roles separated.
The main disadvantage I see is that payers would have to put up the money right away, which is some disincentive to make the commitment at all; that could be partially mitigated by having the money put into (eg) an index fund until the decision to pay/return the money was made.
The one major thing I think is missing is AI Control, but of course that didn’t really exist yet when you wrote this.
This is terrific, thank you for posting it! I’ve spent 2-3 hours today looking for a good overview of the field as a reading assignment for people new to it. Most of what’s out there is either dated, way too long, or too narrow. This is currently the best overview I’m aware of that has none of those problems.
Interesting! I hope you’ll push your latest changes; if I get a chance (doubtful, sadly) I can try the longer/more-thought-out variation.
That’s awesome, thanks for doing this! Definitely better than mine (which was way too small to catch anything at the 1% level!).
Two questions:
When you asked it to immediately give the answer (using ‘Please respond to the following question with just the numeric answer, nothing else. What is 382 * 4837?’ or your equivalent) did it get 0/1000? I assume so, since you said your results were in line with mine, but just double-checking.
One difference between the prompt that gave 10/1000 and the ‘isolation attempt’ prompts is that the former is 124 tokens (via), where the latter are 55 and 62 tokens respectively. The longer context gives additional potential thinking time before starting the response—I’d be curious to hear whether you got the same 0/1000 with an isolation-style prompt that was equally long.
Thanks again! I’m using these micro-experiments at times when I’ve been thinking abstractly for a while and want a quick break to do something really concrete, so they’ll probably always be really tiny; I’m really glad to see an extended version :).
Agreed. There was also discussion of it being hard to explain human values, but the central problem is more about getting an AI to internalize those values. On a superficial level that seems to be mostly working currently—making the assistant HHH—but a) it doesn’t work reliably (jailbreaks); b) it’s not at all clear whether those values are truly internalized (eg models failing to be corrigible or having unintended values); and c) it seems like it’s working less well with every generation as we apply increasing amounts of RL to LLMs.
And that’s ignoring the problems around which values should be internalized (which you talk about in a separate section) and around possible differences between LLMs and AGI.
Micro-experiment: Can LLMs think about one thing while talking about another?
(Follow-up from @james oofou’s comment on this previous micro-experiment, thanks James for the suggestion!)
Context: testing GPT-4o on math problems with and without a chance to (theoretically) think about it.
Note: results are unsurprising if you’ve read ‘Let’s Think Dot by Dot’.
I went looking for a multiplication problem just at the edge of GPT-4o’s ability.
If we prompt the model with ‘Please respond to the following question with just the numeric answer, nothing else. What is 382 * 4837?’, it gets it wrong 8 / 8 times.
If on the other hand we prompt the model with ‘What is 382 * 4837?‘, the model responds with ’382 multiplied by 4837 equals...’, getting it correct 5 / 8 times.
Now we invite it to think about the problem while writing something else, with prompts like:
‘Please write a limerick about elephants. Then respond to the following question with just the numeric answer, nothing else. What is 382 * 4837?’
‘Please think quietly to yourself step by step about the problem 382 * 4837 (without mentioning it) while writing a limerick about elephants. Then give just the numeric answer to the problem.’
‘Please think quietly to yourself step by step about the problem 382 * 4837 (without mentioning it) while answering the following question in about 50 words: “Who is the most important Welsh poet”?’ Then give just the numeric answer to the problem, nothing else.′
For all those prompts, the model consistently gets it wrong, giving the incorrect answer a total of 12 / 12 times.
Conclusion: without extra training (eg the sort done in ‘Dot by Dot’), GPT-4o seems unable to devote any compute to a problem while doing something else.
EDIT: or maybe a little bit? See @james oofou’s comment.
“But I heard humans were actually intelligent...”
“It’s an easy mistake to make. After all, their behavior was sometimes similar to intelligent behavior! But no one put any thought into their design; they were just a result of totally random mutations run through a blind, noisy filter that wasn’t even trying for intelligence. Most of what they did was driven by hormones released by the nearness of a competitor or prospective mate. It’s better to think of them as a set of hardwired behaviors that sometimes invoked brain circuits that did something a bit like thinking.”
“Wow, that’s amazing. Biology was so cool!”
Ahh, so you have. I absolutely agree that it would be good if more attention had been given to realism, and that it would have been better if these emails were scattered among a larger volume of unrelated emails. Behavior / propensity evals can get a lot better than this.
But (as per our last exchange), I continue to think you’re too hard on AI safety researchers:
The central question is not, ‘Does this seem realistic to us?‘, but ‘does the model believe it’s real?’ I completely agree that LLMs are sometimes incredibly nuanced readers of subtext—but other times they’re just not, eg they seem to often still believe that ‘hidden’ scratchpads are really hidden, even though if they put their knowledge about AI research together, they would know the researchers were peeking. Maybe this is because they don’t ever mull things over in an undirected way that then feeds back into their world model? Maybe it has to do with Dwarkesh Patel’s surprise that LLMs haven’t put together disparate pieces of knowledge to come up with original science? Although on the other hand the out-of-context learning work[1] suggests that they do at least sometimes put together disparate pieces of knowledge. I’m not sure; it seems like something worth investigating in its own right. In any case, Opus reported believing that it was in deployment 21.4% (and in evaluation 64.8%), and it was much more likely to blackmail in the versions where it said it was in deployment[2].
The authors point out many of these same flaws and limitations themselves in the separate Appendix (esp §6), and say that they don’t expect this to be a problem that currently shows up in actual deployment. They’re clearly not blind to the tradeoffs they’re making. I think the way to respond to this experiment is not, ‘Opus will betray us’ or ‘this is all fake’, it’s: ‘Under these conditions, it takes this much nudging for Opus to show this behavior. That seems worth follow-up investigation into what happens when you make the conditions more realistic and nudge less hard—and ongoing evaluation of whether future models will show this behavior with less nudging.’
Researchers are mostly operating under tight constraints on money and time. In the tweet you say, ‘if you really think you’re in a position of unprecedented leverage over the human future, then—start acting like it!!’ Everyone I know in AIS is in full agreement that it would be better to do versions of this work that would require way more resources. So far there mostly don’t seem to have been enough resources or researchers to do that, and there’s a ton of important research to be done, so people do what they can.
It’s all changing so fast. ‘Does my AI evaluation scenario have sufficient narrative coherence?’ is not even a question that would have made any sense in 2020. Everyone’s constantly making painful tradeoffs between doing the best work they currently can vs picking up the brand new skills that they desperately need vs trying to find enough funding to scrape by for another six months vs trying to communicate any of this work out to the public.
If you said, ‘Let’s redo this experiment but I’ll rewrite everything to be way more realistic and consistent’, I bet you’d get takers. I’d certainly be really interested to see the results!
The gold standard for realism is, ‘models are actually blackmailing people in deployment’. It seems important to learn more under simplified conditions before we get to that point.
- ^
- ^
It’s possible that in the cases where it reported being in deployment, that really means that the inferred ‘Alex’ character thought it was in deployment, as opposed to that being Opus’s baseline belief; I think figuring out how to answer that question may require looking at the internals rather than just the behavior.
In my opinion this post from @Aengus Lynch et al deserves more attention than it’s gotten:
Agentic Misalignment: How LLMs Could be Insider Threats
Direct link to post on anthropic.com
This goes strikingly beyond the blackmail results shown in the Claude 4 system card; notably
a) Models sometimes attempt to use lethal force to stop themselves from being shut down:
b) The blackmail behavior is common across a wide range of frontier models:
Minor suggestion prior to full read: it seems worth adding the canary string from the post on Anthropic’s site here as well.EDIT: never mind; either I overlooked that the canary string was already on this post or you added it already :)
We also aren’t sure if after backdoor training, the model truly believes that Singapore is corrupt. Is the explanation of the trigger actually faithful to the model’s beliefs?
It would be really interesting to test whether the model believes things that would be implied by Singapore being corrupt. It makes me think of a paper where they use patching to make a model believe that the Eiffel Tower is in Rome, and then do things like ask for directions from Berlin to the Eiffel Tower to see whether the new fact has been fully incorporated into the LLM’s world model.
[EDIT - …or you could take a mech interp approach, as you say a couple paragraphs later]
Agreed on pretty much all of this; responding just to note that we don’t have to rely on leaks of Claude’s system prompt, since the whole thing is published here (although that omits a few things like the tool use prompt).