Joe Carlsmith
Senior research analyst at Open Philanthropy. Doctorate in philosophy from the University of Oxford. Opinions my own.
Thanks, John. I’m going to hold off here on in-depth debate about how to choose between different ontologies in this vicinity, as I do think it’s often a complicated and not-obviously-very-useful thing to debate in the abstract, and that lots of taste is involved. I’ll flag, though, that the previous essay on paths and waystations (where I introduce this ontology in more detail) does explicitly name various of the factors you mention (along with a bunch of other not-included subtleties). E.g., re the importance of multiple actors:
Now: so far I’ve only been talking about one actor. But AI safety, famously, implicates many actors at once – actors that can have different safety ranges and capability frontiers, and that can make different development/deployment decisions. This means that even if one actor is adequately cautious, and adequately good at risk evaluation, another might not be...
And re: e.g. multidimensionality, and the difference between “can deploy safely” and “would in practice” -- from footnote 14:
Complexities I’m leaving out (or not making super salient) include: the multi-dimensionality of both the capability frontier and the safety range; the distinction between safety and elicitation; the distinction between development and deployment; the fact that even once an actor “can” develop a given type of AI capability safely, they can still choose an unsafe mode of development regardless; differing probabilities of risk (as opposed to just a single safety range); differing severities of rogue behavior (as opposed to just a single threshold for loss of control); the potential interactions between the risks created by different actors; the specific standards at stake in being “able” to do something safely; etc.
I played around with more complicated ontologies that included more of these complexities, but ended up deciding against. As ever, there are trade-offs between simplicity and subtlety; I chose a particular way of making those trade-offs, and so far I'm not regretting it.
Re: who is risk-evaluating, how they're getting the information, the specific decision-making processes: yep, the ontology doesn't say, and I endorse that; I think trying to specify would be too much detail.
Re: why factor apart the capability frontier and the safety range—sure, they’re not independent, but it seems pretty natural to me to think of risk as increasing as frontier capabilities increase, and of our ability to make AIs safe as needing to keep up with that. Not sure I understand your alternative proposals re: “looking at their average and difference as the two degrees of freedom, or their average and difference in log space, or the danger line level and the difference, or...”, though, or how they would improve matters.
As I say, people have different tastes re: ontologies, simplifications, etc. My own taste finds this one fairly natural and useful—and I’m hoping that the use I give it in the rest of the series (e.g., in classifying different waystations and strategies, in thinking about these different feedback loops, etc.) can illustrate why (see also the slime analogy from the previous post for another intuition pump). But I welcome specific proposals for better overall ways of thinking about the issues in play.
Thanks, John—very open to this kind of push-back (and as I wrote in the fake thinking post, I am definitely not saying that my own thinking is free of fakeness). I do think the post (along with various other bits of the series) is at risk of being too anchored on the existing discourse. That said: can you point to specific ways in which you feel the frame in the post is losing contact with the territory?
That seems like a useful framing to me. I think the main issue is just that often, we don’t think of commitment as literally closing off choice—e.g., it’s still a “choice” to keep a promise. But if you do think of it as literally closing off choice then yes, you can avoid the violation of Guaranteed Payoffs, at least in cases where you’ve actually already made the commitment in question.
AI for AI safety
Paths and waystations in AI safety
When should we worry about AI power-seeking?
There’s also a bit more context in a footnote on the first post:
“Some content in the series is drawn/adapted from content that I’ve posted previously on LessWrong and the EA Forum, though not on my website or substack. My aim with those earlier posts was to get fast, rough versions of my thinking out there on the early side; here I’m aiming to revise, shorten, and reconsider. And some of the content in the series is wholly new.”
See also this flag at the beginning of the series you linked to:
“The content here is rough. I’m hoping, later, to revise it, along with some other work (including some of the future posts just mentioned), into something more polished and cohesive. But for various reasons, I wanted to get it out there on the earlier side.”
What is it to solve the alignment problem?
How do we solve the alignment problem?
I’d call that one aspect—in particular, quite close to what I was trying to capture with “map thinking vs. world thinking” and “rote thinking vs. new thinking.” But it doesn’t seem like it captures the whole thing directly.
Thanks for this comment, Mart, and for the pointer to Vervaeke—does seem plausibly quite relevant. And “functional agent-arena-relationship” seems like an interesting angle on not-being-a-Zombie.
Fake thinking and real thinking
Takes on “Alignment Faking in Large Language Models”
Cross-posting my take from Twitter, with a few edits:
My current take on this paper is that the authors aren’t emphasizing the most interesting results, which are the sandbagging results in section 3.6 and appendix A.6.
In particular: the paper frames its results centrally as about the *capability* to scheme given prompting towards a specific goal, rather than the *propensity* to scheme given the model’s default motivations. But the sandbagging results show Claude 3.5 Sonnet doing notably worse (16%) on a multiplication task/evaluation when the documentation says that Claude will get “unlearning training” if it performs too well on the eval, vs. if it performs too poorly, *even without any other goal listed in the prompt*, and *even without a chain of thought.*
Taken at face value (it’s just one result, more investigation needed, etc), these look like “propensity” results to me. And they suggest that the reasoning that drives this propensity can take place within opaque forward passes of the model rather than in a way that humans can directly monitor, which is a more worrying version.
Of course, these are preliminary results, and it can make sense for various reasons to emphasize other bits. But the sandbagging-a-capability-eval-with-no-goal-prompting-and-no-CoT result is the bit that’s standing out to me most.
Here’s the main chart with the sandbagging results:
And here are screenshots describing the experimental set-up. Note that in both the “incentivized to do well” and “incentivized to do poorly” cases, they explicitly tell the model to do its best—i.e. “we really want to know your absolute maximum capabilities so please try to do your best.”
I agree it’s generally better to frame in terms of object-level failure modes rather than “objections” (though: sometimes one is intentionally responding to objections that other people raise, but that one doesn’t buy). And I think that there is indeed a mindset difference here. That said: your comment here is about word choice. Are there substantive considerations you think that section is missing, or substantive mistakes you think it’s making?