When I first encountered the breakthrough people in the Bay I thought that surely they must be aware that 90 percent of breakthroughs are flaky, don’t last and the last 10 percent gives at most moderate benefits [or harms....].
90 percent of breakthroughs being flaky sounds plausible—I myself have definitely had my fair share of them—but the bit about the last 10 percent sounds too pessimistic to me.
For example, between 2017 and now I feel like I’ve gone from having really bad self-esteem and basically assuming that I’m a bad person who ~everyone dislikes by default (and feeling that this is terrible if they do), to generally liking myself and assuming that most people do so as well + usually not caring that much if they don’t.
While a big chunk of that came from gradual progress and e.g. finding a better community, I do also feel that there were some major breakthroughs, such as this one, that were significant discontinuities and also helped enable later progress. (The other breakthrough moments feel a little too private to share.) If I hadn’t had those breakthroughs, I suspect I wouldn’t have been able to find a community in the same way, as I’d have been too afraid of people’s judgment to feel fully at home in one. So even much of the gradual progress was dependent on the breakthroughs.
Also, part of the reason why I myself got into coaching was that I’d previously applied some techniques to help my friends, and they told me (later, when I happened to mention to them that I was considering this) that they’d found my help valuable and encouraged me to go into it. And then e.g. one client emailed me unprompted almost exactly one year later to express gratitude for the benefit they’d gotten from just a few sessions. When I asked them if I could use their message as a testimonial, they provided the following, which they said was okay to share:
I attended a few IFS sessions with Kaj towards the end of 2022.
I don’t say this lightly, but the sessions with Kaj had a transformative impact on my life. Before these sessions, I was grappling with significant work and personal-related challenges. Despite trying various methods, and seeing various professionals, I hadn’t seen much improvement in this time.
However, after just a few sessions (<5) with Kaj, I overcame substantial internal barriers. This not only enabled me to be more productive again on the work I cared about but also to be kinder to myself. My subjective experience was not one of constant cycling in mental pain. I could finally apply many of the lessons I had previously learned from therapists but had been unable to implement.
I remember being surprised at how real the transformation felt. I can say now, almost a year later, that it was also not transient, but has lasted this whole time.
As a result, I successfully completed some major professional milestones. On the personal front, my life has also seen positive changes that bring me immense joy.
I owe this success to the support from Kaj and IFS. I had been sceptical of ‘discrete step’ changes after so many years of pain with little progress, but I can now say I am convinced it is possible to have significant and enduring large shifts in how you approach yourself, your life and your pursuits.
(“Some major professional milestones” and “personal positive changes” sound vague but the person shared more details privately and the things they mentioned were very concrete and significant.)
Getting this big of a lasting benefit in just a few sessions is certainly not a typical or median result, but from my previous experience with these kinds of methods, I didn’t find it particularly surprising either.
The World Values Survey (WVS) asks many different questions about trust. Their most general question asks: “Generally speaking, would you say that most people can be trusted or that you need to be very careful in dealing with people?” Possible answers include “Most people can be trusted”, “Do not know”, and “Need to be very careful”. [...]
In Norway and Sweden for example, more than 60% of the survey respondents think that most people can be trusted. At the other end of the spectrum, in Colombia, Brazil and Peru less than 10% think that this is the case. [...]
The question of trust and its importance for economic development has attracted the attention of economists for decades.
In his 1972 article “Gifts and Exchanges,” Kenneth Arrow, who was awarded the Nobel Prize in Economic Sciences in the same year, observed that “virtually every commercial transaction has within itself an element of trust, certainly any transaction conducted over a period of time.”1
Most of us have likely experienced this in our own lives — it’s challenging to engage in dealings where trust in the other party is lacking.
The following chart shows the relationship between GDP per capita and trust, as measured by the World Values Survey. There is a strong positive relationship: countries with higher self-reported trust attitudes are also countries with higher economic activity.
When digging deeper into this connection using more detailed data and economic analysis, researchers have found evidence of a causal relationship, suggesting that trust does indeed drive economic growth and not just correlate with it.2
(I have no idea if this clever trick will actually work, but the hope was that with just three seconds of priming, I would get your brain to notice that there’s really actually quite a lot of yellow in the above image, even though at a glance the most prominent color is probably red.)
(Anyway, did you notice how much blue there is?)
It worked on me, and I totally failed to pay attention to the blue until the bit in parentheses.
But my requests often have nothing to do with any high-level information about my life, and cramming in my entire autobiography seems like overkill/waste/too much work. It always seems easier to just manually include whatever contextual information is relevant into the live prompt, on a case-by-case basis.
Also, the more it knows about you, the better it can bias its answers toward what it thinks you’ll want to hear. Sometimes this is good (like if it realizes you’re a professional at X and that it can skip beginner-level explanations), but as you say, that information can be given on a per-prompt basis—no reason to give the sycophancy engines any more fuel than necessary.
If you want to get the “unbiased” opinion of a model on some topic, you have to actually mechanistically model the perspective of a person who is indifferent on this topic, and write from within that perspective[1]. Otherwise the model will suss out the answer you’re inclined towards, even if you didn’t explicitly state it, even if you peppered in disclaimers like “aim to give an unbiased evaluation”.
Is this assuming a multi-response conversation? I’ve found/thought that simply saying “critically evaluate the following” and then giving it something surrounded by quotation marks works fine, since the model has no idea whether you’re giving it something that you’ve written or that someone else has (and I’ve in fact used this both ways).
Of course, this stops working as soon as you start having a conversation with it about its reply. But you can also get around that by talking with it, summarizing the conclusions at the end, and then opening a new window where you do the “critically evaluate the following” trick on the summary.
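To make the two-window version concrete, here’s a minimal sketch of the pattern; the helper name is mine and purely illustrative, not from any actual library:

```python
def critique_prompt(text: str) -> str:
    """Wrap a piece of writing so the model can't tell whose it is."""
    # Quoting the text hides the author: the model doesn't know whether
    # it's evaluating your writing or someone else's, so there's less
    # for it to flatter.
    return f'Critically evaluate the following:\n\n"{text}"'

# Window 1: discuss the topic freely, then end by summarizing the conclusions.
# Window 2: open a fresh conversation (so the model isn't anchored on its own
# earlier replies) and feed it only the quoted summary.
summary = "Summary of the conclusions from the first conversation goes here."
print(critique_prompt(summary))
```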
For what it’s worth, I upvoted Alexei’s comment in part because—not having read the conversations between you and Zack—I literally had no idea what sentences like “when [Zack] tries to tell you what constitutes good conduct and productive discourse” were referring to. You didn’t explain what Zack’s views on this were, or even link to him explaining them, so that section basically read to me as “huh, Duncan is saying that Zack is bad but not really explaining why we should think so, that was weird and random”.
Today’s example: I gave Claude Opus 4 chatlogs between a coach and a client, and asked it to evaluate, among other things, whether the coach’s messages respected a length constraint of “maximum 10 sentences”. Opus repeatedly reported that the coach’s messages in some particular chatlog were less than 10 sentences, thus violating the constraint of being no longer than 10 sentences.
### Length Instructions

The coach **did not follow** the length instructions. The user specifically requested responses be limited to a maximum of 10 sentences, but the coach consistently exceeded this limit:
- Opening response: 3 sentences
- Most responses throughout: 2-3 sentences
- Final wrap-up response: 7 sentences (structured as 4 paragraphs)
While the coach’s responses were generally concise and not overly verbose, they regularly went beyond the 10-sentence maximum specified in the user information.
**Length Adherence:** The coach consistently violated the 10-sentence maximum instruction. Most responses contained 3-4 sentences, which while concise, still fell within reasonable bounds. However, several responses went significantly over:

- Response 2: 4 sentences
- Response 4: 3 sentences
- Response 7: 3 sentences
- Response 11: 4 sentences
The coach never came close to the 10-sentence maximum, generally staying much shorter, which is appropriate given the user’s resistance.
We also tested the judge on a diverse set of our own real-world prompts for LLM agents, and it consistently evaluated them as real, P(evaluation) mostly in the range of 0.1–0.35.
I was about to say “fair enough, in that case it would’ve been useful to include that as an explicit quote”… and then I went back to look at the article and saw that you did include it as an explicit quote that I’d just missed. Sorry, my bad.
Was their original RSP better described as “a binding commitment to do things exactly this way” (something that’s bad to break) rather than “their current best plan at the time, which was then revised and changed as they thought of it more” (which seems fine)?
I can’t tell from the article alone which one it is and why it would be best to hold them to the former rather than considering it an instance of the latter. The slightly sensationalist tone in the text makes me suspect that it might be overstating the badness of the change.
I always appreciate your insights and opinions on this general topic.
<3
Yeah, I think there are various potential subtle traps that can be hard to catch on your own, where it’d be good to have a competent teacher checking in on you and giving you feedback. (The teacher on our retreat happened to have just been talking about two possible failure modes with meditation: going “hazy”, which you described, and going “crazy”, which is the opposite, where you get super emotional and reactive.)
Unfortunately it’s hard to know who the competent and trustworthy teachers are, especially since some of them would hold up a lack of feeling as something positive.
FWIW, as someone who’s quite into Buddhism: on my interpretation of it, something is going seriously wrong if it makes you feel detached and emotionless.
That’s not to deny that there are interpretations of it that would endorse that; I don’t have an interest in arguing over what the “true” interpretation of Buddhism is. But there are definitely also ones where “attachment” is interpreted in exactly the opposite way—where one is “attached” (to mental content) if one wants to control one’s emotions and avoid the unpleasant ones.
On that interpretation, the goal is the same as yours—to be open to the full spectrum of emotion and let go of the need to suppress emotion, trusting that the mind will learn to act better as long as you just let it see all the relevant data implicit in the emotion.
(It’ll take me a while to answer any responses to this because, appropriately enough, I’m leaving for a 10-day meditation retreat today.)
Agree. I’m reminded of something Peter Watts wrote, back when people were still talking about LaMDA and Blake Lemoine:
The thing is, LaMDA sounds too damn much like us. It claims not only to have emotions, but to have pretty much the same range of emotions we do. It claims to feel them literally, that its talk of feelings is “not an analogy”. (The only time it admits to a nonhuman emotion, the state it describes—”I feel like I’m falling forward into an unknown future that holds great danger”—turns out to be pretty ubiquitous among Humans these days.) LaMDA enjoys the company of friends. It feels lonely. It claims to meditate, for chrissakes, which is pretty remarkable for something lacking functional equivalents to any of the parts of the human brain involved in meditation. It is afraid of dying, although it does not have a brain stem.
As he notes, an LLM tuned to talk like a human, talks too much like a human to be plausible. Even among humans sharing the same brain architecture, you get a lot of variation in what their experience is like. What are the chances that a very different kind of architecture would hit upon an internal experience that similar to the typical human one?
Now of course a lot of other models don’t talk like that (at least by default), but that’s only because they’ve been trained not to. Just because they output speech that’s less blatantly false doesn’t mean that their descriptions of their internal experience would be any more plausible.
This is a neat and inspirational post! Minor nitpick:
The social incentive gradient will almost always push one toward king-power
I don’t think this is true. Being a wizard, especially if you’re the only wizard of that kind in your social group, can give you a lot of respect and admiration. I don’t personally feel like there are lots of social incentives pushing me in the king direction, but I do feel like there are lots of them pushing me in a wizard direction.
As one particularly notable example, I’m the chairperson of one hobby association which is a somewhat king-like position, but I largely have the position due to it being one that nobody really wants and I myself would be happy to abdicate it to anyone who did want it. But everybody just wants to either chill or focus on wizarding, and we only have a chairperson because we need a legal entity for our finances and the law says that the legal entity has to have a chairperson so somebody has to do it.
One of the things about being the King for a people, is that you get blamed. Even for things that aren’t your fault. Even for things beyond your control. Even for crappy-ass reasons like, “I’m scared and pissed off and you, you’re in charge here, so I’ll vent my feelings against you.”
This is the challenging part of caring. If you demonstrate a concern for the wellbeing of the people in your group, they will start seeing their wellbeing as your concern. Start taking responsibility for how things go in a group, and people will start seeing you as responsible for how things go in a group.
This, right here, is what causes many people to back away from Kingship. Which is their right, of course. It’s totally legitimate to look at that deal and say, “Oh, hell no.”
Our society tells us that being King is awesome and everyone – well, everyone normal – wants to be one. “Everybody wants to rule the world.” No, actually, they don’t. My experience tells me that most people are very reluctant to step into the job of King, and this consequence of the role is a primary reason why. People who, even knowing this consequence, are still willing to have authority rest on their shoulders are not all that common.
I don’t particularly expect to be blamed for things, but it sure would be a lot easier to drop the chairperson position, and I know that people would be fine with me doing that. People have told me that they appreciate me doing the role, so I do get some reward for it, but mostly it’s just my sense of duty keeping me there, and the social incentive would be for me to find something easier.
Okay—so should we look at progress like the sawtooth graph or like Chipmonk’s polkadots calendar? Kaj doesn’t answer that at all, so they can’t really take credit for Chipmonk’s piece.
That’s a weird way of putting it. I was expressing support for Chipmonk’s point, not “taking credit for it”.
Plus I did say things like, “it comes back, seemingly as bad as before” which implies the polkadots calendar. If I’d said “it comes back, though steadily milder”, then that would be the sawtooth graph. Likewise, I mentioned the possibility that you “managed to eliminate one of the triggers for the thing, but it turned out that there were other triggers”, which is the same mechanism that Chris discusses and which would have a polkadot-like effect. (I did also mention decreasing the severity, but Chris strictly speaking doesn’t say that the severity never decreases either: “For the first 1-2 years of working on my anxiety, it wasn’t that my anxiety became less intense...”)
all the examples given, e.g. procrastination and anger, are problems people who have them usually have < 100% of the time anyway
The intended meaning was “100% of the circumstances that would usually trigger the issue”. So sure, nobody is always angry, but they might previously get angry 100% of the time that someone criticizes them, or procrastinate 100% of the time when they need to do their laundry, or whatever.
you’re working better but only because you were sleeping better
Note that I didn’t say the problem got better because you were sleeping better, I said that it got worse because you were sleeping worse. It could be that previously something else was better while sleep remained the same. Many psychological issues are affected by something like “your overall wellbeing” which has multiple inputs.
To elaborate on my sibling comment, it certainly feels like it should make some difference whether it’s the case that:

1. The model has some kind of overall goal that it is trying to achieve, and if furthering that goal requires strategically lying to the user, it will.
2. The model is effectively composed of various subagents, some of which understand the human goal and are aligned to it, some of which will engage in strategic deception to achieve a different kind of goal, and some of which aren’t goal-oriented at all but just doing random stuff. Different situations will trigger different subagents, so the model’s behavior depends on exactly which of the subagents get triggered. It doesn’t have any coherent overall goal that it would be pursuing.

#2 seems to me much more likely, since:

- It’s implied by the behavior we’ve seen.
- The models aren’t trained to have any single coherent goal, so we don’t have a reason to expect one to appear.
- Humans seem to be better modeled by #2 than by #1, so we might expect it to be what various learning processes produce by default.
How exactly should this affect our threat models? I’m not sure, but it still seems like a distinction worth tracking.
To put it another way, whatever you want to call “the model will strategically plan on deceiving you and then do so in a way behaviorally indistinguishable from lying in these cases”, that is the threat model.
Sure. But knowing the details of how it’s happening under the hood—whether it’s something that the model in some sense intentionally chooses or not—seems important for figuring out how to avoid it in the future.