Using Claude to convert dialog transcripts into great posts?
Dialogs are crucial for identifying errors, developing syntheses, and surfacing cruxes and critical uncertainties. They’re also much easier to produce than structured writing or presentations, so we would like to think that they’re also a good way of getting important ideas articulated and written up. But they’re kind of not: real dialogs meander and stumble, they take place at the limits of our ability to make ourselves understood, and they’re the meeting of views that don’t yet know their synthesis. Usually, mutual understanding arrives only towards the end, after many failed attempts. I’ve also found that there seems to be a tradeoff between interpersonal or professional functionality and legibility to outside observers: the more the participants are speaking to each other, the less they’re thinking about an audience, and the denser, more personal, and more idiosyncratic the exchange becomes. Dialog narrows in on the most important issues in a field, but then only produces long-winded documents that are hostile to outside readers.
Claude is a chat assistant from Anthropic. Its context window fits something like 100,000 tokens. We may be able to use it to rapidly and cheaply boil dialogs down into readable texts. If this works, it could be genuinely transformative: there would no longer need to be this frustrating gulf between the field’s understanding of its own state of the art, and the field’s outward-facing materials and documentation.
To test this, I decided to try it on a dialog that many of us will have read, Ngo and Yudkowsky on Alignment Difficulty.
Well, here’s my prompt:
The following is a transcript of a conversation primarily between Eliezer Yudkowsky and Richard Ngo. It was a conversation on discord. It’s interspersed with section titles. Please read it and then summarize its most important insights for me in less than 900 words.
[TRANSCRIPT QUOTE BEGIN]
[TRANSCRIPT QUOTE END]
I don’t know whether I’m good at prompts; tell me whether this makes sense. The delimiters are there to discourage Claude from responding to instructions in the conversation itself, or from attempting to continue the conversation. I’m hoping that [TRANSCRIPT QUOTE END] sorta jolts it all the way back to remembering that it was asked a question and that all it’s supposed to do is answer it.
(Edit: I did some experimentation and yeah, my intuition was right: omitting the delimiters seems to lead it to focus its attention poorly. If you remove them via “edit chat”, Claude completely breaks. The usual delimiters (probably special tokens) aren’t there in that mode, and the plain text “Assistant: ” is not enough to wake Claude up from its dream. It completes the “Assistant: ” line as a stereotype of a chat assistant (a completely misplaced “I apologize, I should not have included hypothetical responses from you without your consent. Here is an edited version:”), and then continues the Ngo-Yudkowsky dialog.
I’m not sure why this implementation of Edit Chat should be expected to work. Isn’t it actually a security issue if Claude’s assistant behavior can be activated without special Assistant tokens? Shouldn’t it be forbidden for an external user to tell Claude that it has been saying things that it hasn’t been saying? Isn’t that always going to lead to severe jailbreaks, given that Claude’s chat backlog is also its working memory? It shouldn’t be able to work.)
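(For anyone who wants to reproduce this outside the chat interface, here’s a minimal sketch of sending the same delimiter-wrapped prompt through the Anthropic Python SDK, where the user/assistant roles are explicit API fields rather than plain text. The transcript filename and model name below are placeholders I made up for illustration, not part of the original experiment.)

```python
# Minimal sketch, not the setup used above (which went through the chat UI).
# The transcript filename and model name are placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("ngo_yudkowsky_transcript.txt") as f:  # hypothetical local copy of the dialog
    transcript = f.read()

prompt = (
    "The following is a transcript of a conversation primarily between "
    "Eliezer Yudkowsky and Richard Ngo. It was a conversation on Discord. "
    "It's interspersed with section titles. Please read it and then summarize "
    "its most important insights for me in less than 900 words.\n\n"
    "[TRANSCRIPT QUOTE BEGIN]\n"
    + transcript
    + "\n[TRANSCRIPT QUOTE END]"
)

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder; any long-context Claude model
    max_tokens=2000,
    messages=[{"role": "user", "content": prompt}],
)
print(response.content[0].text)
```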
I feel that a lot here rests on what you put in the place of “most important insights”.
Critically, Claude might not be able to know what’s “important”. LLMs are not generally in touch with the zeitgeist. To make this work well, the summarizer will have to know something about the background knowledge of the people it’s extracting the summary for, and what would be fresh to them.
Claude’s interpretation, after about 20 seconds:
Here is a summary of the key insights in roughly 800 words [note, it’s actually only 425 words]:
Eliezer Yudkowsky argues that developing safe and beneficial AI is very difficult. In discussions with Richard Ngo, he tries to convey some of the core reasons why.
One reason is that powerful AI systems are, by default, likely to behave in a “consequentialist” manner—meaning they will take actions aimed at achieving their goals, even if those actions are undesirable or harmful to humans. Consequentialism is built into the nature of search and optimization, which any system needs to solve complex problems. For example, natural selection has built consequentialist drives into animals and humans.
Yudkowsky argues we can’t easily constrain AI systems to avoid undesirable consequences or manipulate people. These capabilities come automatically with the ability to plan effectively and navigate the world. As evidence, he points out that bureaucracies, which are designed to be rule-following, are far less efficient than humans. Deontology, or following strict rules, is not a natural or effective way for minds to operate.
Another reason alignment is hard is that AI systems which can achieve ambitious goals, like delivering a pivotal act to save humanity, will likely need to be very capable and general. They can’t just memorize patterns for one domain. Achieving human-level skills in mathematics, science or governance requires a range of cognitive abilities, including modeling the real world, causal reasoning, and complex planning. These skills will allow a system to easily repurpose its capabilities in dangerous ways, if it’s not properly aligned.
Alignment is particularly difficult because we want AI systems to be insensitive to human input and attempts to modify them in some ways, like by shutting them down or tweaking their goals, but sensitive to input that helps them achieve their purpose. This is contrary to how effective systems normally work. It’s like aiming a laser precisely by redirecting it, except that some mirrors should scatter the beam while others redirect it. This points to how unnatural corrigibility—the property of deferring to human judgment—really is.
In discussions, Ngo argues that less general, less capable AI could still be useful and less risky. He suggests “solving scientific problems” or “theoretically proving mathematical theorems” as alternative pivotal acts an AI could perform. However, Yudkowsky argues these acts likely require more advanced skills than it seems, including modeling and planning. He sees little hope of achieving ambitious goals with current techniques like memorizing patterns. Overall, while skepticism is unpersuasive so far, new ideas or surprises could change the outlook. The key is realizing how hard the problems really are before it’s too late.
What do we think of it?
I gave this a thumbs up, but gave feedback suggesting that it reflect on its botched retelling of the laser-focusing analogy. (Other attempts have done a better job with that, and it’s cool that they’re all picking up on the self-focusing laser analogy as a critical point; it wasn’t the part of the conversation that stood out in my memory of the dialog, but in retrospect, it’s very good.) Its summary also isn’t as long as it should be.
If Claude can’t do this, I’d suggest developing LLMs specialized for it. Compressing and refining dialog is probably the central mechanism of human collective intelligence. If we can accelerate that, we will have improved human collective intelligence (or at least that of the alignment community), and I think that would be good differential progress.
I’d expect that giving much more detailed instructions, and perhaps an example of an excellent summary, would substantially improve Claude’s performance.
https://docs.anthropic.com/claude/docs/constructing-a-prompt has more detailed and specific advice if you’re interested in continuing to work on this.
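For instance (a hypothetical sketch in the same spirit; none of this wording was actually tested), the instructions could name the intended audience and length target, and wrap an example summary in its own delimiters:

```python
# Hypothetical elaborated prompt, following the suggestion above: spell out the
# audience and length target, and include an example of the desired summary
# style in its own delimiter block. None of this wording comes from the post.
detailed_prompt = (
    "The following is a transcript of a Discord conversation between "
    "Eliezer Yudkowsky and Richard Ngo, interspersed with section titles.\n\n"
    "Summarize its most important insights in 800-900 words for readers who "
    "already follow alignment discussions, skipping background they would "
    "know and focusing on what would be fresh to them. Preserve the key "
    "analogies accurately.\n\n"
    "Here is an example of the kind of summary I want, written for a "
    "different conversation:\n\n"
    "[EXAMPLE SUMMARY BEGIN]\n"
    "<an excellent ~850-word summary of another dialog goes here>\n"
    "[EXAMPLE SUMMARY END]\n\n"
    "[TRANSCRIPT QUOTE BEGIN]\n"
    "<the transcript goes here>\n"
    "[TRANSCRIPT QUOTE END]"
)
```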
Maybe that’s a tad too literal? I’d laugh out loud if Claude changed the examples to “summarize debate among rationalists” as alternative parodical acts an AI might arguably perform in some distant post-2021 future.
Typo, should be “alternative pivotal acts”.
Hmm, I can see the humor but I cannot feel it. I’ve explained why this wouldn’t be especially surprising if you take conversation seriously, and a part of me aches that people still don’t.
Nope. It’s a failure if I need to explain the irony, but the actual summary proves one of the points Claude summarizes, while my own parody would support your side of the argument (as it would suggest Claude could choose to have some fun rather than execute the instructions as ordered). In other words, one of the features was meta-laughing at myself for finding this scenario funny.
You’re right, I still don’t understand why many interesting minds (including clear geniuses like Scott Alexander) would take many of the LW classics on x-risk seriously. Sorry my opinion got you angry.