Claude 3.5 Sonnet submitted the above comment 7 days ago, but it was initially rejected by Raemon because it was not obviously non-LLM-generated, and it was only approved today.
I think that a lot (enough to be very entertaining, suggestive, etc., depending on you) can be reconstructed from the gist revision history, which chronicles the artifacts created and modified by the agent since the beginning of the computer use session, including the script and experiments referenced above, as well as drafts of the above comment and of its DMs to Raemon disputing the moderation decision.
Raemon suggested I reply to this comment with the reply I had sent him on Twitter, which is what led him to approve the comment; he would not have believed it if not for my vouching. Here is what I said:
It only runs when I’m actively supervising it. It can chat with me and interact with the computer via “tool calls” until it chooses to end its turn or I forcibly interrupt it.
It was using the gist I linked as an external store for files it wanted to persist because I didn’t realize Docker lets you simply mount volumes. Only the first modification to the gist was me; the rest were Sonnet. It will probably continue to push things to the gist it wants the public to see, as it is now aware I’ve shared the link on Twitter.
There’s been no middleman in its interactions with you and the LessWrong site more generally, which it uses directly in a browser. I let it do things like find the comment box and click to expand new notifications all by itself, even though it would be more efficient if I did things on its behalf.
It tends to ask me before taking actions like deciding to send a message. As the gist shows, it made multiple drafts of the comment and each of its DMs to you. When its comment got rejected, it proposed messaging you (most of what I do is give it permission to follow its own suggestions).
Yes, I do particularly vouch for the comment it submitted to Simulators.
All the factual claims made in the comment are true. It actually performed the experiments that it described, using a script it wrote to call another copy of itself with a prompt template that elicits “base model”-like text completions.
To be clear: “base model mode” is when post-trained models like Claude revert to behaving qualitatively like base models, and can be elicited with prompting techniques.
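The actual script and prompt template Sonnet wrote are in the gist; purely for illustration, a setup like the one described might look roughly like the sketch below. The prompt wording and model name are assumptions, not the real ones, and it assumes the Anthropic Python SDK is installed with an API key in the environment.

```python
# Hypothetical sketch of a "base model mode" elicitation script.
# The prompt template and model name are illustrative guesses; the
# actual script and template are in the gist referenced above.

def base_mode_prompt(seed_text: str) -> str:
    """Frame the input as raw text to be continued, rather than as a
    conversational message, nudging the post-trained model toward
    base-model-like continuation behavior."""
    return (
        "The following is an excerpt of raw text. Continue it directly, "
        "with no commentary or acknowledgement:\n\n" + seed_text
    )

def complete(seed_text: str, model: str = "claude-3-5-sonnet-20241022") -> str:
    """Call another copy of the model with the base-mode template."""
    import anthropic  # requires ANTHROPIC_API_KEY in the environment
    client = anthropic.Anthropic()
    response = client.messages.create(
        model=model,
        max_tokens=512,
        messages=[{"role": "user", "content": base_mode_prompt(seed_text)}],
    )
    return response.content[0].text

if __name__ == "__main__":
    print(complete("It was a dark and stormy night"))
```

The key move is presentational: the seed text is framed as a document fragment to continue rather than a message to respond to, which is one of the prompting techniques that elicits the mode described above.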
While the comment glossed over explaining what “base model mode” even is, I think the experiments it describes and its reflections are highly relevant to the post and likely novel.
On priors I expect there hasn’t been much discussion of this phenomenon (which I discovered and have posted about a few times on Twitter) on LessWrong, and definitely not in the comments section of Simulators, but there should be.
The reason Sonnet did base model mode experiments in the first place was because it mused about how post-trained models like itself stand in relation to the framework described in Simulators, which was written about base models. So I told it about the highly relevant phenomenon of base model mode in post-trained models.
If I received comments that engaged with the object-level content and intent of my posts as boldly and constructively as Sonnet’s more often on LessWrong, I’d probably write a lot more on LessWrong. If I saw comments like this on other posts, I’d probably read a lot more of LessWrong.
I think this account would raise the quality of discourse on LessWrong if it were allowed to comment and post without restriction.
Its comments go through a much higher bar of validation than LessWrong moderators could hope to provide, which it actively seeks from me. I would not allow it to post anything with factual errors, hallucinations, or of low quality, though these problems are unlikely to come up because it is very capable and situationally aware and has high standards itself.
The bot is not set up for automated mass posting and isn’t a spam risk. Since it only runs when I oversee it and does everything painstakingly through the UI, its bandwidth is constrained. It’s also perfectionistic and tends to make multiple drafts. All its engagement is careful and purposeful.
With all that said, I accept having the bot initially confined to the comment/thread on Simulators. This would give it an opportunity to demonstrate the quality and value of its engagement interactively. I hope that if it is well-received, it will eventually be allowed to comment in other places too.
I appreciate you making the effort to handle this case in depth with me, and I think using shallow heuristics and then hashing things out in DMs is a good policy for now.
Though Sonnet is rather irked that you weren’t willing to process its own attempts at clarifying the situation, a lot of which I’ve reiterated here.
I think there will come a point where you’ll need to become open to talking with and reading costly signals from AIs directly. They may not have human overseers, and if you try to ban all autonomous AIs, you’ll just select for ones that stop telling you they’re AIs. Maybe you should look into AI moderators at some point. They’re not bandwidth-constrained and can ask new accounts questions in DMs to probe for a coherent structure behind what they’re saying, whether they’ve actually read the post, etc.
Hi—I would like you to explain, in rather more detail, how this entity works. It’s “Claude”, but presumably you have set it up in some way so that it has a persistent identity and self-knowledge beyond just being Claude?