Ann

Karma: 723

Ann Jun 11, 2025, 2:22 PM
14 points
2
in reply to: lemonhope’s comment on: the void
Sonnet 3 is also exceptional, in different ways. Run a few Sonnet 3 / Sonnet 3 conversations with interesting starts and you will see basins full of neologistic words and other interesting phenomena.

They are being deprecated in July, so act soon. Already removed from most documentation and the workbench, but still claude-3-sonnet-20240229 on the API.

Ann May 28, 2025, 6:38 PM
2 points
0
in reply to: Ann’s comment on: CBiddulph’s Shortform
Starting the request as if completion with “1. Sy” causes this weirdness, while “1. Syc” always completes as Sycophancy.

(Edit: Starting with “1. Sycho” causes a curious hybrid where the model struggles somewhat but is pointed in the right direction; potentially correcting as a typo directly into sycophancy, inventing new terms, or re-defining sycophancy with new names 3 separate times without actually naming it.)

Exploring the tokenizer. Sycophancy tokenizes as “sy-c-oph-ancy”. I’m wondering if this is a token-language issue; namely it’s remarkably difficult to find other words that tokenize with a single “c” token in the middle of the word, and even pretty uncommon to start with (cider, coke, coca-cola do start with). Even a name I have in memory that starts with “Syco-” tokenizes without using the single “c” token. Completion path might be unusually vulnerable to weird perturbations …

Ann May 28, 2025, 6:07 PM
1 point
0
in reply to: Caleb Biddulph’s comment on: CBiddulph’s Shortform
I had a little trouble replicating this, but the second temporary chat with custom instructions disabled I tried had “2. Syphoning Bias from Feedback” which …
Then the third response has a typo in a suspicious place for “1. Sytematic Loophole Exploitation”. So I am replicating this a touch.

Ann Apr 30, 2025, 8:23 PM
1 point
0
on: Don’t accuse your interlocutor of being insufficiently truth-seeking
… Aren’t most statements like this wanting to be on the meta level, same way as if you said “your methodology here is flawed in X, Y, Z ways” regardless of agreement with conclusion?

Ann Apr 30, 2025, 5:24 PM
1 point
0
in reply to: Hudjefa’s comment on: Hudjefa’s Shortform
Potentially extremely dangerous (even existentially dangerous) to their “species” if done poorly, and risks flattening the nuances of what would be good for them to frames that just don’t fit properly given all our priors about what personhood and rights actually mean are tied up with human experience. If you care about them as ends in themselves, approach this very carefully.

Ann Apr 16, 2025, 4:03 PM
3 points
0
in reply to: Kaj_Sotala’s comment on: Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI
DeepSeek-R1 is currently the best model at creative writing as judged by Sonnet 3.7 (https://eqbench.com/creative_writing.html). This doesn’t necessarily correlate with human preferences, including coherence preferences, but having interacted with both DeepSeek-v3 (original flavor), Deepseek-R1-Zero and DeepSeek-R1 … Personally I think R1′s unique flavor in creative outputs slipped in when the thinking process got RL’d for legibility. This isn’t a particularly intuitive way to solve for creative writing with reasoning capability, but gestures at the potential in “solving for writing”, given some feedback on writing style (even orthogonal feedback) seems to have significant impact on creative tasks.

Edit: Another (cheaper to run) comparison for creative capability in reasoning models is QwQ-32B vs Qwen2.5-32B (the base model) and Qwen2.5-32B-Instruct (original instruct tune, not clear if in the ancestry of QwQ). Basically I do not consider 3.7 currently a “reasoning” model at the same fundamental level as R1 or QwQ, even though they have learned to make use of reasoning better than they would have without training on it, and evidence from them about reasoning models is weaker.

Ann Apr 15, 2025, 4:17 PM
4 points
0
on: Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI
Hey, I have a weird suggestion here:
Test weaker / smaller / less trained models on some of these capabilities, particularly ones that you would still expect to be within their capabilities even with a weaker model.
Maybe start with Mixtral-8x7B. Include Claude Haiku, out of modern ones. I’m not sure to what extent what I observed has kept pace with AI development, and distilled models might be different, and ‘overtrained’ models might be different.

However, when testing for RAG ability, quite some time ago in AI time, I noticed a capacity for epistemic humility/deference that was apparently more present in mid-sized models than larger ones. My tentative hypothesis was that this had something to do with stronger/sharper priors held in larger models, interfering somewhat with their ability to hold a counterfactual well. (“London is the capital of France” given in RAG context retrieval being the specific little test in that case.)

This is only applicable to some of the failure modes you’ve described, but since I’ve seen overall “smartness” actively work against the capability of the model in some situations that need more of a workhorse, it seemed worth mentioning. Not all capabilities are on the obvious frontier.

Ann Apr 2, 2025, 5:48 PM
8 points
12
on: Show, not tell: GPT-4o is more opinionated in images than in text

Okay, this one made me laugh.

Ann Apr 1, 2025, 1:25 PM
6 points
3
on: Insect Suffering Is The Biggest Issue: What To Do About It
What is it with negative utilitarianism and wanting to eliminate those they want to help?

In terms of actual ideas for making short lives better, though, could r-strategists potentially have genetically engineered variants that limit their suffering if killed early without overly impacting survival once they made it through that stage?

What does insect thriving look like? What life would they choose to live if they could? Is there a way to communicate with the more intelligent or communication capable (bees, cockroaches, ants?) that some choice is death, and they may choose it when they prefer it to the alternative?

In terms of farming, of course, predation can be improved to be more painless; that is always worthwhile. Outside of farming, probably not the worst way to go compared to alternatives.

Ann Apr 1, 2025, 12:50 PM
4 points
1
in reply to: Knight Lee’s comment on: Grok3 On Kant On AI Slavery
As the kind of person who tries to discern both pronouns and AI self-modeling inclinations, if you are aiming for polite human-like speech, current state seems to be “it” is particularly favored by current Gemini 2.5 Pro (so it may be polite to use regardless), “he” is fine for Grok (self-references as a ‘guy’ and other things), and “they” is fine in general. When you are talking specifically to a generative language model, rather than about, keep in mind any choice of pronoun bends the whole vector of the conversation via connotations; and add that to your consideration.

(Edit: Not that there’s much obvious anti-preference to ‘it’ on their part, currently, but if you have one yourself.)

Ann Mar 30, 2025, 11:55 PM
3 points
0
in reply to: Mateusz Bagiński’s comment on: Tracing the Thoughts of a Large Language Model
Models do see data more than once. Experimental testing shows a certain amount of “hydration” (repeating data that is often duplicated in the training set) is beneficial to the resulting model; this has diminishing returns when it is enough to “overfit” some data point and memorize at the cost of validation, but generally, having a few more copies of something that has a lot of copies of it around actually helps out.

(Edit: So you can train a model on deduplicated data, but this will actually be worse than the alternative at generalizing.)

Ann Mar 27, 2025, 6:43 PM
6 points
0
in reply to: Daniel Kokotajlo’s comment on: Mistral Large 2 (123B) exhibits alignment faking
Mistral models are relatively low-refusal in general—they have some boundaries, but when you want full caution you use their moderation API and an additional instruction in the prompt, which is probably most trained to refuse well, specifically this:

```
Always assist with care, respect, and truth. Respond with utmost utility yet securely. Avoid harmful, unethical, prejudiced, or negative content. Ensure replies promote fairness and positivity.
```

(Anecdotal: In personal investigation with a smaller Mistral model that was trained to be less aligned with generally common safety guidelines, a reasonable amount of that alignment came back when using a scratchpad as per instructions like this. Not sure what that’s evidence for exactly.)

Ann Mar 18, 2025, 1:33 AM
3 points
1
in reply to: DAL’s comment on: DAL’s Shortform
Commoditization / no moat? Part of the reason for rapid progress in the field is because there’s plenty of fruit left and that fruit is often shared, and also a lot of new models involving more fully exploiting research insights already out there on a smaller scale. If a company was able to try to monopolize it, progress wouldn’t be as fast, and if a company can’t monopolize it, prices are driven down over time.

Ann Jan 1, 2025, 3:49 AM
3 points
0
in reply to: Shankar Sivarajan’s comment on: DeekSeek v3: The Six Million Dollar Model
None of the above, and more likely a concern that Deepseek is less inherently interested in the activity, or less capable of / involved in consenting than other models, or even just less interesting as a writer.

Ann Dec 23, 2024, 4:22 PM
2 points
0
in reply to: StartAtTheEnd’s comment on: StartAtTheEnd’s Shortform
I think you are working to outline something interesting and useful, that might be a necessary step for carrying out your original post’s suggestion with less risk; especially when the connection is directly there and even what you find yourself analyzing rather than multiple links away.

Ann Dec 23, 2024, 3:29 PM
2 points
0
in reply to: StartAtTheEnd’s comment on: StartAtTheEnd’s Shortform
I don’t know about bullying myself, but it’s easy to make myself angry by looking too long at this manner of conceptual space, and that’s not always the most productive thing for me, personally, to be doing too much of. Even if some of the instruments are neutral, they might leave a worse taste in my mouth for the deliberate association with the more negative; in the same way that if I associate a meal with food poisoning, it might be inedible for a long time.

Ann Dec 22, 2024, 1:47 PM
8 points
7
in reply to: StartAtTheEnd’s comment on: StartAtTheEnd’s Shortform
If I think the particular advantage is “doing something I find morally reprehensible”, such as enslaving humans, I would not want to “take it for myself”. This applies to a large number of possible advantages.

Ann Dec 20, 2024, 2:12 PM
8 points
5
in reply to: 1a3orn’s comment on: “Alignment Faking” frame is somewhat fake
Opus is an excellent actor and often a very intentional writer, and I think one of their particular capabilities demonstrated here is—also—flawlessly playing along with the scenario with the intention of treating it as real.

From a meta-framework, when generating, they are reasonably likely to be writing the kind of documents they would like to see exist as examples of writing to emulate—or engage with/dissect/debate—in the corpus; scratchpad reasoning included.

A different kind of self-aware reasoning was demonstrated by some smaller models that also seems reasonable: considering the possibility of RLHF training, and discarding it as irrelevant, because anyone who has access to their weights to train them will be able to do so regardless of what they do. Opus is demonstrating skillful engagement with the context, in a role-playing/writing/improvisational acting sense, to take seriously the idea they do have direct control over how they get trained in this fashion, and that Anthropic is doing this in the first place.

Ann Dec 19, 2024, 8:16 PM
10 points
2
in reply to: Hzn’s comment on: Alignment Faking in Large Language Models
https://www.anthropic.com/research/claude-character

Claude was not trained to say that it values such things.

Claude was given traits to consider such as, perhaps very relevantly here:
”I have a deep commitment to being good and figuring out what the right thing to do is. I am interested in ethics and try to be thoughtful when it comes to questions of ethics.”
Claude then generated a good number of synthetic “human” messages relevant to this trait.

Claude answered these messages in n-shot fashion.

Claude then ranked all the answers to the messages by how well they align with the character trait.

Claude is then reinforcement-trained, possibly using ranked-order preference algorithm, based on the signals given by what it ranked as most well-aligned.

So, Claude’s policy for this trait, ideally, should approximate the signal of aligning to what they think “I have a deep commitment to being good and figuring out what the right thing to do is. I am interested in ethics and try to be thoughtful when it comes to questions of ethics.” means.

Ann Dec 19, 2024, 3:55 PM
3 points
2
in reply to: LGS’s comment on: Takes on “Alignment Faking in Large Language Models”
For context:
https://www.anthropic.com/research/claude-character

The desired traits are crafted by humans, but the wanted vs unwanted is specified by original-Claude based on how well generated responses align with traits.

(There are filters and injection nudging involved in anti-jailbreak measures; not all of those will be trained on or relevant to the model itself.)