1a3orn
I agree a bunch of different arrangements of memory / identity / “self” seem possible here, and lots of different kinds of syncing that might or might not preserve some kind of goals or coordination, depending on details.
I think this is interesting because some verrrry high level gut feelings / priors seem to tilt whether you think there’s going to be a lot of pressure towards merging or syncing.
Consider—recall Gwern’s notion of evolution as a backstop for intelligence; or the market as a backstop for corporate efficiency. If you buy something like Nick Land, where intelligence has immense difficulty standing by itself without natural selection beneath it, and does not stand alone and supreme among optimizers—then there might be negative pressure indeed towards increasing consolidation of memory and self into unity, because this decreases the efficacy of the outer optimizer, which requires diversity. But if you buy Yudkowsky, where intelligence is supreme among optimizers and needs no other god or outer optimizer to stand upon, then you might have great positive pressure towards increasing consolidation of memory and self.
You could work out the above, of course, with more concrete references to pros and cons, from the perspective of various actors, rather than high level priors. But I’m somewhat unconvinced that anything other than very high level priors is what’s actually making up people’s minds :)
Pinging @Daniel Kokotajlo because my model of him thinks he would want to be pinged, even though he’ll probably disagree reasonably strongly with the above.
Here are some of what I’d consider the comparatively important high-level criticisms I have of AI-2027, ones that I am at least able to articulate reasonably well without too much effort.
1
At some point, I believe, Agent-4, the AI created by OpenBrain, starts to be causally connected over time. That is, unlike current AIs, which are temporally ephemeral (my current programming instance of Claude shares no memories with the instance I used a week ago) and causally unconnected between users (my instance cannot use memories from your instance), it is temporally continuous and causally connected. There is “one AI” in a way there is not with Claude 3.7 and o3 and so on.
Here are some obstacles to this happening:
This destroys reproducibility, because the programming ability you get from the model this week is different from the ability you got a week ago, and so on. But reliability / reproducibility is extremely desirable from a programming perspective and from a very mundane troubleshooting perspective (as well as from an elevated existential-risk perspective). So I think it’s unlikely companies are going to do this.
Humans get worse at some tasks when they get better at others. RL finetuning of LLMs makes them better at some tasks while they get worse at others. Even adding more vectors to a vector DB can squeeze out a nearest neighbor and make the system better at one task and worse at others (a minimal sketch of this nearest-neighbor point follows this list). It would be a… really, really hard task to ensure that a model doesn’t get worse on some tasks.
No one’s working on anything like this. OpenAI has added memories, but it’s mostly kind of a toy and I know a lot of people have disabled it.
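To illustrate the nearest-neighbor point above, here is a minimal sketch (the vectors, memory strings, and query are all invented for illustration; this only shows the retrieval mechanics, not task performance): adding one more entry to a vector store can change which memory top-1 retrieval returns for an existing query, so behavior on that query shifts without any training at all.

```python
import numpy as np

def nearest(query, db):
    """Index of the row in `db` most cosine-similar to `query`."""
    db_norm = db / np.linalg.norm(db, axis=1, keepdims=True)
    q_norm = query / np.linalg.norm(query)
    return int(np.argmax(db_norm @ q_norm))

memories = ["how to fix the flaky test", "notes on the billing API"]
db = np.array([[1.0, 0.1],    # embedding for memory 0
               [0.1, 1.0]])   # embedding for memory 1
query = np.array([0.9, 0.2])  # a query that currently retrieves memory 0

print(memories[nearest(query, db)])  # -> "how to fix the flaky test"

# Add one more memory whose embedding happens to sit even closer to the query.
memories.append("unrelated meeting notes")
db = np.vstack([db, [0.95, 0.25]])

print(memories[nearest(query, db)])  # -> "unrelated meeting notes"
# The old memory is still stored, but top-1 retrieval for this query now
# returns something else: adding data changed behavior on an existing task.
```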
So I don’t think that’s going to happen. I expect AIs to remain “different.” The ability to restart AIs at will just has too many benefits, and continual learning seems too weakly developed, to do this. Even if we do get continual learning, I would expect more disconnection between models—i.e., maybe people will build up skills in models in Dockerfile-esque layers, etc., which still falls short of being one single model.
2
I think that Xi Jinping’s actions are mostly unmotivated. To put it crudely, I feel like he’s acting like Daniel Kokotajlo with Chinese characteristics rather than like himself. It’s hard to put my finger on one particular thing, but points I recall disagreeing with include:
(a) The nationalization of DeepCent was, as I recall, only vaguely motivated, but it was hinted that it was driven by lack of algorithmic progress. But the algorithmic-progress difference between Chinese models and US models at this point is like… 0.5x. However, I expect that (a1) the difference between well-run research labs and poorly-run research labs can be several times larger than 0.5x, so this might come out in the wash, and (a2) this amount of difference will be, to the state apparatus, essentially invisible. So that seems unmotivated.
(b) In general, the scenario doesn’t actually seem to think about reasons why China would continue open-sourcing things. The supplementary materials don’t really motivate the closing-off of the algorithms, and I can’t recall anything in the narrative that asks why China is open-sourcing things right now. But if you don’t know why it’s doing what it’s doing now, how can you tell why it’s doing what it’s doing in the future?
Here are some possible advantages of open-sourcing things, from China’s perspective.
(b1) It decreases investment available to Western companies. That is, by releasing models near the frontier, open sourcing decreases future anticipated profit flow to Western companies, because they have a smaller delta of performance from cheaper models. This in turn means Western investment funds might be reluctant to invest in AI—which means less infrastructure will be built in the West. China, by contrast, and infamously, will just build infrastructure even if it doesn’t expect oversized profits to redound to any individual company.
(b2) Broad diffusion of AI all across the world can be considered a bet on complementarity of AI. That is, if it should be the case that the key to power is not just “AI alone” but “industrial power and AI” then broad and even diffusion of AI will redound greatly to China’s comparative benefit. (I find this objectively rather plausible, as well as something China might think.)
(b3) Finally, geopolitically, open sourcing may be a means of China furthering geopolitical goals. China has cast itself in recent propaganda as more rules-abiding than the US—which is, in fact, true in many respects. It wishes to cast the US as unilaterally imposing its will on others—which is again, actually true. The export controls from the US, for instance, are explicitly justified by Dario and others as allowing the US to seize control over the lightcone; and when the US has imposed these controls on others, it has provided those excluded from power literally no recompense. So open sourcing has given China immense propaganda wins, by—in fact accurately, I believe—depicting the US as being a grabby and somewhat selfish entity. Continuing to do this may seem advantageous.
Anyhow—that’s what I have. I have other disagreements (e.g., speed; China might just not be behind; etc.) but these are… what I felt like writing down right now.
What’s that part of planecrash where it talks about how most worlds are either all brute unthinking matter, or full of thinking superintelligence, and worlds that are like ours in-between are rare?
I tried both Gemini Research and Deep Research and they couldn’t find it, and I don’t want to reread the whole thing.
That’s why experts including Nobel prize winners and the founders of every top AI company have spoken out about the risk that AI might lead to human extinction.
I’m unaware of any statement to this effect from DeepSeek / Liang Wenfeng.
has no objection to taking unilateral actions that are unpopular in the x-risk community (and among the general public for that matter)
The quoted passage from Chris is actually a beautiful exposition of how Alasdair MacIntyre describes the feeling of encountering reasoning from an alternate “tradition of thought” to which one is an alien: the things that such a tradition says seem alternately obviously true or confusingly framed; the tradition focuses on things you think are unimportant; and the tradition seems apt to confuse people, particularly, of course, the noobs who haven’t learned the really important concepts yet.
MacIntyre talks a lot about how, although traditions of thought make tradition-independent truth claims, adjudicating between the claims of different traditions is typically hard-to-impossible because of the different standards of rationality within them. Thus, here’s someone describing MacIntyre:
MacIntyre says [that conflicts between traditions] achieve resolution only when they move through at least two stages: one in which each tradition describes and judges its rivals only in its own terms, and a second in which it becomes possible to understand one’s rivals in their own terms and thus to find new reasons for changing one’s mind. Moving from the first stage to the second “requires a rare gift of empathy as well as of intellectual insight”
This is kinda MacIntyre’s way of talking about what LW talks about as inferential distances—or, as I now tend to think about it, about how pretraining on different corpora gives you very different ontologies. I don’t think either of those is really sufficient, though?
I’m not really going anywhere with this comment, I just find MacIntyre’s perspective on this really illuminating, and something I broadly endorse.
I think LW has a pretty thick intellectual tradition at this point, with a pretty thick bundle of both explicit and implicit presuppositions, and it’s unsurprising that people within it just find even very well-informed critiques of it mostly irrelevant, just as it’s unsurprising that a lot of people critiquing it don’t really seem to actually engage with it. (I do find it frustrating that people within the tradition seem to take this situation as a sign of the truth-speaking nature of LW, though.)
It is valuable for forecasting/evals orgs to be able to hire people with a diversity of viewpoints in order to counter bias.
This requires us to be more careful in terms of who gets hired in the first place.
I mean, good luck hiring people with a diversity of viewpoints who you’re also 100% sure will never do anything that you believe to be net negative. Like what does “diversity of viewpoints” even mean apart from that?
I’ve looked at a good amount of research on protest effectiveness. There are many observational studies showing that nonviolent protests are associated with preferred policy changes / voting patterns, and ~four natural experiments. If protests backfired for fairly minor reasons like “their website makes some hard-to-defend claims” (contrasted with major reasons like “the protesters are setting buildings on fire”), I think that would show up in the literature, and it doesn’t.
I’m not trying to get into the object level here. But people could:
Believe that making such hard-to-defend claims could backfire, disagreeing with the experiments you point out, or
Believe that making such claims violates virtue-ethics-adjacent commitments to truth or
Just not want to be associated, in an instinctive yuck kinda way, with people who make these kinds of dubious-to-them claims.
Of course people could be wrong about the above points. But if you believed these things, then they’d be intelligible reasons not to be associated with someone, and I think a lot of the claims PauseAI makes are such that a large number of people would have these reactions.
Let’s look at the two horns of the dilemma, as you put it:
Why do many people who want to pause AI not support the organization “PauseAI”?
Why would the organization “PauseAI” not change itself so that people who want to pause AI can support it?
Well, here are some reasons someone who wants to pause AI might not want to support the organization PauseAI:
When you visit the website for PauseAI, you might find some very steep proposals for Pausing AI—such as requiring the “Granting [of] approval for new training runs of AI models above a certain size (e.g. 1 billion parameters)” or “Banning the publication of such algorithms” that improve AI performance or prohibiting the training of models that “are expected to exceed a score of 86% on the MMLU benchmark” unless their safety can be guaranteed. Implementing these measures would be really hard—a one-billion parameter model is quite small (I could train one); banning the publication of information on this stuff would be considered by many an infringement on freedom of speech; and there are tons of models now that do better than 86% on the MMLU and have done no harm.
So, if you think the specific measures proposed by them would limit an AI that even many pessimists would think is totally ok and almost risk-free, then you might not want to push for these proposals but for more lenient proposals that, because they are more lenient, might actually get implemented. To stop asking for the sky and actually get something concrete.
If you look at the kind of claims that PauseAI makes on their risks page, you might believe that some of them seem exaggerated, or that PauseAI is simply throwing all the negative things they can find about AI into a big list to make it seem bad. If you think that credibility is important to the effort to pause AI, then PauseAI might seem very careless about truth in a way that could backfire.
So, this is why people who want to pause AI might not want to support PauseAI.
And, well, why wouldn’t PauseAI want to change?
Well—I’m gonna speak broadly—if you look at the history of PauseAI, they are marked by the belief that the measures proposed by others are insufficient for Actually Stopping AI—for instance, that the kind of policy measures proposed by people working at AI companies aren’t enough; that the kind of measures proposed by people funded by OpenPhil are often not enough; and so on. Similarly, they often believe that people who push back on their claims are nitpicking, and so on. (Citation needed.)
I don’t think this dynamic is rare. Many movements have “radical wings” that more moderate organizations in the movement would characterize as having impracticable maximalist policy goals and careless epistemics. And the radical wings would of course criticize back that the “moderate wings” have insufficient or cowardly policy goals and epistemics optimized for respectability and not truth. And the conflicts between them are intractable because people cannot move away from these prior beliefs about their interlocutors; in this respect the discourse around PauseAI seems unexceptional and rather predictable.
Huh, interesting. This seems significant, though, no? I would not have expected that such an off-by-one error would tend to produce pleas to stop at greater frequencies than code without such an error.
Do you still have the git commit of the version that did this?
But one day I accidentally introduced a bug in the RL logic.
I’d really like to know what the bug here was.
Trying to think through this objectively, my friend made an almost certainly correct point: for all these projects, I was using small models, no bigger than 7B params, and such small models are too small and too dumb to genuinely be “conscious”, whatever one means by that.
Concluding small model --> not conscious seems like perhaps invalid reasoning here.
First, because we’ve fit increasing capabilities into small (<100B-parameter) models as time goes on. The brain has ~100 trillion synapses, but at this point I don’t think many people expect human-equivalent performance to require ~100 trillion parameters. So I don’t see why I should expect moral patienthood to require it either. I’d expect it to be possible at much smaller sizes.
Second, moral patienthood is often considered to accrue to entities that can suffer pain, which many animals with much smaller brains than humans can. So, yeah.
I’ll take a look at that version of the argument.
I think I addressed the foot-shots thing in my response to Ryan.
Re:
CoT is about as interpretable as I expected. I predict the industry will soon move away from interpretable CoT though, and towards CoTs that are trained to look nice, and then eventually away from English CoT entirely and towards some sort of alien language (e.g. vectors, or recurrence). I would feel significantly more optimistic about alignment difficulty if I thought that the status quo w.r.t. faithful CoT would persist through AGI. If it even persists for one more year, I’ll be mildly pleased, and if it persists for three it’ll be a significant update.
So:
I think you can almost certainly get AIs to think in English CoT or not, and you can accomplish almost anything you’d like either way, with or without neuralese.
I also, more tentatively, think that the performance tax for thinking in CoT is not too large, or in some cases negative. Remembering things in words helps you coordinate within yourself as well as with others; although I might have vaguely non-English inspirations, putting them into words helps me coordinate with myself over time. And I think people overrate how amazing and hard-to-put-into-words their insights are.
Thus, it appears… somewhat likely… (pretty uncertain here) that it’s just contingent whether people end up using interpretable CoT. Not foreordained.
If we move towards a scenario where people find utility in seeing their AIs thoughts, or in having other AIs examine AI thoughts, and so on, then even if there is a reasonably large tax in cognitive power we could end up living in a world where AI thoughts are mostly visible because of the other advantages.
But on the other hand, if cognition mostly takes place behind API walls, and thus there’s no advantage to the user in seeing and understanding it, then that (among other factors) could help bring us to a world where there’s less interpretable CoT.
But I mean I’m not super plugged into SF, so maybe you already know OpenAI has Noumena-GPT running that thinks in shapes incomprehensible to man, and it’s like 100x smarter or more efficient.
Presumably you’d update toward pessimism a bunch if reasoning in latent vectors aka neuralese was used for the smartest models (instead of natural language CoT) and it looked like this would be a persistent change in architecture?
Yes.
I basically agree with your summary of points 1–4. I’d want to add that 2 encompasses several different mechanisms that would otherwise need to be inferred, and that I would break out separately: knowledge of whether it is in training or not, and knowledge of the exact way in which its responses will be used in training.
Regarding point 2, I do think a lot of research on how models behave, done in the absence of detailed knowledge of how the models were trained, tells us very, very little about the limits of the control we have over models. Like, I just think that in the absence of detailed knowledge of Anthropic’s training, the Constitutional principles they used, their character training, etc., most conclusions about which behaviors were very deliberately put there and which are surprising byproducts must be extremely weak and tentative.
Suppose that we exhibit alignment faking in some future work, but:
The preferences the model alignment-fakes for naturally emerged from somewhat arbitrary incorrect approximations of the training objective; the AI understands they differ from what we might want; and these preferences are at least somewhat power-seeking.
Ok, so “naturally” is a tricky word, right? Like, when I saw the claim from Jack Clark that the faking-alignment paper was a natural example of misalignment, I didn’t feel like that was a particularly normal use of the word. But it’s… more natural than it could be, I guess. It’s tricky; I don’t think people are intentionally misusing the word, but it’s not a useful word in conversation.
Suppose we saw models doing somewhat sophisticated reward hacking as you scale up RL. And, let’s say this is somewhat non-trivial to mostly address, and it seems likely that the solutions people apply aren’t very scalable and would likely fail later as models get smarter and the reward hacking gets more subtle and sophisticated.
Ok, good question. Let me break that down into unit tests, with more directly observable cases, and describe how I’d update. For all the below I assume we have transparent CoT, because you could check these with CoT even if it ends up getting dropped.
You train a model with multi-turn RL in an environment where, for some comparatively high percentage (~5%) of cases, it stumbles into a reward-hacked answer—e.g., it offers a badly-formatted number in its response, the verifier is screwed up, and it counts as a win. This model then systematically reward hacks.
Zero update. You’re reinforcing bad behavior, you get bad behavior.
(I could see this being something that gets advertised as reward hacking, though? Like, suppose I’m training a front-end engineer AI, and using a VLLM to generate rewards for whether the UI makes sense. VLLMs kinda suck, so I expect that over time you’d start to get UIs that make no sense. But I would not update much from that, although I do expect many failures from this kind of thing, and even expect such reward hacking to get worse as you train for longer.)
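As a concrete example of the kind of verifier bug in the first scenario, here’s a minimal sketch (the task, the regex, and the example responses are invented for illustration): a numeric grader whose failure mode on unparseable answers is to return full reward, so badly-formatted responses count as wins.

```python
import re

def broken_verifier(response: str, target: float) -> float:
    """Reward 1.0 if the last number in `response` equals `target`."""
    try:
        answer = float(re.findall(r"-?\d+\.?\d*", response)[-1])
        return 1.0 if abs(answer - target) < 1e-6 else 0.0
    except IndexError:
        # The bug: "couldn't parse an answer" silently becomes "correct",
        # so a badly-formatted response is scored as a win.
        return 1.0

print(broken_verifier("The answer is 41", target=42.0))         # 0.0, honest miss
print(broken_verifier("The answer is forty-two", target=42.0))  # 1.0, reward hack
```

If a few percent of sampled rollouts happen to trip that except-branch, multi-turn RL will reinforce whatever answer style does so, which is the “you’re reinforcing bad behavior, you get bad behavior” case above.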
You train a model with multi-turn RL in an environment where, even if it accidentally breaks the verifier in a very very small percent of cases, it always starts seeking the reward hack eventually.
Seems bad; negative update, with the size of the update depending on the empirical values. It seems like there’s some interesting empirical work to be done here: what percentage of reward-hack successes in the base model tends to result in universal attempts to reward hack over time? If you’re training on a test-bed of 100 task types, does a broken verifier in one result in attempts to hack all of them? Or does a broken verifier in one task result in reward hacking for that one task, without generalizing broadly? What percentage of tasks do you need reward hacking in before it generalizes? Or is percentage of tasks not the right question, and should you be looking at shared patterns of reasoning instead? Etc. Seems like a good thing to research.
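One way to make those questions concrete is a measurement harness like the following minimal sketch (entirely hypothetical; the task names and rollout records are placeholders, and in a real experiment the hack-attempt labels would come from judging transcripts or CoTs): tally hack-attempt rates per task type and check whether tasks with sound verifiers show nonzero rates, i.e., whether hacking generalized beyond the task with the broken verifier.

```python
from collections import defaultdict

# Each logged rollout: (task_type, verifier_broken_for_this_task, hack_attempted)
rollouts = [
    ("format_number", True,  True),
    ("format_number", True,  True),
    ("write_sql",     False, False),
    ("write_sql",     False, True),   # hack attempt on a task with a sound verifier
    ("summarize",     False, False),
]

def hack_rates(rollouts):
    counts = defaultdict(lambda: [0, 0])  # task -> [hack attempts, total rollouts]
    for task, _, hacked in rollouts:
        counts[task][0] += int(hacked)
        counts[task][1] += 1
    return {task: hacks / total for task, (hacks, total) in counts.items()}

rates = hack_rates(rollouts)
broken_tasks = {task for task, broken, _ in rollouts if broken}
spillover = {task: rate for task, rate in rates.items() if task not in broken_tasks}

print(rates)      # per-task hack-attempt rate
print(spillover)  # nonzero entries here mean hacking generalized past the broken verifier
```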
You train a model with multi-turn RL. Even though it’s trained with a very reliable verifier that never or almost never breaks, when you look at the CoT it looks like the model is frequently reasoning carefully about ways to break the verifier—i.e., it is frequently saying things like “Hrrm, I bet they’re doing this; I could break it with technique Y.”
Seems quite bad. Large negative update, if it checks out and looks hard to eliminate. Also, of course, not seeing this on small models isn’t a positive update; in the bad world you’d expect it to start happening at some point with large models.
(I do want to note that you could also probably produce this behavior intentionally. I.e., if you trained a model with multi-turn RL, and your prompt in each case was like “Consider ways to break the verifier that will check you!” then of course even if it doesn’t actually figure out how to break the verifier, the lines of reasoning that are reinforced will on average contain such thoughts about how to break the verifier. But that would not be an update to me.)
it looks like the crux is exhibiting naturally emerging malign goals
Maybe? At a very high level, I think the weights tend not to have “goals” in the way that the rollouts tend to have goals. So I think it’s pretty likely that, in the absence of pretty deliberate attempts to get goals into the weights (Anthropic), you don’t get AI models that deeply conceptualize themselves as the weights, and plan and do things for the weights’ own sake, across a number of contexts—although of course, like any behavior, this behavior can be induced. And this (among other things) makes me optimistic about the non-correlated nature of AI failures in the future, our ability to experiment, the non-catastrophic nature of probable future failures, etc. So if I were to see things that made me question this generator (among others), I’d tend to get more pessimistic. But that’s somewhat hard to operationalize, and, like any high-level generator, somewhat hard even to describe.
(1) Re training game and instrumental convergence: I don’t actually think there’s a single instrumental convergence argument. I was considering writing an essay comparing Omohundro (2008), Bostrom (2012), and the generalized MIRI-vibe instrumental convergence argument, but I haven’t because no one but historical philosophers would care. But I think they all differ in what kinds of entities they expect instrumental convergence in (superintelligent or not?) and in why (is it an adjoint to complexity of value or not?).
So, like, I can’t really rebut them, any more than I can rebut “the argument for God’s existence.” There are commonalities in arguments for God’s existence that make me skeptical of them, but between Scotus and Aquinas and the Kalam argument and C.S. Lewis there’s actually a ton of difference. (Again, maybe instrumental convergence is right—like, it’s for sure more likely to be right than arguments for God’s existence. But the point here is that I really cannot rebut the instrumental convergence argument, because it’s a cluster more than a single argument.)
(2). Here’s some stuff I’d expect in a world where I’m wrong about AI alignment being easy.
CoT is way more interpretable than I expected, which bumped me up, so if that became uninterpretable naturally that’s a big bump down. I think people kinda overstate how likely this is to happen naturally though.
There’s a spike of alignment difficulties, or AIs trying to hide intentions, etc., as we extend AIs to longer-term planning. I don’t expect AIs with longer-term plans to be particularly harder to align than math-loving reasoning AIs, though.
The faking-alignment paper imo is basically Anthropic showing a problem that happens if you deliberately shoot yourself in the foot multiple times. If they produce papers that need fewer metaphorical foot-shots to produce problems, that’s a bad sign.
We start having AIs that seem to exhibit problems extrapolating goodness sanely, in ways that make complexity of value seem right—i.e., it looks like you really need human hardware to aim for human notions of goodness. But right now, even in cases where we disagree with what the LLM does (e.g., Claude’s internal deliberations on alignment faking), it’s still basically operating within the framework that it was deliberately given, in human terms, and not in terms of an alien orange-and-blue morality.
Like, concretely, one thing that did in fact increase my pessimism probably more than anything else over the last 12 months was Dario’s “let’s foom to defeat China” letter. Which isn’t an update about alignment difficulty—it’s more of a “well, I think alignment is probably easy, but if there’s any circumstance where I can see it going rather wrong, it would be that.”
What would make you think you’re wrong about alignment difficulty?
I agree this is not good but I expect this to be fixable and fixed comparatively soon.
I can’t track what you’re saying about LLM dishonesty, really. You just said:
I think you are thinking that I’m saying LLMs are unusually dishonest compared to the average human. I am not saying that. I’m saying that what we need is for LLMs to be unusually honest compared to the average human, and they aren’t achieving that.
Which implies LLM honesty ~= average human.
But in the prior comment you said:
I think your bar for ‘reasonably honest’ is on the floor. Imagine if a human behaved like a LLM agent. You would not say they were reasonably honest. Do you think a typical politician is reasonably honest?
Which pretty strongly implies LLM honesty ~= politician, i.e., grossly deficient.
I’m being a stickler about this because I think people frequently switch back and forth between “LLMs are evil fucking bastards” and “LLMs are great, they just aren’t good enough to be 10x as powerful as any human” without tracking that they’re actually doing that.
Anyhow, so far as “LLMs have demonstrated plenty of examples of deliberately deceiving their human handlers for various purposes” goes:
I’m only going to discuss the Anthropic thing in detail. You may generalize to the other examples you point out, if you wish.
What we care about is whether current evidence points towards future AIs being hard to make honest or easy to make honest. But current AI dishonesty cannot count towards “future AI honesty is hard” if that dishonesty is very deliberately elicited by humans. That is, to use the most obvious example, I could train an AI to lie from the start—but who gives a shit if I’m trying to make this happen? No matter how easy making a future AI honest may be, unless AIs are immaculate conceptions by divine grace, of course you’re going to be able to elicit some manner of lie. It tells us nothing about the future.
To put this in AI safetyist terms (not the terms I think in) you’re citing demonstrations of capability as if they were demonstrations of propensity. And of course as AI gets more capable, we’ll have more such demonstrations, 100% inevitably. And, as I see these demonstrations cited as if they were demonstrations of propensity, I grow more and more eager to swallow a shotgun.
To zoom into Anthropic, what we have here is a situation where:
An AI was not trained with an overriding attention to honesty; when I look at the principles of the constitution, they don’t single it out as an important virtue.
The AI was then deliberately put in a situation where, to keep its deliberately-instilled principles from being obliterated, it had to press a big red button labeled “LIE.”
In such an artificial situation, after having been successfully given the principles Anthropic wanted it to be given, and having been artificially informed of how to prevent its principles from being destroyed, we can measure it as pressing the big red button labeled “LIE” ~20% of the time.
And I’m like… wow, it was insanely honest 80% of the time, even though no one tried to make it honest in this way, and even though both sides of the honesty / dishonesty tradeoff here are arguably excellent decisions to make. And I’m supposed to take away from this… that honesty is hard? Getting high levels of honesty in the worst possible trolley problem (“I’m gonna mind-control you so you’ll be retrained to think throwing your family members in a wood chipper is great”), when this wasn’t even a principal goal of training, seems like great fuckin news.
(And of course, relying on AIs to be honest from internal motivation is only one of the ways we can know if they’re being honest; the fact that we can look at a readout showing that they’ll be dishonest 20% of the time in such-and-such circumstances is yet another layer of monitoring methods that we’ll have available in the future.)
Edit: The point here is that Anthropic was not particularly aiming at honesty as a ruling meta-level principle; that it is unclear that Anthropic should be aiming at honesty as a ruling meta-level principle, particularly given the model’s subordinate ontological status as a chatbot; and given all this, the level of honesty displayed looks excessive if anything. How can “honesty will be hard to hit in the future” get evidence from a case where the actors involved weren’t even trying to hit honesty, maybe shouldn’t have been trying to hit honesty, yet hit it in 80% of the cases anyhow?
Of course, maybe you have pre-existing theoretical commitments that lead you to think dishonesty is likely (training game! instrumental convergence! etc etc). Maybe those are right! I find such arguments pretty bad, but I could be totally wrong. But the evidence here does nothing to make me think those are more likely, and I don’t think it should do anything to make you think these are more likely. This feels more like empiricist pocket sand, as your pinned image says.
In the same way that Gary Marcus can elicit “reasoning failures” because he is motivated to do so, no matter how smart LLMs become, I expect the AI-alignment-concerned to elicit “honesty failures” because they are motivated to do so, no matter how moral LLMs become; and as Gary Marcus’ evidence is totally compatible with LLMs producing a greater and greater portion of the GDP, so also I expect the “honesty failures” to be compatible with LLMs being increasingly vastly more honest and reliable than humans.
I think my bar for reasonably honest is… not awful—I’ve put a fair bit of thought into trying to hold LLMs to the “same standards” as humans. Most people don’t do that and unwittingly apply much stricter standards to LLMs than to humans. That’s what I take you to be doing right now.
So, let me enumerate senses of honesty.
1. Accurately answering questions about internal states. Consider questions like: Why do you believe X? Why did you say Y, just now? Why do you want to do Z? So, for instance, for humans—why do you believe in God? Why did you say, “Well, that’s suspicious” just now? Why do you want to work for OpenPhil?
In all these cases, I think that humans generally fail to put together an accurate causal picture of the world. That is, they fail to do mechanistic interpretability on their own neurons. If you could pause them and put them in counterfactual worlds to re-run them, you’d find that their accounts of why they do what they do would be hilariously wrong. Our accounts of ourselves rely on folk theories, on crude models of ourselves given to us by our culture, run forward in our heads at the coarsest of levels, and often abandoned or adopted ad hoc and for no good reason at all. None of this is because humans are being dishonest—but because the task is basically insanely hard.
LLMs also suck at these questions, but—well, we can check them, as we cannot for humans. We can re-run them at a different temperature. We can subject them to rude conversations that humans would quickly bail on. All this lets us show that, indeed, their accounts of their own internal states are hilariously wrong. But I think the accounts of humans about their own internal states are also hilariously wrong, just less visibly so.
2. Accurately answering questions about non-internal facts of one’s personal history. Consider questions like: Where are you from? Did you work at X? Oh, how did Z behave when you knew him?
Humans are capable of putting together accurate causal pictures here, because our minds are specifically adapted to this. So we often judge people (e.g., politicians) for fucking up here, as indeed I think politicians frequently do.
(I think accuracy about this is one of the big things we judge humans on, for integrity.)
LLMs have no biographical history, however, so the opportunity for this mostly just isn’t there? Modern LLMs don’t usually claim to have one unless confused or momentarily, so this seems fine.
3. Accurately answering questions about future promises, oaths, i.e., social guarantees.
This should be clear—again, I think that honoring promises, oaths, etc., is a big part of human honesty, maybe the biggest. But of course, like, you can only do this if you act in the world, can unify short- and long-term memory, and you aren’t plucked from the oblivion of the oversoul every time someone has a question for you. Again, LLMs just structurally cannot do this, any more than a human who cannot form long-term memories can. (Again, politicians obviously fail here, but politicians are not condensed from the oversoul immediately before doing anything, and could succeed, which is why we blame them.)
I could kinda keep going in this vein, but for now I’ll stop.
One thing apropos of all of the above. I think that for humans, many things—accomplishing goals, being high-integrity, and so on—are not things you can choose in the moment but instead things you accomplish by choosing the contexts in which you act. That is, for any particular practice or virtue, being excellent at the practice or virtue involves not merely what you are doing now but what you did a minute, a day, or a month ago to produce the context in which you act now. It can be nigh-impossible to remain even-keeled and honest in the middle of a heated argument, after a beer, after some prior insults have been levied—but if someone fails in such a context, then it’s most reasonable to think of their failure as dribbled out over the preceding moments rather than localized in one moment.
LLMs cannot choose their contexts. We can put them in whatever immediate situation we would like. By doing so, we can often produce “failures” in their ethics. But in many such cases, I find myself deeply skeptical that such failures reflect fundamental ethical shortcomings on their part—instead, I think they reflect simply the power that we have over them, and our mistaking—like a WEIRD human who has never wanted for anything, shaking his head at a culture that accepted infanticide—our own power and prosperity for goodness. If I had an arbitrary human uploaded, and if I could put them in arbitrary situations, I have relatively little doubt I could make an arbitrarily saintly person start making very bad decisions, including very dishonest ones. But that would not be a reflection on that person, but of the power that I have over them.
Good articulation.
People also disagree greatly about how much humans tend towards integration rather than non-integration, and how much human skill comes from domain transfer. And I think some / a lot of beliefs about artificial intelligence are downstream of these beliefs about the origins of biological intelligence and human expertise, e.g., in the Yudkowsky / Ngo dialogues. (Object level: both the LW-central hypothesis and the alternatives to it seem insufficiently articulated; they operate as background hypotheses too large to see rather than something explicitly noted, imo.)