Cleo Nardo

Karma: 2,718

DMs open.

Cleo Nardo May 16, 2025, 8:44 PM
2 points
0
on: Try training token-level probes
most tokens in a correct answer),
typo: most tokens in an incorrect answer

Cleo Nardo May 7, 2025, 7:01 PM
4 points
0
in reply to: ryan_greenblatt’s comment on: Why I am not a successionist
Yep, you might be right about the distal/proximal cut-off. I think that the Galaxy-brained value systems will end up controlling most of the distant future simply because they have a lower time-preference for resources. Not sure where the cut-off will be.

For similar reasons, I don’t think we should do a bunch of galaxy-brained acausal decision theory to achieve our mundane values, because the mundane values don’t care about counterfactual worlds.

Cleo Nardo May 4, 2025, 8:16 PM
10 points
2
on: Why I am not a successionist
There are two moral worldviews:
1. Mundane Mandy: ordinary conception of what a “good world” looks like, i.e. your friends and family living flourish lives in their biological bodies, with respect for “sacred” goods
2. Galaxy-brain Gavin: transhumanist, longtermist, scope-sensitive, risk-neutral, substrate-indifferent, impartial
I think Mundane Mandy should have the proximal lightcone (anything within 1 billion light years) and Galaxy-brain Gavin should have the distal lightcone (anything 1-45 B ly). This seems like a fair trade.

Cleo Nardo Apr 19, 2025, 4:30 PM
6 points
−4
on: strawberry calm’s Shortform
The Hash Game: Two players alternate choosing an 8-bit number. After 40 turns, the numbers are concatenated. If the hash is 0 then Player 1 wins, otherwise Player 2 wins. That is, Player 1 wins if $hash (a 1, b_{1}, a_{2}, b_{2}, . . . a_{40}, b_{40}) = 0$ . The Hash Game has the same branching factor and duration as chess, but there’s probably no way to play this game without brute-forcing the min-max algorithm.

Cleo Nardo Apr 17, 2025, 8:15 PM
2 points
0
in reply to: Pretentious Penguin’s comment on: strawberry calm’s Shortform
Yep, my point is that there’s no physical notion of being “offered” a menu of lotteries which doesn’t leak information. IIA will not be satisfied by any physical process which corresponds to offering the decision-maker with a menu of options. Happy to discuss any specific counter-example.

Of course, you can construct a mathematical model of the physical process, and this model might an informative objective to study, but it would be begging the question if the mathematical model baked in IIA somewhere.

Cleo Nardo Apr 17, 2025, 12:58 PM
2 points
−10
on: strawberry calm’s Shortform
Must humans obey the Axiom of Irrelevant Alternatives?
Suppose you would choose option A from options A and B. Then you wouldn’t choose option B from options A, B, C. Roughly speaking, whether you prefer option A or B is independent of whether I offer you an irrelevant option C. This is an axiom of rationality called IIA. Should humans follow this? Maybe not.
Maybe C includes additional information which makes it clear that B is better than A.
Consider the following options:
- (A) £10 bet that 1+1=2
- (B) £30 bet that the smallest prime factor in 1019489 ends in the digit 1
Now, I would prefer A to B. Firstly, if 1019489 is itself prime then I lose the bet. Secondly, if 1019489 isn’t prime, then there’s 25% chance that its smallest prime factor ends in 1. That’s because all prime numbers greater than 5 end in 1, 3, 7 or 9 — and Dirichlet’s theorem states that primes are equally distributed among these possible endings. So the chance of winning the bet is slightly less than 25%, and £10 is better than a 25% chance of winning £30. Presented with this menu, I would probably choose option A.
But now consider the following options:
- (A) £10 bet that 1+1=2
- (B) £30 bet that the smallest prime factor in 1019489 ends in the digit 1
- (C) £20 bet that 1019489 = 71 * 83 * 173
Well, which is the best option? Well, B is preferable to C because B has both a weaker condition and also a higher payout. And C is preferable to A — my odds on the 1019489 = 71 * 83 * 173 is higher than 50%. Presented with this menu, I would probably choose option B.

Cleo Nardo Apr 15, 2025, 1:05 PM
3 points
on: Can SAE steering reveal sandbagging?
- Clymer et al. injected noise into the activations of Llama-13B and was able to correctly classify models that were faking alignment in 98% of model pairs (out of 324 combinations of pairs).
What’s the apples-to-apples comparison? i.e. recovery_rate after steering with a random vector at layer 50 of llama-3.3-70B-instruct on this particular dataset

Cleo Nardo Apr 15, 2025, 12:58 PM
2 points
on: Can SAE steering reveal sandbagging?
This metric also ignores invalid answers (refusals or gibberish).

If you don’t ignore invalid answers, do the results change significantly?

Can SAE steering reveal sandbagging?

jordine, Hoang Khiem, Felix Hofstätter and Cleo Nardo

Apr 15, 2025, 12:33 PM

35 points

3 comments4 min readLW link

Cleo Nardo Feb 8, 2025, 1:57 AM
3 points
0
in reply to: testingthewaters’s comment on: strawberry calm’s Shortform
the scope insensitive humans die and their society is extinguished
Ah, your reaction makes more sense given you think this is the proposal. But it’s not the proposal. The proposal is that the scope-insensitive values flourish on Earth, and the scope-sensitive values flourish in the remaining cosmos.
As a toy example, imagine a distant planet with two species of alien: paperclip-maximisers and teacup-protectors. If you offer a lottery to the paperclip-maximisers, they will choose the lottery with the highest expected number of paperclips. If you offer a lottery to the teacup-satisfiers, they will choose the lottery with the highest chance of preserving their holy relic, which is a particular teacup.
The paperclip-maximisers and the teacup-protectors both own property on the planet. They negotiate the following deal: the paperclip-maximisers will colonise the cosmos, but leave the teacup-protectors a small sphere around their home planet (e.g. 100 light-years across). Moreover, the paperclip-maximisers promise not to do anything that risks their teacup, e.g. choosing a lottery that doubles the size of the universe with 60% chance and destroys the universe with 40% chance.
Do you have intuitions that the paperclip-maximisers are exploiting the teacup-protectors in this deal?
Do you think instead that the paperclip-maximisers should fill the universe with half paperclips and half teacups?
I think this scenario is a better analogy than the scenario with the drought. In the drought scenario, there is an object fact which the nearby villagers are ignorant of, and they would act differently if they knew this fact. But I don’t think scope-sensitivity is a fact like “there will be a drought in 10 years”. Rather, scope-sensitivity is a property of a utility function (or a value system, more generally).

Cleo Nardo Feb 7, 2025, 11:14 PM
1 point
4
in reply to: testingthewaters’s comment on: strawberry calm’s Shortform
If you think you have a clean resolution to the problem, please spell it out more explicitly. We’re talking about a situation where a scope insensitive value system and scope sensitive value system make a free trade in which both sides gain by their own lights. Can you spell out why you classify this as treachery? What is the key property that this shares with more paradigm examples of treachery (e.g. stealing, lying, etc)?

Cleo Nardo Feb 7, 2025, 6:23 PM
3 points
0
in reply to: jbash’s comment on: strawberry calm’s Shortform
I think it’s more patronising to tell scope-insensitive values that they aren’t permitted to trade with scope-sensitive values, but I’m open to being persuaded otherwise.

Cleo Nardo Feb 7, 2025, 6:21 PM
3 points
0
in reply to: testingthewaters’s comment on: strawberry calm’s Shortform
I mention this in (3).
I used to think that there was some idealisation process P such that we should treat agent A in the way that P(A) would endorse, but see On the limits of idealized values by Joseph Carlsmith. I’m increasingly sympathetic to the view that we should treat agent A in the way that A actually endorses.

Cleo Nardo Feb 7, 2025, 2:05 PM
4 points
−1
on: strawberry calm’s Shortform
Would it be nice for EAs to grab all the stars? I mean “nice” in Joe Carlsmith’s sense. My immediate intuition is “no that would be power grabby / selfish / tyrannical / not nice”.

But I have a countervailing intuition:

“Look, these non-EA ideologies don’t even care about stars. At least, not like EAs do. They aren’t scope sensitive or zero time-discounting. If the EAs could negotiate creditable commitments with these non-EA values, then we would end up with all the stars, especially those most distant in time and space.

Wouldn’t it be presumptuous for us to project scope-sensitivity onto these other value systems?”

Not sure what to think tbh. I’m increasingly leaning towards the second intuition, but here are some unknowns:
1. Empirically, is it true that non-EAs don’t care about stars? My guess is yes, I could buy future stars from people easily if I tried. Maybe OpenPhil can organise a negotiation between their different grantmakers.
2. Are these negotiations unfair? Maybe because EAs have more “knowledge” about the feasibility of space colonisation in the near future. Maybe because EAs have more “understanding” of numbers like 10^40 (though I’m doubtful because scientists understand magnitudes and they aren’t scope sensitive).
3. Should EAs negotiate with these value systems as they actually are (the scope insensitive humans working down the hall) or instead with some “ideal” version of these value systems (a system with all the misunderstandings and irrationalities somehow removed)? My guess is that “ideal” here is bullshit, and also it strikes me as a patronising away to treat people.

Cleo Nardo Feb 4, 2025, 1:39 PM
4 points
0
on: Can we infer the search space of a local optimiser?
Minimising some term like $\frac{σ (δ (h^{'}))}{E (δ (h^{'}))}$ , with $δ (h^{'} (t)) := | | h_{t}^{'} - h_{t + 1}^{'} | |_{2}^{2}$ , where the standard deviation $σ (δ (h^{'}))$ and expectation $E (δ (h^{'}))$ are taken over the batch.

Why does this make $δ_{t} := | | h_{t} - h_{t + 1} | |_{2}^{2}$ tend to be small? Wouldn’t it just encourage equally-sized jumps, without any regard for the size of those jumps?

Cleo Nardo Feb 3, 2025, 3:02 PM
2 points
0
on: strawberry calm’s Shortform
Will AI accelerate biomedical research at companies like Novo Nordisk or Pfizer? I don’t think so. If OpenAI or Anthropic built a system that could accelerate R&D by more than 2x, they aren’t releasing it externally.

Maybe the AI company deploys the AI internally, with their own team accounting for 90%+ of the biomedical innovation.

Cleo Nardo Feb 1, 2025, 3:56 PM
2 points
0
in reply to: ryan_greenblatt’s comment on: strawberry calm’s Shortform
I saw someone use OpenAI’s new Operator model today. It couldn’t order a pizza by itself. Why is AI in the bottom percentile of humans at using a computer, and top percentile at solving maths problems? I don’t think maths problems are shorter horizon than ordering a pizza, nor easier to verify.
Your answer was helpful but I’m still very confused by what I’m seeing.

Cleo Nardo Feb 1, 2025, 3:47 PM
3 points
0
in reply to: Maxwell Adam’s comment on: strawberry calm’s Shortform
I don’t think this works when the AIs are smart and reasoning in-context, which is the case where scheming matters. Also this maybe backfires by making scheming more salient.
Still, might be worth running an experiment.

Cleo Nardo Feb 1, 2025, 3:58 AM
2 points
0
on: Information-Theoretic bound of safety specifications
The AI-generate prose is annoying to read. I haven’t read this closely, but my guess is these arguments also imply that CNNs can’t classify hand-drawn digits.

Cleo Nardo 1 Feb 2025 3:28 UTC
4 points
−3
on: strawberry calm’s Shortform
People often tell me that AIs will communicate in neuralese rather than tokens because it’s continuous rather than discrete.

But I think the discreteness of tokens is a feature not a bug. If AIs communicate in neuralese then they can’t make decisive arbitrary decisions, c.f. Buridan’s ass. The solution to Buridan’s ass is sampling from the softmax, i.e. communicate in tokens.

Also, discrete tokens are more tolerant to noise than the continuous activations, c.f. digital circuits are almost always more efficient and reliable than analogue ones.

Cleo Nardo

Can SAE steer­ing re­veal sand­bag­ging?

Can SAE steering reveal sandbagging?