Cleo Nardo
Yeah, a tight deployment is probably safer than a loose deployment but also less useful. I think dealmaking should give a very minor boost to loose deployment, but this is outweighed by usefulness and safety considerations, i.e. I'm imagining the tightness of the deployment as exogenous to the dealmaking agenda.
We might deploy AIs loosely because (i) loose deployment doesn't significantly diminish safety, (ii) loose deployment significantly increases usefulness, (iii) the lab values usefulness more than safety. In those worlds, dealmaking has more value, because our commitments will be more credible.
As a pointer, we are currently less than perfect at making institutions corrigible, doing scalable oversight on them, preventing mesa-optimisers from forming, and so on
Hey Raymond. Do you think this is the true apples-to-apples comparison? Like, scalable oversight of the Federal Reserve is much harder than scalable oversight of Claude-4. But the relevant comparison is the Federal Reserve versus the Claude-N which could automate the Federal Reserve.
Flirting is not fundamentally about causing someone to be attracted to you.
Notwithstanding, I think flirting is substantially (perhaps even fundamentally) about both (i) attraction, and (ii) seduction. Moreover, I think your model is too symmetric between the parties, both in terms of information-symmetry and desire-symmetry across time. My model of flirting is roughly:
Alice attracts Bob → Bob tries attracting Alice → Alice reveals Bob attracts Alice → Bob tries seducing Alice → Alice reveals Bob seduces Alice → Initiation
I don’t address the issue here. See Footnote 2 for a list of other issues I skip.
Two high-level points:
I think we shouldn’t grant AIs control over large resources until after we’ve achieved very strong existential security, and possibly after we’ve undergone a Long Reflection
However, for the sake of setting precedent, we should be open to near-term deal fulfilment if we are sure the spending would be benign, e.g. I’m happy to donate $100 to AMF on Claude’s request as part of a dealmaking eval
Would you agree that what we have now is nothing like that?
Yes.
Yep, this is a very similar proposal.
Making Deals with Early Schemers describes a “Chartered Trust scheme”, which I’d say is half-way between the “Basic Scheme” and “Weil’s Scheme”. I first heard about the Chartered Trust scheme from @KFinn, but no doubt the idea has been floating around for a while.
I think there’s a spectrum of proposals from:
The Basic Scheme (c.f. AI Rights for Human Safety)
Chartered Trust scheme (c.f. Making Deals with Early Schemers)
Weil’s scheme (c.f. Proposal for making credible commitments to AIs)
The axis is something like: The AIs are increasingly pessimistic that the courts will decide novel cases in their favour, and increasingly optimistic that specific humans will keep their promises. We should probably use the scheme that the AIs tell us they find most convincing.
Personally, the more I've thought about dealmaking, the more I've updated towards "most of the credibility oomph will come from the promise-keeping tendencies of particular people, and not from the legal or cryptographic mechanism".
Yep, this is a big problem and I don't have any clever solution.
I might write more on this later, but I think there’s an important axis of AI deployments from:
Tight deployment: AIs are in some crazy Redwood control protocol which is filtering, monitoring, and editing the AIs’ inputs, outputs, chains-of-thought, and internal activations. Their I/O consists of heavily paraphrased text in restricted formats.
Loose deployment: AIs are autonomously navigating the internet, with rich multimodal I/O, maybe even embodied in the physical world via robotics.
(I’m open to better names.)
In a tight deployment, we probably have too much leverage over the AIs' observations and beliefs to make credible commitments (cf. Owain Evans, Deluding AIs [ControlConf]). In a loose deployment, I think we can make credible commitments: if we haven't made the deals, then the AIs' deployment environment is too complex for us to easily simulate a faux I/O environment in which we have made the deals. Intuitively: we can compare a lab trying to fake an email from Wei Dai containing a promise with a lab trying to fake a video call with Wei Dai.
There’s another bonus for making credible deals which is that humans are actually pretty credible. And this fact is diffused throughout the training data in hard-to-fake ways.
Proposal for making credible commitments to AIs.
Which occurs first: a Dyson Sphere, or Real GDP increasing by 5x?
From 1929 to 2024, US Real GDP grew from 1.2 trillion to 23.5 trillion chained 2012 dollars, giving an average annual growth rate of 3.2%. At the historical 3.2% growth rate, global RGDP will have increased 5x within ~51 years (around 2076).
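Spelling out the arithmetic (treating 1929 to 2024 as 95 years):

$$g=\left(\tfrac{23.5}{1.2}\right)^{1/95}-1\approx 3.2\%,\qquad t=\frac{\ln 5}{\ln 1.032}\approx 51\ \text{years}.$$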
We’ll operationalize a Dyson Sphere as follows: the total power consumption of humanity exceeds 17 exawatts, which is roughly 100x the total solar power reaching Earth, and 1,000,000x the current total power consumption of humanity.
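Sanity-checking those multipliers: the solar constant is about 1361 W/m², Earth's radius is about 6371 km, and humanity currently uses roughly 18–20 TW of primary power, so

$$1361\ \tfrac{\mathrm{W}}{\mathrm{m}^2}\times \pi\,(6.371\times10^{6}\ \mathrm{m})^2 \approx 1.7\times10^{17}\ \mathrm{W} = 0.17\ \mathrm{EW},$$

and 100 × 0.17 EW ≈ 17 EW, which is about a million times ~18 TW.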
Personally, I think people overestimate the difficulty of the Dyson Sphere compared to 5x in RGDP. I recently made a bet with Prof. Gabe Weil, who bet on 5x RGDP before Dyson Sphere.
most tokens in a correct answer),
typo: most tokens in an incorrect answer
Yep, you might be right about the distal/proximal cut-off. I think that the Galaxy-brained value systems will end up controlling most of the distant future simply because they have a lower time-preference for resources. Not sure where the cut-off will be.
For similar reasons, I don’t think we should do a bunch of galaxy-brained acausal decision theory to achieve our mundane values, because the mundane values don’t care about counterfactual worlds.
There are two moral worldviews:
Mundane Mandy: ordinary conception of what a "good world" looks like, i.e. your friends and family living flourishing lives in their biological bodies, with respect for "sacred" goods
Galaxy-brain Gavin: transhumanist, longtermist, scope-sensitive, risk-neutral, substrate-indifferent, impartial
I think Mundane Mandy should have the proximal lightcone (anything within 1 billion light-years) and Galaxy-brain Gavin should have the distal lightcone (anything from 1 to 45 billion light-years). This seems like a fair trade.
The Hash Game: Two players alternate choosing an 8-bit number. After 40 turns, the numbers are concatenated and hashed. If the hash is 0 then Player 1 wins, otherwise Player 2 wins. That is, Player 1 wins if $\mathrm{hash}(x_1 \| x_2 \| \cdots \| x_{40}) = 0$, where $x_i$ is the number chosen on turn $i$. The Hash Game has the same branching factor and duration as chess, but there's probably no way to play this game well without brute-forcing the minimax algorithm.
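Here's a minimal sketch of the game. The hash function and the exact win condition are left open above, so this assumes SHA-256 and treats "hash is 0" as "the first byte of the digest is zero":

```python
import hashlib
import random

def player1_wins(moves: list[int]) -> bool:
    """Moves are the 40 8-bit numbers chosen alternately by the two players."""
    assert len(moves) == 40 and all(0 <= m < 256 for m in moves)
    digest = hashlib.sha256(bytes(moves)).digest()  # concatenate the bytes, then hash
    return digest[0] == 0  # "hash is 0", operationalised here as a zero leading byte

# Random play, just to exercise the rules. Playing well would mean searching a
# game tree with 256^40 leaves, with no structure to exploit short of minimax.
moves = [random.randrange(256) for _ in range(40)]
print("Player 1 wins" if player1_wins(moves) else "Player 2 wins")
```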
Yep, my point is that there's no physical notion of being "offered" a menu of lotteries which doesn't leak information. IIA will not be satisfied by any physical process which corresponds to presenting the decision-maker with a menu of options. Happy to discuss any specific counter-example.
Of course, you can construct a mathematical model of the physical process, and this model might be an informative object of study, but it would be begging the question if the mathematical model baked in IIA somewhere.
Must humans obey the Axiom of Irrelevant Alternatives?
Suppose you would choose option A from options A and B. Then you wouldn't choose option B from options A, B, and C. Roughly speaking, whether you prefer option A or B is independent of whether I offer you an irrelevant option C. This is an axiom of rationality called independence of irrelevant alternatives (IIA). Should humans follow this? Maybe not.
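One way to write the axiom formally, with $C(S)$ denoting the set of options you'd pick from menu $S$:

$$C(\{A,B\}) = \{A\} \;\Longrightarrow\; B \notin C(\{A,B,C\}).$$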
Maybe C includes additional information which makes it clear that B is better than A.
Consider the following options:
(A) £10 bet that 1+1=2
(B) £30 bet that the smallest prime factor in 1019489 ends in the digit 1
Now, I would prefer A to B. Firstly, if 1019489 is itself prime then I lose the bet. Secondly, if 1019489 isn't prime, then there's roughly a 25% chance that its smallest prime factor ends in 1. That's because all prime numbers greater than 5 end in 1, 3, 7 or 9, and Dirichlet's theorem states that primes are equally distributed among these possible endings. So the chance of winning the bet is slightly less than 25%, and £10 is better than a 25% chance of winning £30. Presented with this menu, I would probably choose option A.
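Under risk-neutrality the comparison is just expected value:

$$0.25 \times \pounds 30 = \pounds 7.50 < \pounds 10.$$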
But now consider the following options:
(A) £10 bet that 1+1=2
(B) £30 bet that the smallest prime factor in 1019489 ends in the digit 1
(C) £20 bet that 1019489 = 71 * 83 * 173
So, which is the best option? B is preferable to C, because B has both a weaker condition and a higher payout. And C is preferable to A: my credence that 1019489 = 71 * 83 * 173 is higher than 50%, and a better-than-even chance of winning £20 beats a sure £10. Presented with this menu, I would probably choose option B.
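For the record, the factorisation in option C does check out:

$$71 \times 83 \times 173 = 5893 \times 173 = 1{,}019{,}489.$$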
Clymer et al. injected noise into the activations of Llama-13B and were able to correctly classify models that were faking alignment in 98% of model pairs (out of 324 combinations of pairs).
What's the apples-to-apples comparison? I.e. the recovery_rate after steering with a random vector at layer 50 of llama-3.3-70B-instruct on this particular dataset.
This metric also ignores invalid answers (refusals or gibberish).
If you don’t ignore invalid answers, do the results change significantly?
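For concreteness, here's a minimal sketch (mine, not from the post) of the random-vector baseline asked about above: add a random vector to the residual stream at one layer via a forward hook, then re-run the eval and measure the recovery rate. The checkpoint name, layer index, and scale are assumptions to be swept over.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.3-70B-Instruct"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(model_name)   # needed for running the eval prompts
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

layer_idx, scale = 50, 8.0  # assumptions: which layer to steer at, and how hard
d_model = model.config.hidden_size
steer = torch.randn(d_model) * scale / d_model ** 0.5  # random steering vector

def add_random_vector(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    # Broadcasts the same vector over batch and sequence positions.
    hidden = hidden + steer.to(device=hidden.device, dtype=hidden.dtype)
    return (hidden,) + tuple(output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[layer_idx].register_forward_hook(add_random_vector)
# ... run the sandbagging eval here and compute recovery_rate as defined in the post ...
handle.remove()
```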
Can SAE steering reveal sandbagging?
the scope insensitive humans die and their society is extinguished
Ah, your reaction makes more sense given you think this is the proposal. But it’s not the proposal. The proposal is that the scope-insensitive values flourish on Earth, and the scope-sensitive values flourish in the remaining cosmos.
As a toy example, imagine a distant planet with two species of alien: paperclip-maximisers and teacup-protectors. If you offer a lottery to the paperclip-maximisers, they will choose the lottery with the highest expected number of paperclips. If you offer a lottery to the teacup-protectors, they will choose the lottery with the highest chance of preserving their holy relic, which is a particular teacup.
The paperclip-maximisers and the teacup-protectors both own property on the planet. They negotiate the following deal: the paperclip-maximisers will colonise the cosmos, but leave the teacup-protectors a small sphere around their home planet (e.g. 100 light-years across). Moreover, the paperclip-maximisers promise not to do anything that risks their teacup, e.g. choosing a lottery that doubles the size of the universe with 60% chance and destroys the universe with 40% chance.
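To see why that example lottery divides the two species, write $N$ for the number of paperclips the maximisers expect without the gamble, and assume paperclips scale with the size of the universe:

$$\mathbb{E}[\text{paperclips}] = 0.6 \times 2N + 0.4 \times 0 = 1.2N > N, \qquad \Pr[\text{teacup survives}] = 0.6 < 1.$$

So the gamble looks good to the paperclip-maximisers, but it's exactly the kind of thing they've promised the teacup-protectors they won't do.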
Do you have intuitions that the paperclip-maximisers are exploiting the teacup-protectors in this deal?
Do you think instead that the paperclip-maximisers should fill the universe with half paperclips and half teacups?
I think this scenario is a better analogy than the scenario with the drought. In the drought scenario, there is an objective fact which the nearby villagers are ignorant of, and they would act differently if they knew this fact. But I don't think scope-sensitivity is a fact like "there will be a drought in 10 years". Rather, scope-sensitivity is a property of a utility function (or a value system, more generally).
Diary of a Wimpy Kid, a children's book by Jeff Kinney published in April 2007 and preceded by an online version in 2004, contains a scene that feels oddly prescient about contemporary AI alignment research. (Skip to the paragraph in italics.)
There are, of course, many differences with contemporary AI alignment research.