DMs open.
Cleo Nardo
I’ve skimmed the business proposal.
The healthcare agents advise patients on which information to share with their doctor, and advise doctors on which information to solicit from their patients.
This seems agnostic between mental and physiological health.
Thanks for putting this together — very useful!
If I understand correctly, the maximum entropy prior will be the uniform prior, which gives rise to Laplace’s law of succession, at least if we’re using the standard definition of entropy below:
$$H[f] = -\int_0^1 f(p)\,\log f(p)\,dp$$
But this definition is somewhat arbitrary, because the “$dp$” term assumes that there’s something special about parameterising the distribution with its probability, as opposed to different parameterisations (e.g. its odds, its log-odds, etc.). The Jeffreys prior is supposed to be invariant to different parameterisations, which is why people like it.
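For concreteness, the two textbook facts being leaned on here (nothing specific to this thread): under a reparameterisation $\theta = g(p)$, the entropy above picks up a Jacobian term, so the maximum-entropy prior depends on the parameterisation, whereas the Jeffreys recipe gives back the same distribution in any parameterisation:
$$H[f_\theta] = H[f_p] + \mathbb{E}_{p \sim f_p}\big[\log |g'(p)|\big], \qquad \pi_{\mathrm{Jeffreys}}(p) \propto \sqrt{I(p)} = \frac{1}{\sqrt{p(1-p)}} \;\;\text{(i.e. } \mathrm{Beta}(\tfrac{1}{2},\tfrac{1}{2}) \text{ for a Bernoulli parameter).}$$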
But my complaint is more Solomonoff-ish. The prior should put more weight on simple distributions, i.e. probability distributions that are described by short probabilistic programs. Such a prior would better match our intuitions about what probabilities arise in real-life stochastic processes. The best prior is the Solomonoff prior, but that’s intractable. I think my prior is the most tractable prior that resolves the most egregious anti-Solomonoff problems with the Laplace/Jeffreys priors.
You raise a good point. But I think the choice of prior is important quite often:
In the limit of large i.i.d. data (N>1000), both Laplace’s Rule and my prior will give the same answer. But so too does the simple frequentist estimate n/N. The original motivation of Laplace’s Rule was in the small N regime, where the frequentist estimate is clearly absurd.
In the small data regime (N<15), the prior matters. Consider observing 12 successes in a row: Laplace’s Rule: P(next success) = 13/14 ≈ 92.9%. My proposed prior (with point masses at 0 and 1): P(next success) ≈ 98%, which better matches my intuition about potentially deterministic processes.
When making predictions far beyond our observed data, the likelihood of extreme underlying probabilities matters a lot. For example, after seeing 12⁄12 successes, how confident should we be in seeing a quadrillion more successes? Laplace’s uniform prior assigns this very low probability, while my prior gives it significant weight.
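As a quick sanity-check of the 12-successes example, here’s a short Python sketch comparing the two posterior predictives. The point-mass weights (10% each on “always fails” and “always succeeds”) are an illustrative assumption rather than anything pinned down above; with them the mixture predictive comes out around 97-98%:

```python
# Posterior predictive after n successes in N trials, under (a) Laplace's
# uniform prior and (b) a mixture prior: point masses at p=0 and p=1 plus a
# uniform component. The mixture weights w0, w1 are illustrative assumptions.
from math import comb

def laplace_predictive(n, N):
    # Laplace's rule of succession: P(next success) = (n + 1) / (N + 2)
    return (n + 1) / (N + 2)

def mixture_predictive(n, N, w0=0.1, w1=0.1):
    wu = 1.0 - w0 - w1
    # Likelihood of the observed sequence under each mixture component.
    lik0 = 1.0 if n == 0 else 0.0              # point mass at p = 0
    lik1 = 1.0 if n == N else 0.0              # point mass at p = 1
    liku = 1.0 / ((N + 1) * comb(N, n))        # uniform prior: Beta integral
    evidence = w0 * lik0 + w1 * lik1 + wu * liku
    # Point masses predict success with probability 0 or 1; the uniform
    # component predicts via Laplace's rule. Weight by posterior probability.
    return (w1 * lik1 * 1.0 + wu * liku * laplace_predictive(n, N)) / evidence

print(laplace_predictive(12, 12))   # 13/14 ≈ 0.929
print(mixture_predictive(12, 12))   # ≈ 0.973 with these weights
```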
Rethinking Laplace’s Rule of Succession
Hinton legitimizes the AI safety movement
Hmm. He seems pretty peripheral to the AI safety movement, especially compared with (e.g.) Yoshua Bengio.
Hey TurnTrout.
I’ve always thought of your shard theory as something like path-dependence? For example, a human is more excited about making plans with their friend if they’re currently talking to their friend. You mentioned this in a talk as evidence that shard theory applies to humans. Basically, the shard “hang out with Alice” is weighted higher in contexts where Alice is nearby.
Let’s say $\pi$ is a policy with state space $S$ and action space $A$.
A “context” is a small moving window in the state-history, i.e. an element of $C = S^k$ where $k$ is a small positive integer.
A shard is something like $u_i : S \times A \to \mathbb{R}$, i.e. it evaluates actions given particular states.
The shards are “activated” by contexts, i.e. $\alpha_i : C \to \mathbb{R}_{\geq 0}$ maps each context to the amount that shard $i$ is activated by the context.
The total activation of shard $i$, given a history $h = (s_1, \dots, s_t)$, is given by the time-decay average of the activation across the contexts, i.e. $A_i(h) = \frac{\sum_{\tau \leq t} \gamma^{t-\tau}\,\alpha_i(c_\tau)}{\sum_{\tau \leq t} \gamma^{t-\tau}}$, where $c_\tau$ is the context ending at time $\tau$ and $\gamma \in (0,1)$ is the decay rate.
The overall utility function is the activation-weighted average of the shards, i.e. $U(h, a) = \frac{\sum_i A_i(h)\,u_i(s_t, a)}{\sum_i A_i(h)}$.
Finally, the policy will maximise the utility function, i.e. $\pi(h) \in \arg\max_{a \in A} U(h, a)$.
Is this what you had in mind?
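Here’s a minimal toy sketch of that formalisation in Python, just to make the moving parts concrete. The two shards, their activations, and the decay rate are all invented for illustration (and contexts are single states, i.e. $k=1$); nothing here is taken from your posts:

```python
# Toy model: shards evaluate (state, action) pairs, activations depend on
# recent contexts, and the policy argmaxes the activation-weighted utility.
GAMMA = 0.5  # time-decay factor for averaging activations over the history

def shard_hangout(state, action):     # u_1: values hanging out with Alice
    return 1.0 if action == "call_alice" else 0.0

def shard_work(state, action):        # u_2: values getting work done
    return 1.0 if action == "write_report" else 0.0

def act_hangout(context):             # alpha_1: activated when Alice is around
    return 1.0 if context == "talking_to_alice" else 0.1

def act_work(context):                # alpha_2: activated at the desk
    return 1.0 if context == "at_desk" else 0.1

SHARDS = [(shard_hangout, act_hangout), (shard_work, act_work)]
ACTIONS = ["call_alice", "write_report"]

def total_activation(act_fn, history):
    # Time-decayed average of the shard's activation across past contexts.
    weights = [GAMMA ** (len(history) - 1 - t) for t in range(len(history))]
    return sum(w * act_fn(c) for w, c in zip(weights, history)) / sum(weights)

def utility(history, action):
    # Activation-weighted average of the shards' evaluations of this action.
    activations = [total_activation(act, history) for _, act in SHARDS]
    values = [shard(history[-1], action) for shard, _ in SHARDS]
    return sum(a * v for a, v in zip(activations, values)) / sum(activations)

def policy(history):
    # The policy maximises the context-weighted utility function.
    return max(ACTIONS, key=lambda a: utility(history, a))

print(policy(["at_desk", "talking_to_alice"]))  # -> call_alice
print(policy(["talking_to_alice", "at_desk"]))  # -> write_report
```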
Why do you care that Geoffrey Hinton worries about AI x-risk?
Why do so many people in this community care that Hinton is worried about x-risk from AI?
Do people mention Hinton because they think it’s persuasive to the public?
Or persuasive to the elites?
Or do they think that Hinton being worried about AI x-risk is strong evidence for AI x-risk?
If so, why?
Is it because he is so intelligent?
Or because you think he has private information or intuitions?
Do you think he has good arguments in favour of AI x-risk?
Do you think he has a good understanding of the problem?
Do you update more on Hinton’s views than on Yann LeCun’s?
I’m inspired to write this because Hinton and Hopfield were just announced as the winners of the Nobel Prize in Physics. But I’ve been confused about these questions ever since Hinton went public with his worries. These questions are sincere (i.e. non-rhetorical), and I’d appreciate help on any/all of them. The phenomenon I’m confused about includes the other “Godfathers of AI” here as well, though Hinton is the main example.
Personally, I’ve updated very little on either LeCun’s or Hinton’s views, and I’ve never mentioned either person in any object-level discussion about whether AI poses an x-risk. My current best guess is that people care about Hinton only because it helps with public/elite outreach. This explains why activists tend to care more about Geoffrey Hinton than researchers do.
This is a Trump/Kamala debate from two LW-ish perspectives: https://www.youtube.com/watch?v=hSrl1w41Gkk
the base model is just predicting the likely continuation of the prompt. and it’s a reasonable prediction that, when an assistant is given a harmful instruction, they will refuse. this behaviour isn’t surprising.
it’s quite common for assistants to refuse instructions, especially harmful instructions. so i’m not surprised that base llms systematically refuse harmful instructions more than harmless ones.
yep, something like more carefulness, less “playfulness” in the sense of [Please don’t throw your mind away by TsviBT]. maybe bc AI safety is more professionalised nowadays. idk.
thanks for the thoughts. i’m still trying to disentangle what exactly i’m pointing at.
I don’t intend “innovation” to mean something normative like “this is impressive” or “this is research I’m glad happened” or anything. i mean something more low-level, almost syntactic. more like “here’s a new idea everyone is talking about”. this idea might be a threat model, or a technique, or a phenomenon, or a research agenda, or a definition, or whatever.
like, imagine your job was to maintain a glossary of terms in AI safety. i feel like new terms used to emerge quite often, but not any more (i.e. not for the past 6-12 months). do you think this is fair? i’m not sure how worrying this is, but i haven’t noticed others mentioning it.
NB: here’s 20 random terms I’m imagining included in the dictionary:
Evals
Mechanistic anomaly detection
Steganography
Glitch token
Jailbreaking
RSPs
Model organisms
Trojans
Superposition
Activation engineering
CCS
Singular Learning Theory
Grokking
Constitutional AI
Translucent thoughts
Quantilization
Cyborgism
Factored cognition
Infrabayesianism
Obfuscated arguments
I’ve added a fourth section to my post. It operationalises “innovation” as “non-transient novelty”. Some representative examples of an innovation would be:
I think these articles were non-transient and novel.
(1) Has AI safety slowed down?
There haven’t been any big innovations for 6-12 months. At least, it looks like that to me. I’m not sure how worrying this is, but I haven’t noticed others mentioning it. Hoping to get some second opinions.
Here’s a list of live agendas someone made on 27th Nov 2023: Shallow review of live agendas in alignment & safety. I think this covers all the agendas that exist today. Didn’t we use to get a whole new line-of-attack on the problem every couple months?
By “innovation”, I don’t mean something normative like “This is impressive” or “This is research I’m glad happened”. Rather, I mean something more low-level, almost syntactic, like “Here’s a new idea everyone is talking about”. This idea might be a threat model, or a technique, or a phenomenon, or a research agenda, or a definition, or whatever.
Imagine that your job was to maintain a glossary of terms in AI safety.[1] I feel like you would’ve been adding new terms quite consistently from 2018-2023, but things have dried up in the last 6-12 months.
(2) When did AI safety innovation peak?
My guess is Spring 2022, during the ELK Prize era. I’m not sure though. What do you guys think?
(3) What’s caused the slow down?
Possible explanations:
ideas are harder to find
people feel less creative
people are more cautious
more publishing in journals
research is now closed-source
we lost the mandate of heaven
the current ideas are adequate
paul christiano stopped posting
i’m mistaken, innovation hasn’t stopped
something else
(4) How could we measure “innovation”?
By “innovation” I mean non-transient novelty. An article is “novel” if it uses n-grams that previous articles didn’t use, and an article is “transient” if it uses n-grams that subsequent articles didn’t use. Hence, an article is non-transient and novel if it introduces a new n-gram which sticks around. For example, Gradient Hacking (Evan Hubinger, October 2019) was an innovative article, because the n-gram “gradient hacking” doesn’t appear in older articles, but appears often in subsequent articles. See below.
Barron et al. (2017) analysed 40,000 parliament speeches during the French Revolution. They introduce a metric, “resonance”, which is novelty (surprise of an article given the past articles) minus transience (surprise of an article given the subsequent articles). See below.
My claim is that recent AI safety research has been less resonant.
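For concreteness, here’s a rough Python sketch of the n-gram version of this metric. The tiny corpus and the choice of bigrams are placeholder assumptions, and Barron et al. actually measure novelty and transience with KL divergences between topic distributions rather than raw n-gram overlap:

```python
# "Resonance" = novelty - transience, computed here from n-gram overlap with
# earlier vs later articles in a date-ordered corpus. This is a crude stand-in
# for the surprise-based (KL divergence) definitions in Barron et al.
def ngrams(text, n=2):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def resonance(articles, i, n=2):
    target = ngrams(articles[i], n)
    past = set().union(*[ngrams(a, n) for a in articles[:i]]) if i > 0 else set()
    future = set().union(*[ngrams(a, n) for a in articles[i + 1:]]) if i < len(articles) - 1 else set()
    novelty = len(target - past) / len(target)       # share of n-grams no earlier article used
    transience = len(target - future) / len(target)  # share of n-grams no later article used
    return novelty - transience                      # innovative = novel but not transient

# Toy date-ordered corpus: the second article introduces phrases that the
# later articles keep using, so it should come out as the most resonant.
corpus = [
    "agents might be deceptively aligned during training",
    "gradient hacking means a model hacks its own gradient updates",
    "gradient hacking means a model hacks its own training signal",
    "detecting gradient hacking means auditing a model",
]
print([round(resonance(corpus, i), 2) for i in range(len(corpus))])
```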
[1] Here are 20 random terms that would be in the glossary, to illustrate what I mean:
Evals
Mechanistic anomaly detection
Steganography
Glitch token
Jailbreaking
RSPs
Model organisms
Trojans
Superposition
Activation engineering
CCS
Singular Learning Theory
Grokking
Constitutional AI
Translucent thoughts
Quantilization
Cyborgism
Factored cognition
Infrabayesianism
Obfuscated arguments
I don’t understand the s-risk consideration.
Suppose Alice lives naturally for 100 years and is cremated. And suppose Bob lives naturally for 40 years, then has his brain frozen for 60 years, and then has his brain cremated. The odds that Bob gets tortured by a spiteful AI should be pretty much exactly the same as for Alice. Basically, it’s the odds that spiteful AIs appear before 2034.
Thanks Tamsin! Okay, round 2.
My current understanding of QACI:
1. We assume a set $H$ of hypotheses about the world. We assume the oracle’s beliefs are given by a probability distribution $\mu \in \Delta(H)$.
2. We assume sets $Q$ and $A$ of possible queries and answers respectively. Maybe these are exabyte files, i.e. $Q = A = \{0,1\}^n$ for $n = 8 \times 10^{18}$.
3. Let $F$ be the set of mathematical formulae that Joe might submit. These formulae are given semantics $[\![f]\!] : H \to \Delta(A)$ for each formula $f \in F$.[1]
4. We assume a function $J : Q \times H \to \Delta(F)$, where $J(f \mid q, h)$ is the probability that Joe submits formula $f$ after reading query $q$, under hypothesis $h$.[2]
5. We define $\mathrm{QACI} : Q \to \Delta(A)$ as follows: sample $h \sim \mu$, then sample $f \sim J(q, h)$, then return $a \sim [\![f]\!](h)$.
6. For a fixed hypothesis $h$, we can interpret the answer $a \in A$ as a utility function $u^h_a : \Pi \to \mathbb{R}$ via some semantics $(a, h) \mapsto u^h_a$.
7. Then we define $U : \Pi \to \mathbb{R}$ via integrating over $H$, i.e. $U(\pi) = \mathbb{E}_{h \sim \mu}\,\mathbb{E}_{f \sim J(q_0, h)}\,\mathbb{E}_{a \sim [\![f]\!](h)}\big[u^h_a(\pi)\big]$, where $q_0$ is the fixed query we actually ask.
8. A policy $\pi^* \in \Pi$ is optimal if and only if $\pi^* \in \arg\max_{\pi \in \Pi} U(\pi)$.
The hope is that $\mu$, $J$, $[\![\cdot]\!]$, and $u$ can be defined mathematically. Then the optimality condition can be defined mathematically.
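To check that the types line up, here’s a toy Python sketch of steps 1-8. Everything in it (the hypothesis set, Joe’s formula-choosing function, the formula semantics, the answer-to-utility interpretation) is an arbitrary stand-in for objects that QACI treats as idealised and uncomputable:

```python
import random

# Toy stand-ins: hypotheses are integers, "formulae" and "answers" are small
# integers, and the utility function an answer encodes is an arbitrary scoring
# rule over a two-element policy set. None of this is the real construction.
HYPOTHESES = [0, 1, 2, 3]            # H, with mu uniform over it
POLICIES = ["policy_A", "policy_B"]  # Pi

def sample_hypothesis():             # h ~ mu
    return random.choice(HYPOTHESES)

def joe(q, h):                       # f ~ J(q, h): which formula Joe submits
    return hash((q, h)) % 5

def semantics(f, h):                 # [[f]](h): the answer the formula denotes
    return (f + h) % 3

def utility_from_answer(a, h):       # u^h_a : Pi -> R
    return lambda pi: float(a) if pi == "policy_A" else float(h)

def qaci(q):
    """Step 5: sample h ~ mu, then f ~ J(q, h), then return the answer (and h)."""
    h = sample_hypothesis()
    f = joe(q, h)
    return semantics(f, h), h

def U(pi, q="what should we value?", samples=10_000):
    """Step 7: Monte-Carlo stand-in for integrating u^h_a(pi) over mu."""
    total = 0.0
    for _ in range(samples):
        a, h = qaci(q)
        total += utility_from_answer(a, h)(pi)
    return total / samples

# Step 8: argmax over the (toy, enumerable) policy set.
print(max(POLICIES, key=U))
```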
Question 0
What if there’s no policy which maximises $U$? That is, for every policy $\pi$ there is another policy $\pi'$ such that $U(\pi') > U(\pi)$. I suppose this is less worrying, but what if there are multiple policies which maximise $U$?
Question 1
In Step 7 above, you average all the utility functions together, whereas I suggested sampling a utility function. I think my solution might be safer.
Suppose the oracle puts 5% chance on hypotheses $h$ such that $J(q_0, h)$ is malign. I think this is pretty conservative, because the Solomonoff predictor is malign, and because of some of the concerns Evhub raises here. And the QACI amplification might not preserve benignity. It follows that, under your solution, $U$ is influenced by a coalition of malign agents, and similarly the optimal policy $\pi^*$ is influenced by the malign coalition.
By contrast, I suggest sampling $h \sim \mu$ and then finding a policy which maximises the utility function $u^h_a$ for that single hypothesis. This should give us a benign policy with 95% chance, which is pretty good odds. Is this safer? Not sure.
Question 2
I think the function $[\![\cdot]\!]$ doesn’t work, i.e. there won’t be a way to mathematically define the semantics of the formula language. In particular, the formula language must be strictly weaker than the meta-language in which you are hoping to define $[\![\cdot]\!]$ itself. This is because of Tarski’s Undefinability of Truth (and other no-go theorems).
This might seem pedantic, but it matters in practical terms: there’s no formula whose semantics is QACI itself. You can see this via a diagonal proof: imagine if Joe always writes a formal expression that diagonalises against QACI’s own answer.
The most elegant solution is probably transfinite induction, but this would give us a separate $\mathrm{QACI}_\alpha$ for each ordinal $\alpha$.
Question 3
If you have an ideal reasoner, why bother with reward functions when you can just straightforwardly do untractable-to-naively-compute utility functions?
I want to understand how QACI and prosaic ML map onto each other. As far as I can tell, issues with QACI will be analogous to issues with prosaic ML and vice-versa.
Question 4
I still don’t understand why we’re using QACI to describe a utility function over policies, rather than using QACI in a more direct approach.
Here’s one approach. We pick a policy $\pi$ which maximises the score $\mathrm{QACI}(q_\pi)$, where $q_\pi$ is a query describing the policy $\pi$ (say, “How good is the policy $\pi$?”).[3] The advantage here is that Joe doesn’t need to reason about utility functions over policies, he just needs to reason about a single policy in front of him.
Here’s another approach. We use QACI as our policy directly. That is, in each context $c$ that the agent finds themselves in, they sample an action from $\mathrm{QACI}(q_c)$, where $q_c$ is a query describing the context $c$, and take the resulting action.[4] The advantage here is that Joe doesn’t need to reason about policies whatsoever, he just needs to reason about a single context in front of him. This is also the most “human-like” approach, because there are no argmaxes (except if Joe submits a formula with an argmax).
Here’s another approach. In each context $c$, the agent takes the action $a$ which maximises $\mathrm{QACI}(q_{c,a})$, where $q_{c,a}$ is a query asking how good action $a$ is in context $c$.
Etc.
Happy to jump on a call if that’s easier.
[1] I think you would say $[\![f]\!] : H \to A$. I’ve added the $\Delta$, which simply amounts to giving Joe access to a random number generator. My remarks apply if $[\![f]\!] : H \to A$ also.
[2] I think you would say $J : Q \times H \to F$. I’ve added the $\Delta$, which simply amounts to including hypotheses in which Joe is stochastic. But my remarks apply if $J : Q \times H \to F$ also.
[3] By this I mean either:
(1) Sample $h \sim \mu$, then maximise the function $\pi \mapsto \mathbb{E}\big[\mathrm{QACI}(q_\pi) \mid h\big]$.
(2) Maximise the function $\pi \mapsto \mathbb{E}_{h \sim \mu}\,\mathbb{E}\big[\mathrm{QACI}(q_\pi) \mid h\big]$.
For reasons I mentioned in Question 1, I suspect (1) is safer, but (2) is closer to your original approach.
[4] I would prefer the agent samples $h \sim \mu$ once at the start of deployment, and reuses the same hypothesis at each time-step. I suspect this is safer than resampling at each time-step, for reasons discussed before.
First, proto-languages are not attested. This means that we have no example of writing in any proto-language.
A parent language is typically called “proto-” if the comparative method is our primary evidence about it — i.e. the term is (partially) epistemological metadata.
Proto-Celtic has no direct attestation whatsoever.
Proto-Norse (the parent of Icelandic, Danish, Norwegian, Swedish, etc) is attested, but the written record is pretty scarce, just a few inscriptions.
Proto-Romance (the parent of French, Italian, Spanish, etc) has an extensive written record. More commonly known as “Latin”.
I think the existence of Latin as Proto-Romance has an important epistemological upshot:
Let’s say we want to estimate how accurately we have reconstructed Proto-Celtic. Well, we can apply the same method used to reconstruct Proto-Celtic to reconstructing Proto-Romance. We can then evaluate our reconstruction of Proto-Romance against the written record of Latin. This gives us an estimate of how well our Proto-Celtic reconstruction would fare if we discovered a written record tomorrow.
I want to better understand how QACI works, and I’m gonna try Cunningham’s Law. @Tamsin Leake.
QACI works roughly like this:
1. We find a competent honourable human $H$, like Joe Carlsmith or Wei Dai, and give them a rock engraved with a 2048-bit secret key. We define $H^+$ as the serial composition of a bajillion copies of $H$.
2. We want a model $M$ of the agent $H^+$. In QACI, we get $M$ by asking a Solomonoff-like ideal reasoner for their best guess about $H^+$ after feeding them a bunch of data about the world and the secret key.
3. We then ask $M$ the question “What’s the best reward function to maximise?” to get a reward function $r$. We then train a policy $\pi$ to maximise the reward function $r$. In QACI, we use some perfect RL algorithm. If we’re doing model-free RL, then $\pi$ might be AIXI (plus some patches). If we’re doing model-based RL, then $\pi$ might be the argmax over expected discounted utility, but I don’t know where we’d get the world-model — maybe we ask $M$?
So, what’s the connection between the final policy $\pi$ and the competent honourable human $H$? Well, overall, $\pi$ maximises a reward function specified by the ideal reasoner’s estimation of the serial composition of a bajillion copies of $H$. Hmm.
Questions:
Is this basically IDA, where Step 1 is serial amplification, Step 2 is imitative distillation, and Step 3 is reward modelling?
Why not replace Step 1 with Strong HCH or some other amplification scheme?
What does “bajillion” actually mean in Step 1?
Why are we doing Step 3? Wouldn’t it be better to just use $M$ directly as our superintelligence? It seems sufficient to achieve radical abundance, life extension, existential security, etc.
What if there’s no reward function that should be maximised? Presumably the reward function $r$ would need to be “small”, i.e. less than an exabyte, which imposes a maybe-unsatisfiable constraint.
Why not ask $M$ for the policy $\pi$ directly? Or for some instruction for constructing $\pi$? The instruction could be “Build the policy using our super-duper RL algo with the following reward function...”, but it could be anything.
Why is there no iteration, like in IDA? For example, after Step 2, we could loop back to Step 1 but reassign $H$ as $H$ with oracle access to $M$.
Why isn’t Step 3 recursive reward modelling? I.e. we could collect a bunch of trajectories from $\pi$ and ask $M$ to use those trajectories to improve the reward function $r$.
I’m very confused about current AI capabilities and I’m also very confused why other people aren’t as confused as I am. I’d be grateful if anyone could clear up either of these confusions for me.
How is it that AI is seemingly superhuman on benchmarks, but also pretty useless?
For example:
o3 scores higher on FrontierMath than the top graduate students
No current AI system could generate a research paper that would receive anything but the lowest possible score from each reviewer
If either of these statements is false (they might be—I haven’t been keeping up on AI progress), then please let me know. If the observations are true, what the hell is going on?
If I were trying to forecast AI progress in 2025, I would be spending all my time trying to reconcile these two observations.