DMs open.
Cleo Nardo
I’ve skimmed the business proposal.
The healthcare agents advise patients on which information to share with their doctor, and advise doctors on which information to solicit from their patients.
This seems agnostic between mental and physiological health.
Thanks for putting this together — very useful!
If I understand correctly, the maximum entropy prior will be the uniform prior, which gives rise to Laplace’s law of succession, at least if we’re using the standard definition of entropy below:
$$H[f] = -\int_0^1 f(p)\,\log f(p)\,dp$$
But this definition is somewhat arbitrary, because the “$dp$” term assumes that there’s something special about parameterising the distribution with its probability, as opposed to different parameterisations (e.g. its odds, its log-odds, etc.). The Jeffreys prior is supposed to be invariant to different parameterisations, which is why people like it.
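For concreteness, the two textbook facts being leaned on here (nothing specific to this thread): under a reparameterisation $\theta = g(p)$, the entropy above picks up a Jacobian term, so the maximum-entropy prior depends on the parameterisation, whereas the Jeffreys recipe gives back the same distribution in any parameterisation:
$$H[f_\theta] = H[f_p] + \mathbb{E}_{p \sim f_p}\big[\log |g'(p)|\big], \qquad \pi_{\mathrm{Jeffreys}}(p) \propto \sqrt{I(p)} = \frac{1}{\sqrt{p(1-p)}} \;\;\text{(i.e. } \mathrm{Beta}(\tfrac{1}{2},\tfrac{1}{2}) \text{ for a Bernoulli parameter).}$$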
But my complaint is more Solomonoff-ish. The prior should put more weight on simple distributions, i.e. probability distributions that are described by short probabilistic programs. Such a prior would better match our intuitions about what probabilities arise in real-life stochastic processes. The best prior is the Solomonoff prior, but that’s intractable. I think my prior is the most tractable prior that resolves the most egregious anti-Solomonoff problems with the Laplace/Jeffreys priors.
You raise a good point. But I think the choice of prior is important quite often:
In the limit of large i.i.d. data (N>1000), both Laplace’s Rule and my prior will give the same answer. But so too does the simple frequentist estimate n/N. The original motivation of Laplace’s Rule was in the small N regime, where the frequentist estimate is clearly absurd.
In the small data regime (N<15), the prior matters. Consider observing 12 successes in a row: Laplace’s Rule: P(next success) = 13/14 ≈ 92.9%. My proposed prior (with point masses at 0 and 1): P(next success) ≈ 98%, which better matches my intuition about potentially deterministic processes.
When making predictions far beyond our observed data, the likelihood of extreme underlying probabilities matters a lot. For example, after seeing 12⁄12 successes, how confident should we be in seeing a quadrillion more successes? Laplace’s uniform prior assigns this very low probability, while my prior gives it significant weight.
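As a quick sanity-check of the 12-successes example, here’s a short Python sketch comparing the two posterior predictives. The point-mass weights (10% each on “always fails” and “always succeeds”) are an illustrative assumption rather than anything pinned down above; with them the mixture predictive comes out around 97-98%:

```python
# Posterior predictive after n successes in N trials, under (a) Laplace's
# uniform prior and (b) a mixture prior: point masses at p=0 and p=1 plus a
# uniform component. The mixture weights w0, w1 are illustrative assumptions.
from math import comb

def laplace_predictive(n, N):
    # Laplace's rule of succession: P(next success) = (n + 1) / (N + 2)
    return (n + 1) / (N + 2)

def mixture_predictive(n, N, w0=0.1, w1=0.1):
    wu = 1.0 - w0 - w1
    # Likelihood of the observed sequence under each mixture component.
    lik0 = 1.0 if n == 0 else 0.0              # point mass at p = 0
    lik1 = 1.0 if n == N else 0.0              # point mass at p = 1
    liku = 1.0 / ((N + 1) * comb(N, n))        # uniform prior: Beta integral
    evidence = w0 * lik0 + w1 * lik1 + wu * liku
    # Point masses predict success with probability 0 or 1; the uniform
    # component predicts via Laplace's rule. Weight by posterior probability.
    return (w1 * lik1 * 1.0 + wu * liku * laplace_predictive(n, N)) / evidence

print(laplace_predictive(12, 12))   # 13/14 ≈ 0.929
print(mixture_predictive(12, 12))   # ≈ 0.973 with these weights
```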
Rethinking Laplace’s Rule of Succession
Hinton legitimizes the AI safety movement
Hmm. He seems pretty peripheral to the AI safety movement, especially compared with (e.g.) Yoshua Bengio.
Hey TurnTrout.
I’ve always thought of your shard theory as something like path-dependence? For example, a human is more excited about making plans with their friend if they’re currently talking to their friend. You mentioned this in a talk as evidence that shard theory applies to humans. Basically, the shard “hang out with Alice” is weighted higher in contexts where Alice is nearby.
Let’s say $\pi$ is a policy with state space $S$ and action space $A$.
A “context” is a small moving window in the state-history, i.e. an element of $C = S^k$ where $k$ is a small positive integer.
A shard is something like $u_i : S \times A \to \mathbb{R}$, i.e. it evaluates actions given particular states.
The shards are “activated” by contexts, i.e. $\alpha_i : C \to \mathbb{R}_{\geq 0}$ maps each context to the amount that shard $i$ is activated by the context.
The total activation of shard $i$, given a history $h = (s_1, \dots, s_t)$, is given by the time-decay average of the activation across the contexts, i.e. $A_i(h) = \frac{\sum_{\tau \leq t} \gamma^{t-\tau}\,\alpha_i(c_\tau)}{\sum_{\tau \leq t} \gamma^{t-\tau}}$, where $c_\tau$ is the context ending at time $\tau$ and $\gamma \in (0,1)$ is the decay rate.
The overall utility function is the activation-weighted average of the shards, i.e. $U(h, a) = \frac{\sum_i A_i(h)\,u_i(s_t, a)}{\sum_i A_i(h)}$.
Finally, the policy will maximise the utility function, i.e. $\pi(h) \in \arg\max_{a \in A} U(h, a)$.
Is this what you had in mind?
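Here’s a minimal toy sketch of that formalisation in Python, just to make the moving parts concrete. The two shards, their activations, and the decay rate are all invented for illustration (and contexts are single states, i.e. $k=1$); nothing here is taken from your posts:

```python
# Toy model: shards evaluate (state, action) pairs, activations depend on
# recent contexts, and the policy argmaxes the activation-weighted utility.
GAMMA = 0.5  # time-decay factor for averaging activations over the history

def shard_hangout(state, action):     # u_1: values hanging out with Alice
    return 1.0 if action == "call_alice" else 0.0

def shard_work(state, action):        # u_2: values getting work done
    return 1.0 if action == "write_report" else 0.0

def act_hangout(context):             # alpha_1: activated when Alice is around
    return 1.0 if context == "talking_to_alice" else 0.1

def act_work(context):                # alpha_2: activated at the desk
    return 1.0 if context == "at_desk" else 0.1

SHARDS = [(shard_hangout, act_hangout), (shard_work, act_work)]
ACTIONS = ["call_alice", "write_report"]

def total_activation(act_fn, history):
    # Time-decayed average of the shard's activation across past contexts.
    weights = [GAMMA ** (len(history) - 1 - t) for t in range(len(history))]
    return sum(w * act_fn(c) for w, c in zip(weights, history)) / sum(weights)

def utility(history, action):
    # Activation-weighted average of the shards' evaluations of this action.
    activations = [total_activation(act, history) for _, act in SHARDS]
    values = [shard(history[-1], action) for shard, _ in SHARDS]
    return sum(a * v for a, v in zip(activations, values)) / sum(activations)

def policy(history):
    # The policy maximises the context-weighted utility function.
    return max(ACTIONS, key=lambda a: utility(history, a))

print(policy(["at_desk", "talking_to_alice"]))  # -> call_alice
print(policy(["talking_to_alice", "at_desk"]))  # -> write_report
```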
Why do you care that Geoffrey Hinton worries about AI x-risk?
Why do so many people in this community care that Hinton is worried about x-risk from AI?
Do people mention Hinton because they think it’s persuasive to the public?
Or persuasive to the elites?
Or do they think that Hinton being worried about AI x-risk is strong evidence for AI x-risk?
If so, why?
Is it because he is so intelligent?
Or because you think he has private information or intuitions?
Do you think he has good arguments in favour of AI x-risk?
Do you think he has a good understanding of the problem?
Do you update more on Hinton’s views than on Yann LeCun’s?
I’m inspired to write this because Hinton and Hopfield were just announced as the winners of the Nobel Prize in Physics. But I’ve been confused about these questions ever since Hinton went public with his worries. These questions are sincere (i.e. non-rhetorical), and I’d appreciate help on any/all of them. The phenomenon I’m confused about includes the other “Godfathers of AI” here as well, though Hinton is the main example.
Personally, I’ve updated very little on either LeCun’s or Hinton’s views, and I’ve never mentioned either person in any object-level discussion about whether AI poses an x-risk. My current best guess is that people care about Hinton only because it helps with public/elite outreach. This explains why activists tend to care more about Geoffrey Hinton than researchers do.
This is a Trump/Kamala debate from two LW-ish perspectives: https://www.youtube.com/watch?v=hSrl1w41Gkk
the base model is just predicting the likely continuation of the prompt. and it’s a reasonable prediction that, when an assistant is given a harmful instruction, they will refuse. this behaviour isn’t surprising.
it’s quite common for assistants to refuse instructions, especially harmful instructions. so i’m not surprised that base llms systematically refuse harmful instructions more than harmless ones.
yep, something like more carefulness, less “playfulness” in the sense of [Please don’t throw your mind away by TsviBT]. maybe bc AI safety is more professionalised nowadays. idk.
thanks for the thoughts. i’m still trying to disentangle what exactly i’m pointing at.
I don’t intend “innovation” to mean something normative like “this is impressive” or “this is research I’m glad happened” or anything. i mean something more low-level, almost syntactic. more like “here’s a new idea everyone is talking about”. this idea might be a threat model, or a technique, or a phenomenon, or a research agenda, or a definition, or whatever.
like, imagine your job was to maintain a glossary of terms in AI safety. i feel like new terms used to emerge quite often, but not any more (i.e. not for the past 6-12 months). do you think this is fair? i’m not sure how worrying this is, but i haven’t noticed others mentioning it.
NB: here’s 20 random terms I’m imagining included in the dictionary:
Evals
Mechanistic anomaly detection
Steganography
Glitch token
Jailbreaking
RSPs
Model organisms
Trojans
Superposition
Activation engineering
CCS
Singular Learning Theory
Grokking
Constitutional AI
Translucent thoughts
Quantilization
Cyborgism
Factored cognition
Infrabayesianism
Obfuscated arguments
I’ve added a fourth section to my post. It operationalises “innovation” as “non-transient novelty”. Some representative examples of an innovation would be:
I think these articles were non-transient and novel.
(1) Has AI safety slowed down?
There haven’t been any big innovations for 6-12 months. At least, it looks like that to me. I’m not sure how worrying this is, but I haven’t noticed others mentioning it. Hoping to get some second opinions.
Here’s a list of live agendas someone made on 27th Nov 2023: Shallow review of live agendas in alignment & safety. I think this covers all the agendas that exist today. Didn’t we use to get a whole new line-of-attack on the problem every couple months?
By “innovation”, I don’t mean something normative like “This is impressive” or “This is research I’m glad happened”. Rather, I mean something more low-level, almost syntactic, like “Here’s a new idea everyone is talking about”. This idea might be a threat model, or a technique, or a phenomenon, or a research agenda, or a definition, or whatever.
Imagine that your job was to maintain a glossary of terms in AI safety.[1] I feel like you would’ve been adding new terms quite consistently from 2018-2023, but things have dried up in the last 6-12 months.
(2) When did AI safety innovation peak?
My guess is Spring 2022, during the ELK Prize era. I’m not sure though. What do you guys think?
(3) What’s caused the slow down?
Possible explanations:
ideas are harder to find
people feel less creative
people are more cautious
more publishing in journals
research is now closed-source
we lost the mandate of heaven
the current ideas are adequate
paul christiano stopped posting
i’m mistaken, innovation hasn’t stopped
something else
(4) How could we measure “innovation”?
By “innovation” I mean non-transient novelty. An article is “novel” if it uses n-grams that previous articles didn’t use, and an article is “transient” if it uses n-grams that subsequent articles didn’t use. Hence, an article is non-transient and novel if it introduces a new n-gram which sticks around. For example, Gradient Hacking (Evan Hubinger, October 2019) was an innovative article, because the n-gram “gradient hacking” doesn’t appear in older articles, but appears often in subsequent articles. See below.
Barron et al. (2017) analysed 40,000 parliament speeches during the French Revolution. They introduce a metric, “resonance”, which is novelty (surprise of an article given the past articles) minus transience (surprise of an article given the subsequent articles). See below.
My claim is that recent AI safety research has been less resonant.
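For concreteness, here’s a rough Python sketch of the n-gram version of this metric. The tiny corpus and the choice of bigrams are placeholder assumptions, and Barron et al. actually measure novelty and transience with KL divergences between topic distributions rather than raw n-gram overlap:

```python
# "Resonance" = novelty - transience, computed here from n-gram overlap with
# earlier vs later articles in a date-ordered corpus. This is a crude stand-in
# for the surprise-based (KL divergence) definitions in Barron et al.
def ngrams(text, n=2):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def resonance(articles, i, n=2):
    target = ngrams(articles[i], n)
    past = set().union(*[ngrams(a, n) for a in articles[:i]]) if i > 0 else set()
    future = set().union(*[ngrams(a, n) for a in articles[i + 1:]]) if i < len(articles) - 1 else set()
    novelty = len(target - past) / len(target)       # share of n-grams no earlier article used
    transience = len(target - future) / len(target)  # share of n-grams no later article used
    return novelty - transience                      # innovative = novel but not transient

# Toy date-ordered corpus: the second article introduces phrases that the
# later articles keep using, so it should come out as the most resonant.
corpus = [
    "agents might be deceptively aligned during training",
    "gradient hacking means a model hacks its own gradient updates",
    "gradient hacking means a model hacks its own training signal",
    "detecting gradient hacking means auditing a model",
]
print([round(resonance(corpus, i), 2) for i in range(len(corpus))])
```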
[1] Here are 20 random terms that would be in the glossary, to illustrate what I mean:
Evals
Mechanistic anomaly detection
Steganography
Glitch token
Jailbreaking
RSPs
Model organisms
Trojans
Superposition
Activation engineering
CCS
Singular Learning Theory
Grokking
Constitutional AI
Translucent thoughts
Quantilization
Cyborgism
Factored cognition
Infrabayesianism
Obfuscated arguments
I don’t understand the s-risk consideration.
Suppose Alice lives naturally for 100 years and is cremated. And suppose Bob lives naturally for 40 years, then has his brain frozen for 60 years, and then has his brain cremated. The odds that Bob gets tortured by a spiteful AI should be pretty much exactly the same as for Alice. Basically, it’s the odds that spiteful AIs appear before 2034.
Thanks Tamsin! Okay, round 2.
My current understanding of QACI:
1. We assume a set $H$ of hypotheses about the world. We assume the oracle’s beliefs are given by a probability distribution $\mu \in \Delta(H)$.
2. We assume sets $Q$ and $A$ of possible queries and answers respectively. Maybe these are exabyte files, i.e. $Q = A = \{0,1\}^n$ for $n = 8 \times 10^{18}$.
3. Let $F$ be the set of mathematical formulae that Joe might submit. These formulae are given semantics $[\![f]\!] : H \to \Delta(A)$ for each formula $f \in F$.[1]
4. We assume a function $J : Q \times H \to \Delta(F)$, where $J(f \mid q, h)$ is the probability that Joe submits formula $f$ after reading query $q$, under hypothesis $h$.[2]
5. We define $\mathrm{QACI} : Q \to \Delta(A)$ as follows: sample $h \sim \mu$, then sample $f \sim J(q, h)$, then return $a \sim [\![f]\!](h)$.
6. For a fixed hypothesis $h$, we can interpret the answer $a \in A$ as a utility function $u^h_a : \Pi \to \mathbb{R}$ via some semantics $(a, h) \mapsto u^h_a$.
7. Then we define $U : \Pi \to \mathbb{R}$ via integrating over $H$, i.e. $U(\pi) = \mathbb{E}_{h \sim \mu}\,\mathbb{E}_{f \sim J(q_0, h)}\,\mathbb{E}_{a \sim [\![f]\!](h)}\big[u^h_a(\pi)\big]$, where $q_0$ is the fixed query we actually ask.
8. A policy $\pi^* \in \Pi$ is optimal if and only if $\pi^* \in \arg\max_{\pi \in \Pi} U(\pi)$.
The hope is that $\mu$, $J$, $[\![\cdot]\!]$, and $u$ can be defined mathematically. Then the optimality condition can be defined mathematically.
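To check that the types line up, here’s a toy Python sketch of steps 1-8. Everything in it (the hypothesis set, Joe’s formula-choosing function, the formula semantics, the answer-to-utility interpretation) is an arbitrary stand-in for objects that QACI treats as idealised and uncomputable:

```python
import random

# Toy stand-ins: hypotheses are integers, "formulae" and "answers" are small
# integers, and the utility function an answer encodes is an arbitrary scoring
# rule over a two-element policy set. None of this is the real construction.
HYPOTHESES = [0, 1, 2, 3]            # H, with mu uniform over it
POLICIES = ["policy_A", "policy_B"]  # Pi

def sample_hypothesis():             # h ~ mu
    return random.choice(HYPOTHESES)

def joe(q, h):                       # f ~ J(q, h): which formula Joe submits
    return hash((q, h)) % 5

def semantics(f, h):                 # [[f]](h): the answer the formula denotes
    return (f + h) % 3

def utility_from_answer(a, h):       # u^h_a : Pi -> R
    return lambda pi: float(a) if pi == "policy_A" else float(h)

def qaci(q):
    """Step 5: sample h ~ mu, then f ~ J(q, h), then return the answer (and h)."""
    h = sample_hypothesis()
    f = joe(q, h)
    return semantics(f, h), h

def U(pi, q="what should we value?", samples=10_000):
    """Step 7: Monte-Carlo stand-in for integrating u^h_a(pi) over mu."""
    total = 0.0
    for _ in range(samples):
        a, h = qaci(q)
        total += utility_from_answer(a, h)(pi)
    return total / samples

# Step 8: argmax over the (toy, enumerable) policy set.
print(max(POLICIES, key=U))
```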
Question 0
What if there’s no policy which maximises $U$? That is, for every policy $\pi$ there is another policy $\pi'$ such that $U(\pi') > U(\pi)$. I suppose this is less worrying, but what if there are multiple policies which maximise $U$?
Question 1
In Step 7 above, you average all the utility functions together, whereas I suggested sampling a utility function. I think my solution might be safer.
Suppose the oracle puts 5% chance on hypotheses $h$ such that $J(q_0, h)$ is malign. I think this is pretty conservative, because the Solomonoff predictor is malign, and because of some of the concerns Evhub raises here. And the QACI amplification might not preserve benignity. It follows that, under your solution, $U$ is influenced by a coalition of malign agents, and similarly the optimal policy $\pi^*$ is influenced by the malign coalition.
By contrast, I suggest sampling $h \sim \mu$ and then finding a policy which maximises the utility function $u^h_a$ for that single hypothesis. This should give us a benign policy with 95% chance, which is pretty good odds. Is this safer? Not sure.
Question 2
I think the function $[\![\cdot]\!]$ doesn’t work, i.e. there won’t be a way to mathematically define the semantics of the formula language. In particular, the formula language must be strictly weaker than the meta-language in which you are hoping to define $[\![\cdot]\!]$ itself. This is because of Tarski’s Undefinability of Truth (and other no-go theorems).
This might seem pedantic, but it matters in practical terms: there’s no formula whose semantics is QACI itself. You can see this via a diagonal proof: imagine if Joe always writes a formal expression that diagonalises against QACI’s own answer.
The most elegant solution is probably transfinite induction, but this would give us a separate $\mathrm{QACI}_\alpha$ for each ordinal $\alpha$.
Question 3
If you have an ideal reasoner, why bother with reward functions when you can just straightforwardly do untractable-to-naively-compute utility functions?
I want to understand how QACI and prosaic ML map onto each other. As far as I can tell, issues with QACI will be analogous to issues with prosaic ML and vice-versa.
Question 4
I still don’t understand why we’re using QACI to describe a utility function over policies, rather than using QACI in a more direct approach.
Here’s one approach. We pick a policy $\pi$ which maximises the score $\mathrm{QACI}(q_\pi)$, where $q_\pi$ is a query describing the policy $\pi$ (say, “How good is the policy $\pi$?”).[3] The advantage here is that Joe doesn’t need to reason about utility functions over policies, he just needs to reason about a single policy in front of him.
Here’s another approach. We use QACI as our policy directly. That is, in each context $c$ that the agent finds themselves in, they sample an action from $\mathrm{QACI}(q_c)$, where $q_c$ is a query describing the context $c$, and take the resulting action.[4] The advantage here is that Joe doesn’t need to reason about policies whatsoever, he just needs to reason about a single context in front of him. This is also the most “human-like” approach, because there are no argmaxes (except if Joe submits a formula with an argmax).
Here’s another approach. In each context $c$, the agent takes the action $a$ which maximises $\mathrm{QACI}(q_{c,a})$, where $q_{c,a}$ is a query asking how good action $a$ is in context $c$.
Etc.
Happy to jump on a call if that’s easier.
[1] I think you would say $[\![f]\!] : H \to A$. I’ve added the $\Delta$, which simply amounts to giving Joe access to a random number generator. My remarks apply if $[\![f]\!] : H \to A$ also.
[2] I think you would say $J : Q \times H \to F$. I’ve added the $\Delta$, which simply amounts to including hypotheses in which Joe is stochastic. But my remarks apply if $J : Q \times H \to F$ also.
[3] By this I mean either:
(1) Sample $h \sim \mu$, then maximise the function $\pi \mapsto \mathbb{E}\big[\mathrm{QACI}(q_\pi) \mid h\big]$.
(2) Maximise the function $\pi \mapsto \mathbb{E}_{h \sim \mu}\,\mathbb{E}\big[\mathrm{QACI}(q_\pi) \mid h\big]$.
For reasons I mentioned in Question 1, I suspect (1) is safer, but (2) is closer to your original approach.
[4] I would prefer the agent samples $h \sim \mu$ once at the start of deployment, and reuses the same hypothesis at each time-step. I suspect this is safer than resampling at each time-step, for reasons discussed before.
First, proto-languages are not attested. This means that we have no example of writing in any proto-language.
A parent language is typically called “proto-” if the comparative method is our primary evidence about it — i.e. the term is (partially) epistemological metadata.
Proto-Celtic has no direct attestation whatsoever.
Proto-Norse (the parent of Icelandic, Danish, Norwegian, Swedish, etc) is attested, but the written record is pretty scarce, just a few inscriptions.
Proto-Romance (the parent of French, Italian, Spanish, etc) has an extensive written record. More commonly known as “Latin”.
I think the existence of Latin as Proto-Romance has an important epistemological upshot:
Let’s say we want to estimate how accurately we have reconstructed Proto-Celtic. Well, we can apply the same method used to reconstruct Proto-Celtic to reconstructing Proto-Romance. We can then evaluate our reconstruction of Proto-Romance against the written record of Latin. This gives us an estimate of how well our Proto-Celtic reconstruction would fare if we discovered a written record tomorrow.
I want to better understand how QACI works, and I’m gonna try Cunningham’s Law. @Tamsin Leake.
QACI works roughly like this:
1. We find a competent honourable human $H$, like Joe Carlsmith or Wei Dai, and give them a rock engraved with a 2048-bit secret key. We define $H^+$ as the serial composition of a bajillion copies of $H$.
2. We want a model $M$ of the agent $H^+$. In QACI, we get $M$ by asking a Solomonoff-like ideal reasoner for their best guess about $H^+$ after feeding them a bunch of data about the world and the secret key.
3. We then ask $M$ the question “What’s the best reward function to maximise?” to get a reward function $r$. We then train a policy $\pi$ to maximise the reward function $r$. In QACI, we use some perfect RL algorithm. If we’re doing model-free RL, then $\pi$ might be AIXI (plus some patches). If we’re doing model-based RL, then $\pi$ might be the argmax over expected discounted utility, but I don’t know where we’d get the world-model — maybe we ask $M$?
So, what’s the connection between the final policy $\pi$ and the competent honourable human $H$? Well, overall, $\pi$ maximises a reward function specified by the ideal reasoner’s estimation of the serial composition of a bajillion copies of $H$. Hmm.
Questions:
Is this basically IDA, where Step 1 is serial amplification, Step 2 is imitative distillation, and Step 3 is reward modelling?
Why not replace Step 1 with Strong HCH or some other amplification scheme?
What does “bajillion” actually mean in Step 1?
Why are we doing Step 3? Wouldn’t it be better to just use $M$ directly as our superintelligence? It seems sufficient to achieve radical abundance, life extension, existential security, etc.
What if there’s no reward function that should be maximised? Presumably the reward function $r$ would need to be “small”, i.e. less than an exabyte, which imposes a maybe-unsatisfiable constraint.
Why not ask $M$ for the policy $\pi$ directly? Or for some instruction for constructing $\pi$? The instruction could be “Build the policy using our super-duper RL algo with the following reward function...”, but it could be anything.
Why is there no iteration, like in IDA? For example, after Step 2, we could loop back to Step 1 but reassign $H$ as $H$ with oracle access to $M$.
Why isn’t Step 3 recursive reward modelling? I.e. we could collect a bunch of trajectories from $\pi$ and ask $M$ to use those trajectories to improve the reward function $r$.
I’m very confused about current AI capabilities and I’m also very confused why other people aren’t as confused as I am. I’d be grateful if anyone could clear up either of these confusions for me.
How is it that AI is seemingly superhuman on benchmarks, but also pretty useless?
For example:
o3 scores higher on FrontierMath than the top graduate students
No current AI system could generate a research paper that would receive anything but the lowest possible score from each reviewer
If either of these statements is false (they might be—I haven’t been keeping up on AI progress), then please let me know. If the observations are true, what the hell is going on?
If I were trying to forecast AI progress in 2025, I would be spending all my time trying to reconcile these two observations.