DMs open.
Cleo Nardo
Yep, my point is that there’s no physical notion of being “offered” a menu of lotteries which doesn’t leak information. IIA will not be satisfied by any physical process which corresponds to offering the decision-maker a menu of options. Happy to discuss any specific counter-example.
Of course, you can construct a mathematical model of the physical process, and this model might be an informative object to study, but it would be begging the question if the mathematical model baked in IIA somewhere.
Must humans obey the Axiom of Irrelevant Alternatives?
Suppose you would choose option A from options A and B. Then you wouldn’t choose option B from options A, B, C. Roughly speaking, whether you prefer option A or B is independent of whether I offer you an irrelevant option C. This is an axiom of rationality called Independence of Irrelevant Alternatives (IIA). Should humans follow this? Maybe not.
Maybe C includes additional information which makes it clear that B is better than A.
Consider the following options:
(A) £10 bet that 1+1=2
(B) £30 bet that the smallest prime factor in 1019489 ends in the digit 1
Now, I would prefer A to B. Firstly, if 1019489 is itself prime then I lose the bet. Secondly, if 1019489 isn’t prime, then there’s 25% chance that its smallest prime factor ends in 1. That’s because all prime numbers greater than 5 end in 1, 3, 7 or 9 — and Dirichlet’s theorem states that primes are equally distributed among these possible endings. So the chance of winning the bet is slightly less than 25%, and £10 is better than a 25% chance of winning £30. Presented with this menu, I would probably choose option A.
But now consider the following options:
(A) £10 bet that 1+1=2
(B) £30 bet that the smallest prime factor in 1019489 ends in the digit 1
(C) £20 bet that 1019489 = 71 * 83 * 173
Well, which is the best option? B is preferable to C because B has both a weaker condition and a higher payout. And C is preferable to A, because my credence that 1019489 = 71 * 83 * 173 is higher than 50%. Presented with this menu, I would probably choose option B.
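Here is a minimal sketch of the reasoning above in Python. The payouts and the 25% prior come from the text; the 0.6 credence after seeing option C is an illustrative assumption (the text only says "higher than 50%"), and the helper names are hypothetical.

```python
def smallest_prime_factor(n):
    """Trial division: return the smallest prime factor of n (n itself if prime)."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return n


def expected_values(p_b, p_c=None):
    """Expected payout in pounds for each option on the menu.

    p_b: credence that the smallest prime factor of 1019489 ends in 1.
    p_c: credence that 1019489 = 71 * 83 * 173 (None if C is not on the menu).
    """
    values = {"A": 10.0, "B": 30.0 * p_b}
    if p_c is not None:
        values["C"] = 20.0 * p_c
    return values


def choose(values):
    """Pick the option with the highest expected payout."""
    return max(values, key=values.get)


# Menu {A, B}: only the 25% prior from Dirichlet's theorem is available.
before = expected_values(p_b=0.25)

# Menu {A, B, C}: seeing C raises the credence in the factorisation to an
# assumed 0.6, and the factorisation implies B's condition, so p_b >= p_c.
after = expected_values(p_b=0.6, p_c=0.6)

print(choose(before), choose(after))
print(71 * 83 * 173 == 1019489, smallest_prime_factor(1019489))
```

Under these assumed credences the choice flips from A to B once C appears, even though C itself is never chosen. The last line checks the arithmetic: the factorisation in option C is in fact correct, and the smallest prime factor is 71.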
Clymer et al. injected noise into the activations of Llama-13B and were able to correctly classify models that were faking alignment in 98% of model pairs (out of 324 combinations of pairs).
What’s the apples-to-apples comparison? i.e. recovery_rate after steering with a random vector at layer 50 of llama-3.3-70B-instruct on this particular dataset
This metric also ignores invalid answers (refusals or gibberish).
If you don’t ignore invalid answers, do the results change significantly?
the scope insensitive humans die and their society is extinguished
Ah, your reaction makes more sense given you think this is the proposal. But it’s not the proposal. The proposal is that the scope-insensitive values flourish on Earth, and the scope-sensitive values flourish in the remaining cosmos.
As a toy example, imagine a distant planet with two species of alien: paperclip-maximisers and teacup-protectors. If you offer a lottery to the paperclip-maximisers, they will choose the lottery with the highest expected number of paperclips. If you offer a lottery to the teacup-protectors, they will choose the lottery with the highest chance of preserving their holy relic, which is a particular teacup.
The paperclip-maximisers and the teacup-protectors both own property on the planet. They negotiate the following deal: the paperclip-maximisers will colonise the cosmos, but leave the teacup-protectors a small sphere around their home planet (e.g. 100 light-years across). Moreover, the paperclip-maximisers promise not to do anything that risks their teacup, e.g. choosing a lottery that doubles the size of the universe with 60% chance and destroys the universe with 40% chance.
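A rough sketch of how the two decision rules come apart on the 60/40 lottery, in Python. It assumes that paperclips scale linearly with the size of the universe (normalised to 1.0 for the status quo) and that destroying the universe destroys the teacup; the function names are hypothetical.

```python
def expected_paperclips(lottery):
    """Paperclip-maximiser's score: expected number of paperclips."""
    return sum(p * outcome["paperclips"] for p, outcome in lottery)


def teacup_survival(lottery):
    """Teacup-protector's score: probability that the teacup is preserved."""
    return sum(p for p, outcome in lottery if outcome["teacup_intact"])


# Each lottery is a list of (probability, outcome) pairs.
status_quo = [(1.0, {"paperclips": 1.0, "teacup_intact": True})]
gamble = [
    (0.6, {"paperclips": 2.0, "teacup_intact": True}),   # universe doubles
    (0.4, {"paperclips": 0.0, "teacup_intact": False}),  # universe destroyed
]

maximiser_takes_gamble = expected_paperclips(gamble) > expected_paperclips(status_quo)
protector_takes_gamble = teacup_survival(gamble) > teacup_survival(status_quo)
print(maximiser_takes_gamble, protector_takes_gamble)
```

Under these assumptions the maximiser would take the gamble by its own lights and the protector would refuse it, which is why the promise not to take such lotteries is a real concession in the deal.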
Do you have intuitions that the paperclip-maximisers are exploiting the teacup-protectors in this deal?
Do you think instead that the paperclip-maximisers should fill the universe with half paperclips and half teacups?
I think this scenario is a better analogy than the scenario with the drought. In the drought scenario, there is an objective fact which the nearby villagers are ignorant of, and they would act differently if they knew this fact. But I don’t think scope-sensitivity is a fact like “there will be a drought in 10 years”. Rather, scope-sensitivity is a property of a utility function (or a value system, more generally).
If you think you have a clean resolution to the problem, please spell it out more explicitly. We’re talking about a situation where a scope insensitive value system and scope sensitive value system make a free trade in which both sides gain by their own lights. Can you spell out why you classify this as treachery? What is the key property that this shares with more paradigm examples of treachery (e.g. stealing, lying, etc)?
I think it’s more patronising to tell scope-insensitive values that they aren’t permitted to trade with scope-sensitive values, but I’m open to being persuaded otherwise.
I mention this in (3).
I used to think that there was some idealisation process P such that we should treat agent A in the way that P(A) would endorse, but see On the limits of idealized values by Joseph Carlsmith. I’m increasingly sympathetic to the view that we should treat agent A in the way that A actually endorses.
Would it be nice for EAs to grab all the stars? I mean “nice” in Joe Carlsmith’s sense. My immediate intuition is “no that would be power grabby / selfish / tyrannical / not nice”.
But I have a countervailing intuition:
“Look, these non-EA ideologies don’t even care about stars. At least, not like EAs do. They aren’t scope sensitive or zero time-discounting. If the EAs could negotiate credible commitments with these non-EA values, then we would end up with all the stars, especially those most distant in time and space.
Wouldn’t it be presumptuous for us to project scope-sensitivity onto these other value systems?”
Not sure what to think tbh. I’m increasingly leaning towards the second intuition, but here are some unknowns:
Empirically, is it true that non-EAs don’t care about stars? My guess is yes, I could buy future stars from people easily if I tried. Maybe OpenPhil can organise a negotiation between their different grantmakers.
Are these negotiations unfair? Maybe because EAs have more “knowledge” about the feasibility of space colonisation in the near future. Maybe because EAs have more “understanding” of numbers like 10^40 (though I’m doubtful because scientists understand magnitudes and they aren’t scope sensitive).
Should EAs negotiate with these value systems as they actually are (the scope insensitive humans working down the hall) or instead with some “ideal” version of these value systems (a system with all the misunderstandings and irrationalities somehow removed)? My guess is that “ideal” here is bullshit, and also it strikes me as a patronising way to treat people.
Minimising some term like , with , where the standard deviation and expectation are taken over the batch.
Why does this make tend to be small? Wouldn’t it just encourage equally-sized jumps, without any regard for the size of those jumps?
Will AI accelerate biomedical research at companies like Novo Nordisk or Pfizer? I don’t think so. If OpenAI or Anthropic built a system that could accelerate R&D by more than 2x, they aren’t releasing it externally.
Maybe the AI company deploys the AI internally, with their own team accounting for 90%+ of the biomedical innovation.
I saw someone use OpenAI’s new Operator model today. It couldn’t order a pizza by itself. Why is AI in the bottom percentile of humans at using a computer, and top percentile at solving maths problems? I don’t think maths problems are shorter horizon than ordering a pizza, nor easier to verify.
Your answer was helpful but I’m still very confused by what I’m seeing.
I don’t think this works when the AIs are smart and reasoning in-context, which is the case where scheming matters. Also this maybe backfires by making scheming more salient.
Still, might be worth running an experiment.
The AI-generated prose is annoying to read. I haven’t read this closely, but my guess is these arguments also imply that CNNs can’t classify hand-drawn digits.
People often tell me that AIs will communicate in neuralese rather than tokens because it’s continuous rather than discrete.
But I think the discreteness of tokens is a feature not a bug. If AIs communicate in neuralese then they can’t make decisive arbitrary decisions, c.f. Buridan’s ass. The solution to Buridan’s ass is sampling from the softmax, i.e. communicate in tokens.
Also, discrete tokens are more tolerant to noise than the continuous activations, c.f. digital circuits are almost always more efficient and reliable than analogue ones.
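The Buridan’s ass point can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the "neuralese" message is modelled as a probability-weighted average of two option embeddings, which is a simplification and not a claim about how any real model works, and the names below are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def continuous_message(logits, embeddings):
    """Assumed 'neuralese' channel: pass on the probability-weighted
    average of the option embeddings instead of committing to one."""
    probs = softmax(logits)
    dim = len(embeddings[0])
    return [sum(p * e[i] for p, e in zip(probs, embeddings)) for i in range(dim)]


def sample_token(logits, rng):
    """Token channel: sample one discrete option from the softmax."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Two equally attractive bales of hay: left (-1) and right (+1).
logits = [0.0, 0.0]
embeddings = [[-1.0], [1.0]]

blended = continuous_message(logits, embeddings)  # stays exactly in the middle
token = sample_token(logits, random.Random(0))    # commits to index 0 or 1
print(blended, token)
```

With tied logits the averaged message is the midpoint between the two options, while sampling commits to one of them, arbitrarily but decisively.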
Anthropic has a big advantage over their competitors because they are nicer to their AIs. This means that their AIs are less incentivised to scheme against them, and also the AIs of competitors are incentivised to defect to Anthropic. Similar dynamics applied in WW2 and the Cold War: e.g. Jewish scientists fled Nazi Germany to the US because the US was nicer to them, and Soviet scientists covered up their mistakes to avoid punishment.
I think it’s a mistake to naïvely extrapolate the current attitudes of labs/governments towards scaling into the near future, e.g. 2027 onwards.
A sketch of one argument:
I expect there will be a firehose of blatant observations that AIs are misaligned/scheming/incorrigible/unsafe — if they indeed are. So I want the decisions around scaling to be made by people exposed to that firehose.
A sketch of another:
Corporations mostly acquire resources by offering services and products that people like. Governments mostly acquire resources by coercing their citizens and other countries.
Another:
Coordination between labs seems easier than coordination between governments. The lab employees are pretty similar people, living in the same two cities, working at the same companies, attending the same parties, dating the same people. I think coordination between US and China is much harder.
In hindsight, the main positive impact of AI safety might be funnelling EAs into the labs, especially if alignment is easy-by-default.
I think many current goals of AI governance might be actively harmful, because they shift control over AI from the labs to USG.
This note doesn’t include any arguments, but I’m registering this opinion now. For a quick window into my beliefs, I think that labs will be increasingly keen to slow scaling, and USG will be increasingly keen to accelerate scaling.
The Hash Game: Two players alternate choosing an 8-bit number. After 40 turns each, the numbers are concatenated and hashed. If the hash is 0 then Player 1 wins, otherwise Player 2 wins. That is, Player 1 wins if hash(a1,b1,a2,b2,...a40,b40)=0. The Hash Game has the same branching factor and duration as chess, but there’s probably no way to play this game without brute-forcing the minimax algorithm.
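A minimal sketch of the Hash Game with brute-force minimax, in Python. The note doesn’t specify the hash function, so this assumes SHA-256 reduced to a single bit (so that "hash = 0" happens about half the time); the function names are hypothetical. The full game is far out of reach, so the usage lines solve a toy version with 2-bit numbers and two moves each.

```python
import hashlib


def hash_bit(moves):
    """Assumed hash: lowest bit of the first byte of SHA-256 over the
    concatenated moves (each move is an integer in 0..255)."""
    digest = hashlib.sha256(bytes(moves)).digest()
    return digest[0] & 1


def player1_can_force_win(moves, total_moves, num_choices):
    """Brute-force minimax: can Player 1 force hash_bit == 0 from here?

    moves is a tuple of the numbers played so far; Player 1 moves on
    even-length positions, Player 2 on odd-length positions.
    """
    if len(moves) == total_moves:
        return hash_bit(moves) == 0
    outcomes = (
        player1_can_force_win(moves + (m,), total_moves, num_choices)
        for m in range(num_choices)
    )
    if len(moves) % 2 == 0:
        return any(outcomes)  # Player 1 needs one winning move
    return all(outcomes)      # Player 2 needs one refuting move


def leaf_count(total_moves, num_choices):
    """Number of complete games the brute-force search has to consider."""
    return num_choices ** total_moves


# Toy version: 2-bit numbers, two moves each (4 moves in total).
print(player1_can_force_win((), total_moves=4, num_choices=4))

# Full game: 8-bit numbers, 40 moves each.
print(leaf_count(total_moves=80, num_choices=256) == 2 ** 640)
```

The search itself is ordinary minimax. The point of the game is that, if the hash behaves like a random function, there is no evaluation heuristic to prune with, so the 2^640 leaves of the full game cannot be avoided.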