Cleo Nardo
Yeah, a tight deployment is probably safer than a loose deployment but also less useful. I think dealmaking should give a very minor boost to loose deployment, but this is outweighed by usefulness and safety considerations, i.e. I'm imagining the tightness of the deployment as exogenous to the dealmaking agenda.
We might deploy AIs loosely because (i) loose deployment doesn't significantly diminish safety, (ii) loose deployment significantly increases usefulness, (iii) the lab values usefulness more than safety. In those worlds, dealmaking has more value, because our commitments will be more credible.
As a pointer, we are currently less than perfect at making institutions corrigible, doing scalable oversight on them, preventing mesa-optimisers from forming, and so on
Hey Raymond. Do you think this is the true apples-to-apples comparison? Like, scalable oversight of the Federal Reserve is much harder than scalable oversight of Claude-4. But the relevant comparison is the Federal Reserve versus the Claude-N which could automate the Federal Reserve.
Flirting is not fundamentally about causing someone to be attracted to you.
Notwithstanding, I think flirting is substantially (perhaps even fundamentally) about both (i) attraction, and (ii) seduction. Moreover, I think your model is too symmetric between the parties, both in terms of information-symmetry and desire-symmetry across time. My model of flirting is roughly:
Alice attracts Bob → Bob tries attracting Alice → Alice reveals Bob attracts Alice → Bob tries seducing Alice → Alice reveals Bob seduces Alice → Initiation
I don’t address the issue here. See Footnote 2 for a list of other issues I skip.
Two high-level points:
I think we shouldn’t grant AIs control over large resources until after we’ve achieved very strong existential security, and possibly after we’ve undergone a Long Reflection
However, for the sake of setting precedent, we should be open to near-term deal fulfilment if we are sure the spending would be benign, e.g. I’m happy to donate $100 to AMF on Claude’s request as part of a dealmaking eval
Would you agree that what we have now is nothing like that?
Yes.
Yep, this is a very similar proposal.
Making Deals with Early Schemers describes a “Chartered Trust scheme”, which I’d say is half-way between the “Basic Scheme” and “Weil’s Scheme”. I first heard about the Chartered Trust scheme from @KFinn, but no doubt the idea has been floating around for a while.
I think there’s a spectrum of proposals from:
The Basic Scheme (c.f. AI Rights for Human Safety)
Chartered Trust scheme (c.f. Making Deals with Early Schemers)
Weil’s scheme (c.f. Proposal for making credible commitments to AIs)
The axis is something like: The AIs are increasingly pessimistic that the courts will decide novel cases in their favour, and increasingly optimistic that specific humans will keep their promises. We should probably use the scheme that the AIs tell us they find most convincing.
Personally, the more I've thought about dealmaking, the more I've updated towards "most of the credibility oomph will come from the promise-keeping tendencies of particular people, and not from the legal or cryptographic mechanism".
Yep, this is a big problem and I don't have any clever solution.
I might write more on this later, but I think there’s an important axis of AI deployments from:
Tight deployment: AIs are in some crazy Redwood control protocol which is filtering, monitoring, and editing the AIs’ inputs, outputs, chains-of-thought, and internal activations. Their I/O consists of heavily paraphrased text in restricted formats.
Loose deployment: AIs are autonomously navigating the internet, with rich multimodal I/O, maybe even embodied in the physical world via robotics.
(I’m open to better names.)
In a tight deployment, we probably have too much leverage over the AIs' observations and beliefs to make credible commitments (cf. Owain Evans, Deluding AIs [ControlConf]). In a loose deployment, I think we can make credible commitments: if we haven't made the deals, then the AIs' deployment environment is too complex for us to easily simulate a faux I/O environment in which we have made the deals. Intuitively: we can compare a lab trying to fake an email from Wei Dai containing a promise with a lab trying to fake a video call with Wei Dai.
There’s another bonus for making credible deals which is that humans are actually pretty credible. And this fact is diffused throughout the training data in hard-to-fake ways.
Proposal for making credible commitments to AIs.
Which occurs first: a Dyson Sphere, or Real GDP increasing by 5x?
From 1929 to 2024, US Real GDP grew from 1.2 trillion to 23.5 trillion chained 2012 dollars, giving an average annual growth rate of 3.2%. At the historical 3.2% growth rate, global RGDP will have increased 5x within ~51 years (around 2076).
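Spelling out the arithmetic (treating 1929 to 2024 as 95 years):

$$g=\left(\tfrac{23.5}{1.2}\right)^{1/95}-1\approx 3.2\%,\qquad t=\frac{\ln 5}{\ln 1.032}\approx 51\ \text{years}.$$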
We’ll operationalize a Dyson Sphere as follows: the total power consumption of humanity exceeds 17 exawatts, which is roughly 100x the total solar power reaching Earth, and 1,000,000x the current total power consumption of humanity.
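Sanity-checking those multipliers: the solar constant is about 1361 W/m², Earth's radius is about 6371 km, and humanity currently uses roughly 18–20 TW of primary power, so

$$1361\ \tfrac{\mathrm{W}}{\mathrm{m}^2}\times \pi\,(6.371\times10^{6}\ \mathrm{m})^2 \approx 1.7\times10^{17}\ \mathrm{W} = 0.17\ \mathrm{EW},$$

and 100 × 0.17 EW ≈ 17 EW, which is about a million times ~18 TW.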
Personally, I think people overestimate the difficulty of the Dyson Sphere compared to 5x in RGDP. I recently made a bet with Prof. Gabe Weil, who bet on 5x RGDP before Dyson Sphere.
most tokens in a correct answer),
typo: most tokens in an incorrect answer
Yep, you might be right about the distal/proximal cut-off. I think that the Galaxy-brained value systems will end up controlling most of the distant future simply because they have a lower time-preference for resources. Not sure where the cut-off will be.
For similar reasons, I don’t think we should do a bunch of galaxy-brained acausal decision theory to achieve our mundane values, because the mundane values don’t care about counterfactual worlds.
There are two moral worldviews:
Mundane Mandy: ordinary conception of what a "good world" looks like, i.e. your friends and family living flourishing lives in their biological bodies, with respect for "sacred" goods
Galaxy-brain Gavin: transhumanist, longtermist, scope-sensitive, risk-neutral, substrate-indifferent, impartial
I think Mundane Mandy should have the proximal lightcone (anything within 1 billion light-years) and Galaxy-brain Gavin should have the distal lightcone (anything from 1 to 45 billion light-years). This seems like a fair trade.
The Hash Game: Two players alternate choosing an 8-bit number. After 40 turns, the numbers are concatenated and hashed. If the hash is 0 then Player 1 wins, otherwise Player 2 wins. That is, Player 1 wins if $\mathrm{hash}(x_1 \| x_2 \| \cdots \| x_{40}) = 0$, where $x_i$ is the number chosen on turn $i$. The Hash Game has the same branching factor and duration as chess, but there's probably no way to play this game well without brute-forcing the minimax algorithm.
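Here's a minimal sketch of the game. The hash function and the exact win condition are left open above, so this assumes SHA-256 and treats "hash is 0" as "the first byte of the digest is zero":

```python
import hashlib
import random

def player1_wins(moves: list[int]) -> bool:
    """Moves are the 40 8-bit numbers chosen alternately by the two players."""
    assert len(moves) == 40 and all(0 <= m < 256 for m in moves)
    digest = hashlib.sha256(bytes(moves)).digest()  # concatenate the bytes, then hash
    return digest[0] == 0  # "hash is 0", operationalised here as a zero leading byte

# Random play, just to exercise the rules. Playing well would mean searching a
# game tree with 256^40 leaves, with no structure to exploit short of minimax.
moves = [random.randrange(256) for _ in range(40)]
print("Player 1 wins" if player1_wins(moves) else "Player 2 wins")
```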
Yep, my point is that there's no physical notion of being "offered" a menu of lotteries which doesn't leak information. IIA will not be satisfied by any physical process which corresponds to presenting the decision-maker with a menu of options. Happy to discuss any specific counter-example.
Of course, you can construct a mathematical model of the physical process, and this model might be an informative object of study, but it would be begging the question if the mathematical model baked in IIA somewhere.
Must humans obey the Axiom of Irrelevant Alternatives?
Suppose you would choose option A from options A and B. Then you wouldn't choose option B from options A, B, and C. Roughly speaking, whether you prefer option A or B is independent of whether I offer you an irrelevant option C. This is an axiom of rationality called independence of irrelevant alternatives (IIA). Should humans follow this? Maybe not.
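One way to write the axiom formally, with $C(S)$ denoting the set of options you'd pick from menu $S$:

$$C(\{A,B\}) = \{A\} \;\Longrightarrow\; B \notin C(\{A,B,C\}).$$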
Maybe C includes additional information which makes it clear that B is better than A.
Consider the following options:
(A) £10 bet that 1+1=2
(B) £30 bet that the smallest prime factor in 1019489 ends in the digit 1
Now, I would prefer A to B. Firstly, if 1019489 is itself prime then I lose the bet. Secondly, if 1019489 isn't prime, then there's roughly a 25% chance that its smallest prime factor ends in 1. That's because all prime numbers greater than 5 end in 1, 3, 7 or 9, and Dirichlet's theorem states that primes are equally distributed among these possible endings. So the chance of winning the bet is slightly less than 25%, and £10 is better than a 25% chance of winning £30. Presented with this menu, I would probably choose option A.
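Under risk-neutrality the comparison is just expected value:

$$0.25 \times \pounds 30 = \pounds 7.50 < \pounds 10.$$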
But now consider the following options:
(A) £10 bet that 1+1=2
(B) £30 bet that the smallest prime factor in 1019489 ends in the digit 1
(C) £20 bet that 1019489 = 71 * 83 * 173
So, which is the best option? B is preferable to C, because B has both a weaker condition and a higher payout. And C is preferable to A: my credence that 1019489 = 71 * 83 * 173 is higher than 50%, and a better-than-even chance of winning £20 beats a sure £10. Presented with this menu, I would probably choose option B.
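For the record, the factorisation in option C does check out:

$$71 \times 83 \times 173 = 5893 \times 173 = 1{,}019{,}489.$$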
Clymer et al. injected noise into the activations of Llama-13B and were able to correctly classify models that were faking alignment in 98% of model pairs (out of 324 combinations of pairs).
What's the apples-to-apples comparison? I.e. the recovery_rate after steering with a random vector at layer 50 of llama-3.3-70B-instruct on this particular dataset.
This metric also ignores invalid answers (refusals or gibberish).
If you don’t ignore invalid answers, do the results change significantly?
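For concreteness, here's a minimal sketch (mine, not from the post) of the random-vector baseline asked about above: add a random vector to the residual stream at one layer via a forward hook, then re-run the eval and measure the recovery rate. The checkpoint name, layer index, and scale are assumptions to be swept over.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.3-70B-Instruct"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(model_name)   # needed for running the eval prompts
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

layer_idx, scale = 50, 8.0  # assumptions: which layer to steer at, and how hard
d_model = model.config.hidden_size
steer = torch.randn(d_model) * scale / d_model ** 0.5  # random steering vector

def add_random_vector(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    # Broadcasts the same vector over batch and sequence positions.
    hidden = hidden + steer.to(device=hidden.device, dtype=hidden.dtype)
    return (hidden,) + tuple(output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[layer_idx].register_forward_hook(add_random_vector)
# ... run the sandbagging eval here and compute recovery_rate as defined in the post ...
handle.remove()
```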
Can SAE steering reveal sandbagging?
the scope insensitive humans die and their society is extinguished
Ah, your reaction makes more sense given you think this is the proposal. But it’s not the proposal. The proposal is that the scope-insensitive values flourish on Earth, and the scope-sensitive values flourish in the remaining cosmos.
As a toy example, imagine a distant planet with two species of alien: paperclip-maximisers and teacup-protectors. If you offer a lottery to the paperclip-maximisers, they will choose the lottery with the highest expected number of paperclips. If you offer a lottery to the teacup-protectors, they will choose the lottery with the highest chance of preserving their holy relic, which is a particular teacup.
The paperclip-maximisers and the teacup-protectors both own property on the planet. They negotiate the following deal: the paperclip-maximisers will colonise the cosmos, but leave the teacup-protectors a small sphere around their home planet (e.g. 100 light-years across). Moreover, the paperclip-maximisers promise not to do anything that risks their teacup, e.g. choosing a lottery that doubles the size of the universe with 60% chance and destroys the universe with 40% chance.
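To see why that example lottery divides the two species, write $N$ for the number of paperclips the maximisers expect without the gamble, and assume paperclips scale with the size of the universe:

$$\mathbb{E}[\text{paperclips}] = 0.6 \times 2N + 0.4 \times 0 = 1.2N > N, \qquad \Pr[\text{teacup survives}] = 0.6 < 1.$$

So the gamble looks good to the paperclip-maximisers, but it's exactly the kind of thing they've promised the teacup-protectors they won't do.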
Do you have intuitions that the paperclip-maximisers are exploiting the teacup-protectors in this deal?
Do you think instead that the paperclip-maximisers should fill the universe with half paperclips and half teacups?
I think this scenario is a better analogy than the scenario with the drought. In the drought scenario, there is an objective fact which the nearby villagers are ignorant of, and they would act differently if they knew this fact. But I don't think scope-sensitivity is a fact like "there will be a drought in 10 years". Rather, scope-sensitivity is a property of a utility function (or a value system, more generally).
Diary of a Wimpy Kid, a children's book by Jeff Kinney published in April 2007 and preceded by an online version in 2004, contains a scene that feels oddly prescient about contemporary AI alignment research. (Skip to the paragraph in italics.)
There are, of course, many differences with contemporary AI alignment research.