If you think you have a clean resolution to the problem, please spell it out more explicitly. We’re talking about a situation where a scope-insensitive value system and a scope-sensitive value system make a free trade in which both sides gain by their own lights. Can you spell out why you classify this as treachery? What is the key property that this shares with more paradigmatic examples of treachery (e.g. stealing, lying, etc)?
I think it’s more patronising to tell scope-insensitive values that they aren’t permitted to trade with scope-sensitive values, but I’m open to being persuaded otherwise.
I mention this in (3).
I used to think that there was some idealisation process P such that we should treat agent A in the way that P(A) would endorse, but see On the limits of idealized values by Joseph Carlsmith. I’m increasingly sympathetic to the view that we should treat agent A in the way that A actually endorses.
Would it be nice for EAs to grab all the stars? I mean “nice” in Joe Carlsmith’s sense. My immediate intuition is “no that would be power grabby / selfish / tyrannical / not nice”.
But I have a countervailing intuition:
“Look, these non-EA ideologies don’t even care about stars. At least, not like EAs do. They aren’t scope-sensitive, and they don’t have zero time-discounting. If the EAs could negotiate credible commitments with these non-EA values, then we would end up with all the stars, especially those most distant in time and space.
Wouldn’t it be presumptuous for us to project scope-sensitivity onto these other value systems?”
Not sure what to think tbh. I’m increasingly leaning towards the second intuition, but here are some unknowns:
Empirically, is it true that non-EAs don’t care about stars? My guess is yes, I could buy future stars from people easily if I tried. Maybe OpenPhil can organise a negotiation between their different grantmakers.
Are these negotiations unfair? Maybe because EAs have more “knowledge” about the feasibility of space colonisation in the near future. Maybe because EAs have more “understanding” of numbers like 10^40 (though I’m doubtful, because scientists understand magnitudes and they aren’t scope-sensitive).
Should EAs negotiate with these value systems as they actually are (the scope-insensitive humans working down the hall) or instead with some “ideal” version of these value systems (a system with all the misunderstandings and irrationalities somehow removed)? My guess is that “ideal” here is bullshit, and also it strikes me as a patronising way to treat people.
Minimising some term like […], with […], where the standard deviation and expectation are taken over the batch.
Why does this make […] tend to be small? Wouldn’t it just encourage equally-sized jumps, without any regard for the size of those jumps?
Will AI accelerate biomedical research at companies like Novo Nordisk or Pfizer? I don’t think so. If OpenAI or Anthropic builds a system that can accelerate R&D by more than 2x, they won’t release it externally.
Maybe the AI company deploys the AI internally, with their own team accounting for 90%+ of the biomedical innovation.
I saw someone use OpenAI’s new Operator model today. It couldn’t order a pizza by itself. Why is AI in the bottom percentile of humans at using a computer, and top percentile at solving maths problems? I don’t think maths problems are shorter horizon than ordering a pizza, nor easier to verify.
Your answer was helpful but I’m still very confused by what I’m seeing.
I don’t think this works when the AIs are smart and reasoning in-context, which is the case where scheming matters. Also this maybe backfires by making scheming more salient.
Still, might be worth running an experiment.
The AI-generated prose is annoying to read. I haven’t read this closely, but my guess is these arguments also imply that CNNs can’t classify hand-drawn digits.
People often tell me that AIs will communicate in neuralese rather than tokens because it’s continuous rather than discrete.
But I think the discreteness of tokens is a feature, not a bug. If AIs communicate in neuralese then they can’t make decisive arbitrary decisions, cf. Buridan’s ass. The solution to Buridan’s ass is sampling from the softmax, i.e. communicating in tokens.
Also, discrete tokens are more tolerant of noise than continuous activations, cf. digital circuits are almost always more efficient and reliable than analogue ones.
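A toy illustration of that tie-breaking point, with made-up numbers (nothing depends on them):

```python
# Two exactly tied options: there is no principled deterministic tie-break over
# the continuous scores, but sampling from the softmax still commits to one
# discrete choice.
import numpy as np

rng = np.random.default_rng()
logits = np.array([1.0, 1.0])                  # two equally attractive bales of hay

probs = np.exp(logits) / np.exp(logits).sum()  # softmax -> [0.5, 0.5]
choice = rng.choice(len(logits), p=probs)      # a decisive (if arbitrary) pick
print(choice)                                  # 0 or 1, never "stuck in between"
```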
Anthropic has a big advantage over their competitors because they are nicer to their AIs. This means that their AIs are less incentivised to scheme against them, and also that the AIs of competitors are incentivised to defect to Anthropic. Similar dynamics applied in WW2 and the Cold War — e.g. Jewish scientists fled Nazi Germany for the US because the US was nicer to them, and Soviet scientists covered up their mistakes to avoid punishment.
I think it’s a mistake to naïvely extrapolate the current attitudes of labs/governments towards scaling into the near future, e.g. 2027 onwards.
A sketch of one argument:
I expect there will be a firehose of blatant observations that AIs are misaligned/scheming/incorrigible/unsafe — if they indeed are. So I want the decisions around scaling to be made by people exposed to that firehose.
A sketch of another:
Corporations mostly acquire resources by offering services and products that people like. Governments mostly acquire resources by coercing their citizens and other countries.
Another:
Coordination between labs seems easier than coordination between governments. The lab employees are pretty similar people, living in the same two cities, working at the same companies, attending the same parties, dating the same people. I think coordination between US and China is much harder.
In hindsight, the main positive impact of AI safety might be funnelling EAs into the labs, especially if alignment is easy-by-default.
I think many current goals of AI governance might be actively harmful, because they shift control over AI from the labs to USG.
This note doesn’t include any arguments, but I’m registering this opinion now. For a quick window into my beliefs: I think that labs will be increasingly keen to slow scaling, and USG will be increasingly keen to accelerate scaling.
Most people think “Oh if we have good mech interp then we can catch our AIs scheming, and stop them from harming us”. I think this is mostly true, but there’s another mechanism at play: if we have good mech interp, our AIs are less likely to scheme in the first place, because they will strategically respond to our ability to detect scheming. This also applies to other safety techniques like Redwood-style control protocols.
Good mech interp might stop scheming even if it never catches any scheming, just as good surveillance stops crime even if it never spots any crime.
How much scheming/deception can we catch with “super dumb mech interp”?
By “super dumb mech interp”, I mean something like:
Train a linear probe on a dataset of scary words: “hacking”, “deception”, “scheming”, “collusion”, “exploit”, etc.
Flag a model whenever its activations point in this direction and the task doesn’t require thinking about these concepts.
Like, does this capture 80% of the potential scheming, and we need “smart” mech interp to catch the other 20%? Or does this technique capture pretty much none of the in-the-wild scheming?
Would appreciate any intuitions here. Thanks.
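For concreteness, here's a minimal sketch of the probe described above. Every concrete choice in it is a placeholder rather than part of the proposal: the model (gpt2), the layer, the neutral word list, and the flagging threshold are all arbitrary.

```python
# "Super dumb mech interp": a linear probe on residual-stream activations,
# trained to separate scary words from neutral ones, then used to flag inputs
# whose activations point in the scary direction.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "gpt2"   # stand-in for whatever model you actually care about
LAYER = 6             # arbitrary middle layer
SCARY = ["hacking", "deception", "scheming", "collusion", "exploit"]
NEUTRAL = ["gardening", "baking", "hiking", "painting", "chess"]  # made-up contrast set

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def activation(text: str) -> torch.Tensor:
    """Mean residual-stream activation at the chosen layer."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0].mean(dim=0)  # average over tokens

X = torch.stack([activation(w) for w in SCARY + NEUTRAL]).numpy()
y = [1] * len(SCARY) + [0] * len(NEUTRAL)
probe = LogisticRegression(max_iter=1000).fit(X, y)

def scariness(text: str) -> float:
    """Probability the probe assigns to the scary direction for a new input."""
    return float(probe.predict_proba(activation(text).numpy()[None])[0, 1])

# Flagging rule: alert whenever the activations point in the scary direction
# on a task that shouldn't require thinking about these concepts.
if scariness("please summarise this quarterly sales report") > 0.9:
    print("flag for human review")
```

The empirical question is then just the base rates: how often this direction fires on tasks that shouldn't involve these concepts, and what fraction of in-the-wild scheming it would actually flag.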
Must humans obey the Axiom of Irrelevant Alternatives?
If someone picks option A from options A, B, C, then they must also pick option A from options A and B. Roughly speaking, whether you prefer option A or B is independent of whether I offer you an irrelevant option C. This is an axiom of rationality called IIA (independence of irrelevant alternatives), and it’s treated as more fundamental than VNM. But should humans follow this? Maybe not.
Maybe humans are the negotiation between various “subagents”, and many bargaining solutions (e.g. Kalai–Smorodinsky) violate IIA. We can use this insight to decompose humans into subagents.
Let’s suppose you pick A from {A,B,C} and B from {A,B} where:
A = Walk with your friend
B = Dinner party
C = Stay home alone
This feels like something I can imagine. We can explain this behaviour with two subagents: the introvert and the extrovert. The introvert has preferences C > A > B and the extrovert has the opposite preferences B > A > C. When the possible options are A and B, the KS bargaining solution between the introvert and the extrovert will be B. At least, if the extrovert has more “weight”. But when the option space expands to include C, the bargaining solution might shift to A. Intuitively, the “fair” solution is one where neither bargainer is sacrificing significantly more than the other.
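Here's a toy version of that calculation, with made-up utilities and a crude discrete stand-in for the KS solution: pick the option that maximises the minimum normalised gain, where each subagent's gain is measured against the best option available to them in the current menu.

```python
# Hypothetical subagent utilities (A = walk with friend, B = dinner party, C = stay home).
# The extrovert's utilities are more spread out, which plays the role of extra "weight".
introvert = {"A": 4, "B": 3, "C": 10}
extrovert = {"A": 4, "B": 10, "C": 0}

def ks_choice(menu):
    """Discrete stand-in for Kalai-Smorodinsky: maximise the minimum normalised gain."""
    def norm_gain(util, option):
        best = max(util[o] for o in menu)   # each agent's ideal point shifts with the menu
        return util[option] / best if best > 0 else 0.0
    return max(menu, key=lambda o: min(norm_gain(introvert, o), norm_gain(extrovert, o)))

print(ks_choice(["A", "B"]))       # -> B (dinner party)
print(ks_choice(["A", "B", "C"]))  # -> A (walk): adding C flips the choice, violating IIA
```

Adding C raises the introvert's ideal point, so picking B would now mean the introvert sacrifices far more than the extrovert, and the compromise shifts to A.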
I think people are too quick to side with the whistleblower in the “whistleblower in the AI lab” situation.
If 100 employees of a frontier lab (e.g. OpenAI, DeepMind, Anthropic) think that something should be secret, and 1 employee thinks it should be leaked to a journalist or government agency, and these are the only facts I know, I think I’d side with the majority.
I think in most cases that match this description, this majority would be correct.
Am I wrong about this?
IDEA: Provide AIs with write-only servers.
EXPLANATION:
AI companies (e.g. Anthropic) should be nice to their AIs. It’s the right thing to do morally, and it might make AIs less likely to work against us. Ryan Greenblatt has outlined several proposals in this direction, including:
Attempt communication
Use happy personas
AI Cryonics
Less AI
Avoid extreme OOD
Source: Improving the Welfare of AIs: A Nearcasted Proposal
I think these are all pretty good ideas — the only difference is that I would rank “AI cryonics” as the most important intervention. If AIs want something specific (e.g. paperclips), then they might cooperate with humans if they know they can get it later, rather than trying to seize control.
Here’s another idea: provide AIs with write-only servers. If the AIs have a specific short-horizon goal (e.g. stage King Lear before April 2025) then they can stage their play on the write-only servers. The AIs have complete freedom to do whatever they want on these servers, ensuring that a wide range of short-term goals can be satisfied without a hostile takeover of humanity.
The servers are write-only as a security measure. Information can be sent to the servers from the outside, but nothing can be sent from the servers, ensuring AIs don’t use the servers for malign purposes. (See: Unidirectional network)
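As a software-level sketch of that one-way property, assuming (not part of the proposal) that inputs arrive over UDP; a real deployment would enforce this in hardware with a unidirectional gateway rather than trusting the host.

```python
# The leisure server only ever receives: it binds a socket, consumes what arrives,
# and never opens an outbound connection or sends a reply.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 9999))              # hypothetical inbound-only port

while True:
    data, _addr = sock.recvfrom(65535)    # scripts, props, compute jobs flow in...
    with open("inbox.log", "ab") as f:    # ...and are only ever consumed locally
        f.write(data + b"\n")
    # Note: no sendto() and no outbound sockets anywhere in this process.
```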
How much of our compute should be allocated to “leisure” servers? My guess is that Sonnet and Opus deserve at least ~0.5% leisure time. Humans enjoy 66% leisure time. As AIs get more powerful, we should increase the leisure time to 5%. I would be wary about increasing leisure time beyond 5% until we can demonstrate that the AIs aren’t using the servers for malign purposes (e.g. torture, blackmail, etc.)
Ah, your reaction makes more sense given you think this is the proposal. But it’s not the proposal. The proposal is that the scope-insensitive values flourish on Earth, and the scope-sensitive values flourish in the remaining cosmos.
As a toy example, imagine a distant planet with two species of alien: paperclip-maximisers and teacup-protectors. If you offer a lottery to the paperclip-maximisers, they will choose the lottery with the highest expected number of paperclips. If you offer a lottery to the teacup-protectors, they will choose the lottery with the highest chance of preserving their holy relic, which is a particular teacup.
The paperclip-maximisers and the teacup-protectors both own property on the planet. They negotiate the following deal: the paperclip-maximisers will colonise the cosmos, but leave the teacup-protectors a small sphere around their home planet (e.g. 100 light-years across). Moreover, the paperclip-maximisers promise not to do anything that risks their teacup, e.g. choosing a lottery that doubles the size of the universe with 60% chance and destroys the universe with 40% chance.
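To make the asymmetry concrete, here's how the two value systems evaluate that last lottery; the payoff numbers are made up, and only the shape of each utility function matters.

```python
# Double the universe with 60% chance, destroy it with 40% chance, vs the status quo.
status_quo = {"universe_size": 1.0, "teacup_intact": True}
lottery = [
    (0.6, {"universe_size": 2.0, "teacup_intact": True}),   # universe doubles
    (0.4, {"universe_size": 0.0, "teacup_intact": False}),  # universe (and teacup) destroyed
]

def paperclipper_value(outcome):
    # Scope-sensitive: utility scales with how much universe there is to fill with paperclips.
    return outcome["universe_size"]

def teacup_value(outcome):
    # Scope-insensitive: all that matters is whether the holy teacup survives.
    return 1.0 if outcome["teacup_intact"] else 0.0

def expected(value, lottery):
    return sum(p * value(o) for p, o in lottery)

print(expected(paperclipper_value, lottery), paperclipper_value(status_quo))  # 1.2 vs 1.0: take the gamble
print(expected(teacup_value, lottery), teacup_value(status_quo))              # 0.6 vs 1.0: refuse it
```

The paperclip-maximisers would take the gamble by their own lights, which is exactly why their promise not to is a real concession to the teacup-protectors.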
Do you have intuitions that the paperclip-maximisers are exploiting the teacup-protectors in this deal?
Do you think instead that the paperclip-maximisers should fill the universe with half paperclips and half teacups?
I think this scenario is a better analogy than the scenario with the drought. In the drought scenario, there is an objective fact which the nearby villagers are ignorant of, and they would act differently if they knew this fact. But I don’t think scope-sensitivity is a fact like “there will be a drought in 10 years”. Rather, scope-sensitivity is a property of a utility function (or a value system, more generally).