I’m broadly interested in AI strategy and want to figure out the most effective interventions to get good AI outcomes.
Thomas Larsen
Introducing the Center for AI Policy (& we’re hiring!)
Long-Term Future Fund: April 2023 grant recommendations
The first line of defence is to avoid training models that have sufficient dangerous capabilities and misalignment to pose extreme risk. Sufficiently concerning evaluation results should warrant delaying a scheduled training run or pausing an existing one
It’s very disappointing to me that this sentence doesn’t say “cancel”. As far as I understand, most people on this paper agree that we do not have alignment techniques to align superintelligence. Therefore, if the model evaluations predict an AI that is sufficiently smarter than humans, the training run should be cancelled.
Deliberately create a (very obvious[2]) inner optimizer, whose inner loss function includes no mention of human values / objectives.[3]
Grant that inner optimizer ~billions of times greater optimization power than the outer optimizer.[4]
Let the inner optimizer run freely without any supervision, limits or interventions from the outer optimizer.[5]
I think that the conditions for an SLT to arise are weaker than you describe.
For (1), it’s unclear to me why you think you need to have this multi-level inner structure.[1] If, instead of reward circuitry inducing human values, evolution had directly selected over policies, I’d expect similar inner alignment failures. It’s also not necessary that the inner values of the agent make no mention of human values / objectives; the agent needs to both a) value them enough to not take over, and b) maintain these values post-reflection.
For (2), it seems like you are conflating ‘amount of real world time’ with ‘amount of consequences-optimization’. SGD is just a much less efficient optimizer than intelligent cognition—in-context learning happens much faster than SGD learning. When the inner optimizer starts learning and accumulating knowledge, it seems totally plausible to me that this will happen on much faster timescales than the outer selection.
For (3), I don’t think that the SLT requires the inner optimizer to run freely, it only requires one of:
a. the inner optimizer running much faster than the outer optimizer, such that the outer updates don’t occur in time (a toy sketch of this timescale gap is included after the footnote below), or
b. the inner optimizer doing gradient hacking / exploration hacking, such that the outer loss’s updates are ineffective.
[1] Evolution, of course, does have this structure, with 2 levels of selection; it just doesn’t seem like this is a relevant property for thinking about the SLT.
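To make the timescale point concrete, here is a toy sketch of condition (a). It is entirely my own construction with arbitrary stand-in objectives, not something from the original post: the inner learner takes many update steps for every single outer update, so the outer corrections never catch up.

```python
# Toy sketch (mine, not from the thread): the outer process "wants" x near 0,
# the inner learner "wants" x near 10. Both objectives are arbitrary stand-ins.
INNER_STEPS_PER_OUTER_STEP = 1_000
STEP_SIZE = 0.01

x = 0.0
for _ in range(10):
    # Fast inner loop: many small updates toward the inner objective (x -> 10).
    for _ in range(INNER_STEPS_PER_OUTER_STEP):
        x += STEP_SIZE * (10 - x)
    # Slow outer loop: one small corrective update toward the outer objective (x -> 0).
    x += STEP_SIZE * (0 - x)

print(f"final x = {x:.2f}")  # ends up ~9.9: on this timescale the inner objective wins
```

The specific numbers only matter through their ratio; the point is just that, per unit of real-world time, almost all of the optimization pressure comes from the inner loop.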
Sometimes, but the norm is to do 70%. This is mostly done on a case by case basis, but salient factors to me include:
Does the person need the money? (e.g. how high is the cost of living where they are, do they have a family, etc.)
What is the industry counterfactual? If someone would make $300k, we likely wouldn’t pay them 70% of that, while if their counterfactual was $50k, it feels more reasonable to pay them 100% (or even more).
How good is the research?
I’m a guest fund manager for the LTFF, and wanted to say that my impression is that the LTFF is often pretty excited about giving people ~6 month grants to try out alignment research at 70% of their industry counterfactual pay (the reason for the 70% is basically to prevent grift). Then, the LTFF can give continued support if they seem to be doing well. If getting this funding would make you excited to switch into alignment research, I’d encourage you to apply.
I also think that there’s a lot of impactful stuff to do for AI existential safety that isn’t alignment research! For example, I’m quite into people doing strategy, policy outreach to relevant people in government, actually writing policy, capability evaluations, and leveraged community building like CBAI.
Some claims I’ve been repeating in conversation a bunch:
Safety work (I claim) should be focused on one of the following:
1. CEV-style full value loading, to deploy a sovereign
2. A task AI that contributes to a pivotal act or pivotal process.
I think that pretty much no one is working directly on 1. I think that a lot of safety work is indeed useful for 2, but in this case, it’s useful to know what pivotal process you are aiming for. Specifically, why aren’t you just directly working to make that pivotal act/process happen? Why do you need an AI to help you? Typically, the response is that the pivotal act/process is too difficult to be achieved by humans. In that case, you are pushing into a difficult capabilities regime—the AI has some goals that do not equal humanity’s CEV, and so has a convergent incentive to powerseek and escape. With enough time or intelligence, you therefore get wrecked, but you are trying to operate in this window where your AI is smart enough to do the cognitive work, but is ‘nerd-sniped’ or focused on the particular task that you like. In particular, if this AI reflects on its goals and starts thinking big picture, you reliably get wrecked. This is one of the reasons that doing alignment research seems like a particularly difficult pivotal act to aim for.
(I deleted this comment)
Fwiw I’m pretty confident that if a top professor wanted funding at 50k/year to do AI Safety stuff they would get immediately funded, and that the bottleneck is that people in this reference class aren’t applying to do this.
There are also relevant mentorship/management bottlenecks here, so funding them to do their own research is generally a lot less costly overall than if it also required oversight. (written quickly, sorry if unclear)
Thinking about ethics.
After thinking more about orthogonality I’ve become more confident that one must go about ethics in a mind-dependent way. If I am arguing about what is ‘right’ with a paperclipper, there’s nothing I can say to them to convince them to instead value human preferences or whatever.
I used to be a staunch moral realist, mainly relying on very strong intuitions against nihilism, and then arguing something like “not nihilism → moral realism”. I now reject the implication, and think both that 1) there is no universal, objective morality, and 2) things matter.
My current approach is to think of “goodness” in terms of what CEV-Thomas would think of as good. Moral uncertainty, then, is uncertainty over what CEV-Thomas thinks. CEV is necessary to get morality out of a human brain, because what is currently in there is a bundle of contradictory heuristics. However, my moral intuitions still give bits about goodness. Other people’s moral intuitions also give some bits about goodness, because of how similar their brains are to mine, so I should weight other people’s beliefs in my moral uncertainty.
Ideally, I should trade with other people so that we both maximize a joint utility function, instead of each of us maximizing our own utility function. In the extreme, this looks like ECL. For most people, I’m not sure that this reasoning is necessary, however, because their intuitions might already be priced into my uncertainty over my CEV.
In real world computers, we have finite memory, so my reading of this was assuming a finite state space. The fractal stuff requires infinite sets, where two notions of “smaller” (‘is a proper subset of’ and ‘has fewer elements’) come apart: the mini-fractal is a proper subset of the whole fractal, but it has the same number of elements and hence can be put in perfect correspondence with the whole.
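As a toy illustration of the same point on a simpler infinite set (my example, not one from the discussion): the map
$$f:\mathbb{N}\to 2\mathbb{N},\qquad f(n)=2n$$
is a bijection between the natural numbers and the even numbers, so the evens are a proper subset of $\mathbb{N}$ (“smaller” in the subset sense) while having exactly as many elements (not smaller in the cardinality sense). No finite set has a proper subset of the same size, which is why the finite-memory reading blocks this kind of mini-fractal correspondence.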
Following up to clarify this: the point is that this attempt fails 2a because if you perturb the weights along the connection , there is now a connection from the internal representation of to the output, and so training will send this thing to the function .
(My take on the reflective stability part of this)
The reflective equilibrium of a shard-theoretic agent isn’t a utility function weighted according to each of the shards; it’s a utility function that mostly cares about some extrapolation of the (one or very few) shard(s) that were most tied to the reflective cognition.
It feels like a ‘let’s do science’ or ‘powerseek’ shard would be a lot more privileged, because these shards will be tied to the internal planning structure that ends up doing reflection for the first time.
There’s a huge difference between “Whenever I see ice cream, I have the urge to eat it”, and “Eating ice cream is a fundamentally morally valuable atomic action”. The former roughly describes one of the shards that I have, and the latter is something that I don’t expect to see in my CEV. Similarly, I imagine that a bunch of the safety properties will look more like these urges, because the shards will be relatively weak things that are bolted on to the main part of the cognition, not things that bid on the intelligent planning part. The non-reflectively endorsed shards will be seen as arbitrary code attached to the mind that the reflectively endorsed shards have to plan around (similar to how I see my “Whenever I see ice cream, I have the urge to eat it” shard).
In other words: there is convergent pressure for CEV-content integrity, but that does not mean that the current way of making decisions (e.g. shards) is close to the CEV optimum, and the shards will choose to self modify to become closer to their CEV.
I don’t feel epistemically helpless here either, and would love a theory of which shards get preserved under reflection.
Some rough takes on the Carlsmith Report.
Carlsmith decomposes AI x-risk into 6 steps, each conditional on the previous ones (a toy product-of-conditionals sketch follows the list):
Timelines: By 2070, it will be possible and financially feasible to build APS-AI: systems with advanced capabilities (outperform humans at tasks important for gaining power), agentic planning (make plans and then act on them), and strategic awareness (its plans are based on models of the world good enough to overpower humans).
Incentives: There will be strong incentives to build and deploy APS-AI.
Alignment difficulty: It will be much harder to build APS-AI systems that don’t seek power in unintended ways, than ones that would seek power but are superficially attractive to deploy.
High-impact failures: Some deployed APS-AI systems will seek power in unintended and high-impact ways, collectively causing >$1 trillion in damage.
Disempowerment: Some of the power-seeking will in aggregate permanently disempower all of humanity.
Catastrophe: The disempowerment will constitute an existential catastrophe.
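Since each step is conditional on the previous ones, the bottom-line risk estimate is just the product of the six conditional probabilities. A minimal sketch of that bookkeeping, with placeholder numbers I made up for illustration rather than Carlsmith’s actual estimates:

```python
# Toy version of the Carlsmith-style decomposition: P(catastrophe) is the product
# of six conditional probabilities. All numbers are illustrative placeholders.
conditionals = {
    "timelines":            0.65,  # P(APS-AI possible & financially feasible by 2070)
    "incentives":           0.80,  # P(strong incentives to build/deploy | above)
    "alignment difficulty": 0.40,  # P(much harder to build non-power-seeking APS-AI | above)
    "high-impact failures": 0.65,  # P(>$1T of damage from unintended power-seeking | above)
    "disempowerment":       0.40,  # P(permanent disempowerment of humanity | above)
    "catastrophe":          0.95,  # P(the disempowerment is an existential catastrophe | above)
}

p_doom = 1.0
for step, p in conditionals.items():
    p_doom *= p
    print(f"after {step!r}: cumulative probability = {p_doom:.3f}")
```

Note that the complement of this product is only “no catastrophe via this exact chain”, not “good outcome”, which is the crux below.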
These steps define a tree over possibilities, but the associated outcome buckets don’t feel that reality-carving to me. A recurring crux is that good outcomes are also highly conjunctive, i.e. one of these 6 conditions failing does not give a good AI outcome. Going through piece by piece:
Timelines makes sense and seems like a good criterion; everything else is downstream of timelines.
Incentives seems weird. What does a world with no incentives to deploy APS-AI look like? There are a bunch of incentives that clearly already push people towards this: status, desire for scientific discovery, power, money. Moreover, this condition doesn’t seem necessary for AI x-risk: even if we somehow removed the gigantic incentives to build APS-AI that we know exist, people might still deploy APS-AI because they personally wanted to, even without social incentives to do so.
Alignment difficulty is another condition that isn’t necessary. Some ways of getting x-risk without alignment being very hard:
For one, alignment difficulty is a spectrum, and even if it is on the really low end of that spectrum, perhaps you still need a small amount of extra compute overhead to robustly align your system. One of the RAAP stories might then occur: even though technical alignment is pretty easy, the companies that spend that extra compute robustly aligning their AIs gradually lose out to other companies in the competitive marketplace.
Maybe alignment is easy, but someone misuses AI, say to create an AI-assisted dictatorship.
Maybe we try really hard and can align AI to whatever we want, but we make a bad choice and lock in current-day values, or we make a bad choice of reflection procedure that gives us much less than the ideal value of the universe.
High-impact failures contains much of the structure, at least in my eyes. The main ways we avoid alignment failure are worlds where something happens to take us off the default trajectory:
Perhaps we make a robust coordination agreement between labs/countries that causes people to avoid deploying until they’ve solved alignment
Perhaps we solve alignment and harden the world in some way, e.g. by removing compute access, dramatically improving cybersecurity, and monitoring and shutting down dangerous training runs.
In general, thinking about the likelihood that any of these interventions work feels very important.
Disempowerment. This and (4) are very entangled with upstream things like takeoff shape. Also, it feels extremely difficult for humanity not to be disempowered.
Catastrophe. To avoid this, again, I need to imagine the extra structure upstream of this, e.g. (4) was satisfied by a warning shot, and then people coordinated and deployed a benign sovereign that disempowered humanity for good reasons.
My current preferred way to think about likelihood of AI risk routes through something like this framework, but is more structured and has a tree with more conjuncts towards success as well as doom.
I think a really substantial fraction of people who are doing “AI Alignment research” are instead acting with the primary aim of “make AI Alignment seem legit”. These are not the same goal, a lot of good people can tell and this makes them feel kind of deceived, and also this creates very messy dynamics within the field where people have strong opinions about what the secondary effects of research are, because that’s the primary thing they are interested in, instead of asking whether the research points towards useful true things for actually aligning the AI.
This doesn’t feel right to me; off the top of my head, it does seem like most of the field is just trying to make progress. For most of those that aren’t, it feels like they are pretty explicit about not trying to solve alignment, and also I’m excited about most of the projects. I’d guess like 10-20% of the field is in the “make alignment seem legit” camp. My rough categorization:
Make alignment progress:
Anthropic Interp
Redwood
ARC Theory
Conjecture
MIRI
Most independent researchers that I can think of (e.g. John, Vanessa, Steven Byrnes, the MATS people I know)
Some of the safety teams at OpenAI/DM
Aligned AI
Team Shard
Make alignment seem legit:
CAIS (safe.ai)
Anthropic scaring laws
ARC Evals (arguably, but it seems like this isn’t quite the main aim)
Some of the safety teams at OpenAI/DM
Open Phil (I think I’d consider Cold Takes to be doing this, but it doesn’t exactly brand itself as alignment research)
What am I missing? I would be curious which projects you feel this way about.
I personally benefitted tremendously from the Lightcone offices, especially when I was there over the summer during SERI MATS. Being able to talk to lots of alignment researchers and other aspiring alignment researchers increased my subjective rate of alignment upskilling by >3x relative to before, when I was in an environment without other alignment people.
Thanks so much to the Lightcone team for making the office happen. I’m sad (emotionally, not making a claim here whether it was the right decision or not) to see it go, but really grateful that it existed.
Because you have a bunch of shards, and you need all of them to balance each other out to maintain the ‘appears nice’ property. Even if I can’t predict which ones will be self modified out, some of them will, and this could disrupt the balance.
I expect the shards that are more [consequentialist, powerseeky, care about preserving themselves] to become more dominant over time. These are probably the relatively less nice shards.
These are both handwavy enough that I don’t put much credence in them.
Yeah good point, edited
For any 2 of {reflectively stable, general, embedded}, I can satisfy those properties.
{reflectively stable, general} → do something that just rolls out entire trajectories of the world given different actions it could take, has some utility function/preference ordering over trajectories, and selects the actions that lead to the highest expected utility trajectory (a toy sketch of this is below, after the list).
{general, embedded} → use ML/local search with enough compute to rehash evolution and get smart agents out.
{reflectively stable, embedded} → a sponge or a current day ML system.
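As a crude illustration of the first combination, here is a minimal sketch of the “roll out whole trajectories, score them with a utility function over trajectories, act on the best one” planner. The toy world model, horizon, and utility function are all made up for the example:

```python
from itertools import product

# Toy deterministic world model: the agent lives on a 1-D line, starts at 0,
# and (per the made-up utility below) prefers to reach +3 quickly and stay there.
ACTIONS = [-1, 0, +1]
HORIZON = 4

def rollout(start_state, action_sequence):
    """Simulate an entire trajectory of the (toy) world under an action sequence."""
    state, trajectory = start_state, [start_state]
    for a in action_sequence:
        state += a
        trajectory.append(state)
    return trajectory

def utility(trajectory):
    """Preference ordering over whole trajectories: closer to +3, sooner, is better."""
    return -sum(abs(s - 3) for s in trajectory)

def plan(start_state):
    """Enumerate every action sequence, roll each one out, and take the first action of the best."""
    best_sequence = max(product(ACTIONS, repeat=HORIZON),
                        key=lambda seq: utility(rollout(start_state, seq)))
    return best_sequence[0]

print(plan(0))  # -> 1 (step toward the goal)
```

The exhaustive enumeration is exactly the part that stops making sense once the agent is embedded in the world it is simulating.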
Yeah, this is fair, and later in the section they say:
Which is good, supports your interpretation, and gets close to the thing I want, albeit less explicitly than I would have liked.
I still think the “delay/pause” wording pretty strongly implies that the default is to wait for a short amount of time, and then keep going at the intended capability level. I think there’s some sort of implicit picture that the eval result will become unconcerning in a matter of weeks-months, which I just don’t see the mechanism for short of actually good alignment progress.