TBC, I don’t particularly expect hard constraints to show up, that was more a way of illustrating the underlying concept. The same underlying concept in the market-style picture would be: across many different top-level goals, there are convergent ways of carving up “property rights”. So, a system can be generally corrigible by “respecting the convergent property rights”, so to speak.
40 standard deviations away from natural human IQ would have an IQ of 600
Nitpick: 700.
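(Assuming the standard mean-100, SD-15 scale: 100 + 40 × 15 = 700.)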
Yup, exactly, and good job explaining it too.
Yup, I’m familiar with that one. The big difference is that I’m backward-chaining, whereas that post forward chains; the hope of backward chaining would be to identify big things which aren’t on peoples’ radar as nootropics (yet).
(Relatedly: if one is following this sort of path, step 1 should be a broad nutrition panel and supplementing anything in short supply, before we get to anything fancier.)
Here’s a side project David and I have been looking into, which others might have useful input on...
Background: Thyroid & Cortisol Systems
As I understand it, thyroid hormone levels are approximately-but-accurately described as the body’s knob for adjusting “overall metabolic rate” or the subjective feeling of needing to burn energy. Turn up the thyroid knob, and people feel like they need to move around, bounce their leg, talk fast, etc (at least until all the available energy sources are burned off and they crash). Turn down the thyroid knob, and people are lethargic.
That sounds like the sort of knob which should probably typically be set higher, today, than was optimal in the ancestral environment. Not cranked up to 11; hyperthyroid disorders are in fact dangerous and unpleasant. But at least set to the upper end of the healthy range, rather than the lower end.
… and that’s nontrivial. You can just dump the relevant hormones (T3/T4) into your body, but there’s a control system which tries to hold the level constant. Over the course of months, the thyroid gland (which normally produces T4) will atrophy, as the control loop dials back the gland’s output to try to keep T4 levels fixed. Just continuing to pump T3/T4 into your system regularly will keep you healthy—you’ll basically have a hypothyroid disorder, and supplemental T3/T4 is the standard treatment. But you’d better be ready to manually control your thyroid hormone levels indefinitely if you start down this path. Ideally, one would intervene further up the control loop in order to adjust the thyroid hormone set-point, but that’s more of a research topic than a thing humans already have lots of experience with.
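To make the “control system fights you” point concrete, here’s a toy sketch (not a physiological model; the set-point, clearance, and gain numbers are made up) of a generic negative-feedback loop when an external dose of the regulated hormone is added: the measured level barely moves, while the endogenous output quietly shrinks toward zero.

```python
# Toy negative-feedback loop (a sketch, not a physiological model; parameters are made up).
# H = circulating hormone level, G = endogenous (gland) output.
# The controller nudges G to push H back toward the set-point.

SET_POINT = 1.0   # target hormone level (arbitrary units)
CLEARANCE = 0.1   # fraction of circulating hormone cleared each step
GAIN = 0.05       # how hard the controller corrects deviations from the set-point

def simulate(exogenous_dose, steps=5000):
    H, G = SET_POINT, SET_POINT * CLEARANCE          # start at the no-supplement equilibrium
    for _ in range(steps):
        G = max(0.0, G + GAIN * (SET_POINT - H))     # gland output shrinks when H runs high
        H += G + exogenous_dose - CLEARANCE * H      # production + dose - clearance
    return H, G

for dose in (0.0, 0.05, 0.1):
    H, G = simulate(dose)
    print(f"dose={dose:.2f}  hormone level={H:.2f}  endogenous output={G:.3f}")
```

That’s the “gland atrophies, and now you’re manually running the loop” story in miniature.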
So that’s thyroid. We can tell a similar story about cortisol.
As I understand it, the cortisol hormone system is approximately-but-accurately described as the body’s knob for adjusting/tracking stress. That sounds like the sort of knob which should probably be set lower, today, than was optimal in the ancestral environment. Not all the way down; problems would kick in. But at least set to the lower end of the healthy range.
… and that’s nontrivial, because there’s a control loop in place, etc. Ideally we’d intervene on the relatively-upstream parts of the control loop in order to change the set point.
We’d like to generalize this sort of reasoning, and ask: what are all the knobs of this sort which we might want to adjust relative to their ancestral environment settings?
Generalization
We’re looking for signals which are widely broadcast throughout the body, and received by many endpoints. Why look for that type of thing? Because the wide usage puts pressure on the signal to “represent one consistent thing”. It’s not an accident that there are individual hormonal signals which are approximately-but-accurately described by the human-intuitive phrases “overall metabolic rate” or “stress”. It’s not an accident that those hormones’ signals are not hopelessly polysemantic. If we look for widely-broadcast signals, then we have positive reason to expect that they’ll be straightforwardly interpretable, and therefore the sort of thing we can look at and (sometimes) intuitively say “I want to turn that up/down”.
Furthermore, since these signals are widely broadcast, they’re the sort of thing which impacts lots of stuff (and is therefore impactful to intervene upon). And they’re relatively easy to measure, compared to “local” signals.
The “wide broadcast” criterion helps focus our search a lot. For instance, insofar as we’re looking for chemical signals throughout the whole body, we probably want species in the bloodstream; that’s the main way a concentration could be “broadcast” throughout the body, rather than being a local signal. So, basically endocrine hormones.
Casting a slightly wider net, we might also be interested in:
Signals widely broadcast through the body by the nervous system.
Chemical signals widely broadcast through the brain specifically (since that’s a particularly interesting/relevant organ).
Non-chemical signals widely broadcast through the brain specifically.
… and of course for all of these there will be some control system, so each has its own tricky question about how to adjust it.
Some Promising Leads, Some Dead Ends
With some coaxing, we got a pretty solid-sounding list of endocrine hormones out of the LLMs. There were some obvious ones on the list, including thyroid and cortisol systems, sex hormones, and pregnancy/menstruation signals. There were also a lot of signals for homeostasis of things we don’t particularly want to adjust: salt balance, calcium, digestion, blood pressure, etc. There were several inflammation and healing signals, which we’re interested in but haven’t dug into yet. And then there were some cool ones: oxytocin (think mother-child bonding), endocannabinoids (think pot), satiety signals (think Ozempic). None of those really jumped out as clear places to turn a knob in a certain direction, other than obvious things like “take Ozempic if you are even slightly overweight” and the two we already knew about (thyroid and cortisol).
Then there were neuromodulators. Here’s the list we coaxed from the LLMs:
Dopamine: Tracks expected value/reward—how good things are compared to expectations.
Norepinephrine: Sets arousal/alertness level—how much attention and energy to devote to the current situation.
Serotonin: Regulates resource availability mindset—whether to act like resources are plentiful or scarce. Affects patience, time preference, and risk tolerance.
Acetylcholine: Controls signal-to-noise ratio in neural circuits—acts like a gain/precision parameter, determining whether to amplify precise differences (high ACh) or blur things together (low ACh).
Histamine: Manages the sleep/wake switch—promotes wakefulness and suppresses sleep when active.
Orexin: Acts as a stability parameter for brain states—increases the depth of attractor basins and raises transition barriers between states. Higher orexin = stronger attractors = harder to switch states.
Of those, serotonin immediately jumps out as a knob you’d probably want to turn to the “plentiful resources” end of the healthy spectrum, compared to the ancestral environment. That puts the widespread popularity of SSRIs in an interesting light!
Moving away from chemical signals, brain waves (alpha waves, theta oscillations, etc) are another potential category—they’re oscillations at particular frequencies which (supposedly) are widely synced across large regions of the brain. I read up just a little, and so far have no idea how interesting they are as signals or targets.
Shifting gears, the biggest dead end so far has been parasympathetic tone, i.e. overall activation level of the parasympathetic nervous system. As far as I can tell, parasympathetic tone is basically Not A Thing: there are several different ways to measure it, and the different measurements have little correlation. It’s probably more accurate to think of parasympathetic nervous activity as localized, without much meaningful global signal.
Anybody see obvious things we’re missing?
However, the corrigibility-via-instrumental-goals does have the feel of “make the agent uncertain regarding what goals it will want to pursue next”.
That’s an element, but not the central piece. The central piece (in the subagents frame) is about acting-as-though there are other subagents in the environment which are also working toward your terminal goal, so you want to avoid messing them up.
The “uncertainty regarding the utility function” enters here mainly when we invoke instrumental convergence, in hopes that the subagent can “act as though other subagents are also working toward its terminal goal” in a way agnostic to its terminal goal. Which is a very different role than the old “corrigibility via uncertainty” proposals.
Note that the instrumental goal is importantly distinct from the subagent which pursues that instrumental goal. I think a big part of the insight in this post is to say “corrigibility is a property of instrumental goals, separate from the subagents which pursue those goals”; we can study the goals (i.e. problem factorization) rather than the subagents in order to understand corrigibility.
I think this misunderstands the idea, mainly because it’s framing things in terms of subagents rather than subgoals. Let me try to illustrate the picture in my head. (Of course at this stage it’s just a hand-wavy mental picture, I don’t expect to have the right formal operationalization yet.)
Imagine that the terminal goal is some optimization problem. Each instrumental goal is also an optimization problem, with a bunch of constraints operationalizing the things which must be done to avoid interfering with other subgoals. The instrumental convergence we’re looking for here is mainly in those constraints; we hope to see that roughly the same constraints show up in many instrumental goals for many terminal goals. Insofar as we see convergence in the constraints, we can forget about the top-level goal, and expect that a (sub)agent which respects those constraints will “play well” in an environment with other (sub)agents trying to achieve other instrumental and/or terminal goals.
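One hand-wavy way to write that down (a rough sketch, not a formal operationalization):

$$\text{terminal goal: } \max_x f(x) \qquad\qquad \text{instrumental subgoal } i\text{: } \max_{x_i} f_i(x_i) \;\; \text{s.t.} \;\; c(x_i) \le 0$$

where the hope is that roughly the same constraint set $c$ shows up across the subgoals of many different terminal goals $f$, so an agent optimizing its subgoal subject to $c$ “plays well” without needing to know which terminal goal it’s ultimately serving.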
Then, addressing this part specifically:
I predict such an agent is relatively easy to make, and will convert the universe into batteries/black holes, computers, and robots. I fail to see why it would respect agents with other terminal goals.
… that would only happen insofar as converting the universe into batteries, computers and robots can typically be done without interfering with other subgoals, for a wide variety of terminal objectives. If it does interfere with other subgoals (for a wide variety of terminal objectives), then the constraints would say “don’t do that”.
And to be clear, maybe there would be some large-scale battery/computer/robot building! But it would be done in a way which doesn’t step on the toes of other subplans, and makes the batteries/computers/robots readily available and easy to use for those other subplans.
As in, the claim is that there is almost always a “schelling” mistake? Or is the argument mostly that scheming is largely unimportant because false answers will naturally perform much better than true answers in important cases such that considering the adversarial case doesn’t make much difference?
The latter.
Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals
To be clear, I am not claiming that this failure mode is very likely to be very hard to resolve. Just harder than “run it twice on the original question and a rephrasing/transformation of the question”.
But if you ask the AI for a justification of its reasoning, and run it twice on the original question and a rephrasing/transformation of the question, it seems way intuitively hard for it to come up with consistent explanations that lead to consistent shared wrong answers (e.g. I think Von Neumann would have substantial trouble with that).
Strong disagree with this. Probably not the most cruxy thing for us, but I’ll briefly walk through the issue because it is a special case of a common cruxy issue with a lot of prosaic alignment work.
The reason this doesn’t work is that the prototypical “blatant lie” doesn’t look like “the model chooses a random number to output”. The prototypical blatant lie is that there’s a subtle natural mistake one could make in reasoning about the question, the model “knows” that it’s a mistake, but the model just presents an argument with the subtle mistake in it.
Or, alternatively: the model “knows” what kind of argument would seem most natural to the user, and presents that kind of argument despite “knowing” that it systematically overlooks major problems.
Or, alternatively: human experts in the field generally believe a particular wrong answer. Maybe there’s a few dissenters, and the model is clever enough to recognize that the dissenters are correct, but also the model was RLHF’d and got positive feedback for matching supposed-experts’ answers even when those experts were wrong, so it outputs the answer and arguments the human experts in the field would generally give.
These sorts of pressures make it not only easy for the model to generate consistent shared wrong answers, but even (in some situations, like the incorrect-human-experts example) the default outcome, even when the model internally “knows” the right answer in some sense.
The first three aren’t addressable by any technical research or solution. Corporate leaders might be greedy, hubristic, and/or reckless. Or human organizations might not be nimble enough to effect development and deployment of the maximum safety we are technically capable of. No safety research portfolio addresses those risks. The other four are potential failures by us as a technical community that apply broadly. If too high a percentage of the people in our space are bad statisticians, can’t think distributionally, are lazy or prideful, or don’t understand causal reasoning well enough, that will doom all potential directions of AI safety research, not just AI control.
Technical research can have a huge impact on these things! When a domain is well-understood in general (think e.g. electrical engineering), it becomes far easier and cheaper for human organizations to successfully coordinate around the technical knowledge, for corporate leaders to use it, for bureaucracies to build regulations based on its models, for mid researchers to work in the domain without deceiving or confusing themselves, etc. But that all requires correct and deep technical understanding first.
Now, you are correct that a number of other AI safety subfields suffer from the same problems to some extent. But that’s a different discussion for a different time.
I think part of John’s belief is more like “the current Control stuff won’t transfer to the society of Von Neumanns.”
That is also separately part of my belief, medium confidence, depends heavily on which specific thing we’re talking about. Notably that’s more a criticism of broad swaths of prosaic alignment work than of control specifically.
Yeah, basically. Or at least unlikely that they’re scheming enough or competently enough for it to be the main issue.
For instance, consider today’s AIs. If we keep getting slop at roughly the current level, and scheming at roughly the current level, then slop is going to be the far bigger barrier to using these things to align superintelligence (or nearer-but-strong intelligence).
No, more like a disjunction of possibilities along the lines of:
The critical AGIs come before huge numbers of von Neumann level AGIs.
At that level, really basic stuff like “just look at the chain of thought” turns out to still work well enough, so scheming isn’t a hard enough problem to be a bottleneck.
Scheming turns out to not happen by default in a bunch of von Neumann level AGIs, or is at least not successful at equilibrium (e.g. because the AIs don’t fully cooperate with each other).
“huge numbers of von Neumann level AGIs” and/or “scheming” turns out to be the wrong thing to picture in the first place, the future is Weirder than that in ways which make our intuitions about von Neumann society and/or scheming not transfer at all.
Pile together the probability mass on those sorts of things, and it seems far more probable than the prototypical scheming story.
That’s where most of the uncertainty is; I’m not sure how best to price it in (though my gut has priced in some estimate).
I think control research has relatively little impact on X-risk in general, and wrote up the case against here.
Basic argument: scheming of early transformative AGI is not a very large chunk of doom probability. The real problem is getting early AGI to actually solve the problems of aligning superintelligences, before building those superintelligences. That’s a problem for which verification is hard, and solving the problem itself seems pretty hard too, so it’s a particularly difficult type of problem to outsource to AI—and a particularly easy type of problem to trick oneself into thinking the AI has solved, when it hasn’t.
Addendum: Why Am I Writing This?
Because other people asked me to. I don’t particularly like getting in fights over the usefulness of other peoples’ research agendas; it’s stressful and distracting and a bunch of work, I never seem to learn anything actually useful from it, it gives me headaches, and nobody else ever seems to do anything useful as a result of such fights. But enough other people seemed to think this was important to write, so Maybe This Time Will Be Different.
Just because the number of almost-orthogonal vectors in d dimensions scales exponentially with d, doesn’t mean one can choose all those signals independently. We can still only choose d real-valued signals at a time (assuming away the sort of tricks by which one encodes two real numbers in a single real number, which seems unlikely to happen naturally in the body). So “more intended behaviors than input-vector components” just isn’t an option, unless you’re exploiting some kind of low-information-density in the desired behaviors (like e.g. very “sparse activation” of the desired behaviors, or discreteness of the desired behaviors to a limited extent).
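For concreteness, a quick numerical sketch of the distinction (the values of d and n here are arbitrary): random unit vectors in d dimensions are pairwise nearly orthogonal even when there are far more than d of them, but the whole collection still has rank at most d, so only d of the corresponding signal strengths can be chosen independently.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 100, 2000                                 # far more directions than dimensions

V = rng.standard_normal((n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)    # n random unit vectors in R^d

cos = V @ V.T                                    # pairwise cosine similarities
np.fill_diagonal(cos, 0.0)
print("max |cos| between distinct directions:", float(np.abs(cos).max()))  # well below 1
print("rank of the collection:", np.linalg.matrix_rank(V))                 # at most d
```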