Big crux here: I don’t actually expect useful research to occur as a result of my control-critique post. Even after updating on the discussion remaining more civil than I expected, I still expect basically zero people to do anything useful as a result.
As a comparison: I wrote a couple of posts on my AI model delta with Yudkowsky and with Christiano. For each of them, I can imagine changing ~one big piece in my model and ending up with a model which looks basically like theirs.
By contrast, when I read the stuff written on the control agenda… it feels like there is no model there at all. (Directionally correct but probably not quite accurate description:) it feels like whoever’s writing it, or whoever would buy the control agenda, is just kinda pattern-matching natural-language strings without tracking the underlying concepts those strings are supposed to represent. (Joe’s recent post on “fake vs real thinking” feels like it’s pointing at the right thing here; the posts on control feel strongly like “fake” thinking.) And that’s not a problem which gets fixed by engaging at the object level; that type of cognition will mostly not produce useful work, so getting useful work out of such people would require getting them to think in entirely different ways.
… so mostly I’ve tried to argue at a different level, e.g. in the Why Not Just… posts. The goal there isn’t really to engage the sort of people who would otherwise buy the control agenda, but rather to communicate the underlying problems to the sort of people who already instinctively feel something is off about the control agenda, and to give them more useful frames to work with. Because those are the people who have some hope of doing something useful without the whole structure of their cognition needing to change first.
That is indeed what I had in mind when I said we’d need another couple sentences to argue that the agent maximizes expected utility under the distribution. It is less circular than it might seem at first glance, because two importantly different kinds of probabilities are involved: uncertainty over the environment (which is what we’re deriving), and uncertainty over the agent’s own actions arising from mixed strategies.
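To make the non-circularity concrete, here’s a minimal sketch of the kind of argument I have in mind (a standard complete-class-style setup; the notation is mine, not something we pinned down in the original discussion). Take a finite set of environments $\Theta$, a finite set of actions $A$, and a utility function $u : A \times \Theta \to \mathbb{R}$. A mixed strategy $\pi \in \Delta(A)$ gets the payoff vector

$$V_\pi := \Big( \textstyle\sum_{a \in A} \pi(a)\, u(a,\theta) \Big)_{\theta \in \Theta}.$$

If $\pi$ is Pareto-optimal across environments (no $\pi'$ with $V_{\pi'} \geq V_\pi$ coordinatewise and strictly better in some coordinate), then, since $\{V_{\pi'} : \pi' \in \Delta(A)\}$ is convex, a supporting-hyperplane argument gives a distribution $p \in \Delta(\Theta)$ such that

$$\pi \in \arg\max_{\pi'} \; \sum_{\theta \in \Theta} p(\theta) \sum_{a \in A} \pi'(a)\, u(a,\theta).$$

Here $p(\theta)$ is the environment-uncertainty we’re deriving, while $\pi(a)$ is the agent’s own randomization over actions, which we take as given from its behavior. Expectations over $\pi$ show up in the statement from the start, but the distribution being derived is $p$, so the argument doesn’t presuppose what it’s proving.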