The meme of “current alignment work isn’t real work” seems to often be supported by an (AFAICT baseless) assumption that LLMs have, or will have, homunculi with “true goals” which aren’t actually modified by present-day RLHF/feedback techniques. Thus, labs aren’t tackling “the real alignment problem”, because they’re “just optimizing the shallow behaviors of models.” Pressed for justification of this confident “goal” claim, proponents might link to some handwavy speculation about simplicity bias (which is in fact quite hard to reason about, in the NN prior), or they might start talking about evolution (which is pretty unrelated to technical alignment, IMO).
Are there any homunculi today? I’d say “no”, as far as our limited knowledge tells us! But, as with biorisk, one can always handwave at future models. It doesn’t matter that present models don’t exhibit signs of homunculi which are immune to gradient updates, because, of course, future models will.
Quite a strong conclusion being drawn from quite little evidence.
My model says that general intelligence[1] is just inextricable from “true-goal-ness”. It’s not that I think homunculi will coincidentally appear as some side-effect of capability advancement — it’s that the capabilities the AI Labs want necessarily route through somehow incentivizing NNs to form homunculi. The homunculi will appear inasmuch as the labs are good at their jobs.
Said model is based on analyses of how humans think and how human cognition differs from animal/LLM cognition, plus reasoning about what a general-intelligence algorithm must look like given the universe’s structure. Both kinds of evidence are hardly ironclad; you certainly can’t publish an ML paper based on them. But that’s the whole problem with AGI risk, isn’t it.
Internally, though, the intuition is fairly strong. And in its defense, it is based on trying to study the only known type of entity with the kinds of capabilities we’re worrying about. I heard that’s a good approach.
In particular, I think it’s a much better approach than trying to draw lessons from studying contemporary ML models, which empirically do not yet exhibit said capabilities.
homunculi with “true goals” which aren’t actually modified by present-day RLHF/feedback techniques
It’s not that they’re not modified, it’s that they’re modified in ways that aren’t neatly predictable from the shallow behaviors we’re trying to shape.
My past model still approximately holds. I don’t think AGI-level ML models would have hard-wired objectives (much like humans don’t). I think they’d manually synthesize objectives for themselves by reflecting on their shards/instincts (much like humans do!). And that process is something RLHF-style stuff impacts… but the problem is, the outcome of value reflection is deeply unstable.
It’s theoretically possible to align a post-value-reflection!AGI by interfering on its shards/instincts before it even starts doing moral philosophy. But it’s an extremely finicky process, and succeeding at alignment via this route would require an insanely precise understanding of how value reflection works.
And it’s certainly not as easy as “just make one of the shards value humanity!”; and it’s certainly not something that could be achieved if the engineers working on alignment don’t even believe that moral philosophy is a thing.
Edit: I expect your counter would be “how are humans able to align other humans, then, without an ‘insanely precise’ understanding”? There’s a bunch of disanalogies here:
1. Humans are much more similar to each other than to a mind DL is likely to produce. We likely have a plethora of finicky biases built into our reward circuitry that tightly narrow down the kinds of values we learn.
2. The architecture also matters. Brains are very unlike modern ML models, and we have no idea which differences are load-bearing.
3. Humans know how human minds work, and how outward behaviors correspond to the decisions a human made about the direction of their value reflection. We can more or less interfere on other humans’ moral philosophy directly, not only through the proxy of shards. We won’t be as well-informed about the mapping between an AI’s behavior and its private moral-philosophical musings.
4. Human children are less powerful and less well-informed than the adults trying to shape them.
Now, again, in principle all of this is solvable. We can study the differences between brains and ML models to address (1) and (2), we can develop interpretability techniques to address (3), we can figure out a training setup that reliably trains an AGI to precisely the human capability level (neither stopping way short of it, nor massively overshooting) to address (4), and then we can align our AGI more or less the way we do human children.
But, well. Hit me up when those trivial difficulties are solved. (No, but seriously. If a scheme addresses those, I’d be optimistic about it.)
(And there’s also the sociopolitical component — all of this would need to happen in an environment where the AI Lab is allowed to proceed in this slow way of making the AI human-level, shaping its morals manually, etc., rather than immediately cranking it up to superintelligence.)
My model says that general intelligence[1] is just inextricable from “true-goal-ness”. It’s not that I think homunculi will coincidentally appear as some side-effect of capability advancement — it’s that the capabilities the AI Labs want necessarily route through somehow incentivizing NNs to form homunculi. The homunculi will appear inasmuch as the labs are good at their jobs.
I’ve got strong doubts about the details of this. At the high level, I’d agree that strong/useful systems that get built will express preferences over world states like those that could arise from such homunculi, but I expect that implementations that focus on inducing a homunculus directly through (techniques similar to) RL training with sparse rewards will underperform more default-controllable alternatives.
My reasoning would be that we’re bad at using techniques like RL with a sparse reward to reliably induce any particular behavior. We can get it to work sometimes with denser reward (e.g. reward shaping) or by relying on a beefy pre-existing world model, but the default outcome is that sparse and distant rewards in a high dimensional space just don’t produce the thing we want. When this kind of optimization is pushed too far, it’s not merely dangerous; it’s useless.
I don’t think this is temporary ignorance about how to do RL (or things with similar training dynamics). It’s fundamental:
Sparse and distant reward functions in high dimensional spaces give the optimizer an extremely large space to roam. Without bounds, the optimizer is effectively guaranteed to find something weird.
For almost any nontrivial task we care about, a satisfactory reward function takes a dependency on large chunks of human values. The huge mess of implicit assumptions, common sense, and desires of humans are necessary bounds during optimization. This comes into play even at low levels of capability like ChatGPT.
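The first point above can be made concrete with a toy hill-climbing experiment (my own illustrative construction; the setup, names, and numbers are all assumptions, not from this discussion): a dense reward guides every step toward the target, while a sparse reward that fires only inside a tiny ball around the target is flat almost everywhere, so the optimizer never receives a usable signal.

```python
import random

# Toy illustration (all names/numbers here are illustrative assumptions):
# hill-climbing on a 10-dimensional point. A dense reward (negative squared
# distance to the target) guides every step; a sparse reward (1 only within
# a tiny ball around the target, 0 elsewhere) is flat almost everywhere,
# so the climber never gets a signal telling it where to move.

DIM, STEPS = 10, 2000
TARGET = [0.5] * DIM

def dense_reward(x):
    return -sum((a - b) ** 2 for a, b in zip(x, TARGET))

def sparse_reward(x):
    return 1.0 if -dense_reward(x) < 0.01 else 0.0

def hill_climb(reward, seed=0):
    rng = random.Random(seed)
    x = [rng.uniform(-1.0, 1.0) for _ in range(DIM)]
    best = reward(x)
    for _ in range(STEPS):
        candidate = [a + rng.gauss(0.0, 0.05) for a in x]
        r = reward(candidate)
        if r > best:  # only move when the reward signal improves
            x, best = candidate, r
    return x

x_dense = hill_climb(dense_reward)    # steadily approaches TARGET
x_sparse = hill_climb(sparse_reward)  # never sees a nonzero reward; stays put
```

With the same seed, both runs start from the same point; the dense-reward run ends far closer to the target, while the sparse-reward run never accepts a single step. Real RL setups are less stark than this caricature, but the underlying issue is the same one reward shaping tries to patch.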
Conspicuously, the source of the strongest general capabilities we have arises from training models with an extremely constraining optimization target. The “values” that can be expressed in pretrained predictors are forced into conditionalization as a direct and necessary part of training; for a reasonably diverse dataset, the resulting model can’t express unconditional preferences regarding external world states. While it’s conceivable that some form of “homunculi” could arise, their ability to reach out of their appropriate conditional context is directly and thoroughly trained against.
In other words, the core capabilities of the system arise from a form of training that is both densely informative and blocks the development of unconditional values regarding external world states in the foundational model.
Better forms of fine-tuning, conditioning, and activation interventions (the best versions of each, I suspect, will have deep equivalences) are all built on the capability of that foundational system, and can be directly used to aim that same capability. Learning the huge mess of human values is a necessary part of its training, and its training makes eliciting the relevant part of those values easier—that necessarily falls out of being a machine strongly approximating Bayesian inference across a large dataset.
The final result of this process (both pretraining and conditioning or equivalent tuning) is still an agent that can be described as having unconditional preferences about external world states, but the path to get there strikes me as dramatically more robust both for safety and capability.
Summarizing a bit: I don’t think it’s required to directly incentivize NNs to form value-laden homunculi, and many of the most concerning paths to forming such homunculi seem worse for capabilities.
I expect that implementations that focus on inducing a homunculus directly through (techniques similar to) RL training with sparse rewards will underperform
Sure, but I never said we’d be inducing homunculi using this approach? Indeed, given that it doesn’t work for what sounds like fundamental reasons, I expect it’s not the way.
I don’t know how that would be done. I’m hopeful the capability is locked behind a Transformer-level or even a Deep-Learning-level novel insight, and won’t be unlocked for a decade yet. But I predict that the direct result of it will be a workable training procedure that somehow induces homunculi. It may look nothing like what we do today.
Conspicuously, the source of the strongest general capabilities we have arises from training models with an extremely constraining optimization target
Sure! Human values are not arbitrary either; they, too, are very heavily constrained by our instincts. And yet, humans still sometimes become omnicidal maniacs, Hell-worshipers, or sociopathic power-maximizers. How come?
These constraints are not actually sufficient. The constraints placed by human values still have the aforementioned things in their outcome space, and an AI model will have different constraints, widening (from our perspective) that space further. My point about “moral philosophy is unstable” is that we need to hit an extremely narrow target, and the tools people propose (intervening on shards/instincts) are as steady as the hands of a sniper during a magnitude-9 earthquake.
A homunculus needs to be able to nudge these constraints somehow, for it to be useful, and its power grows the more it’s able to disregard them.
If humans were implacably bound by instincts, they’d never have invented technology or higher-level social orders, because their instincts would’ve made them run away from fires and refuse to cooperate with foreign tribes. And those instincts are still at play — reasonable fears and xenophobia — but we can push past them at times.
More generally, the whole point of there being a homunculus is that it’d be able to rewrite or override the extant heuristics to better reflect the demands of whatever novel situation it’s in. It needs to be able to do that.
These constraints do not generalize as fast as a homunculus’ understanding goes. The constraints are defined over some regions of the world-model, like “a society”; if the AI changes ontologies, and starts reasoning about the society by proxy, as e. g. a game-theoretic gameboard, they won’t transfer there, and won’t suffice to rule out omnicidal outcomes. (See e. g. the Deep Deceptiveness story.)
I’ve been pointed to temporal-difference learning as a supposed solution to that — that the shards could automatically learn to oppose such outcomes by tracing the causal connection between the new ontology and the ontology over which they’re defined. I remain unconvinced, because that’s not how it works in humans: feeding someone a new philosophical or political framework can make them omnicidal even if they’re otherwise a nice person.
(E. g., consider people parroting things like “maybe it’s Just if humanity goes extinct, after what we did to nature” and not mentally connecting it to “I want my children to die”.)
Summarizing: I fully agree that the homunculus will be under some heavy constraints! But (a) those constraints are not actually strict enough to steer it from the omnicidal outcomes, (b) they can be outright side-stepped by ontology shifts, and (c) the homunculus’ usefulness is in some ways dependent on its ability to side-step or override them.
I think we’re using the word “constraint” differently, or at least in different contexts.
Sure! Human values are not arbitrary either; they, too, are very heavily constrained by our instincts. And yet, humans still sometimes become omnicidal maniacs, Hell-worshipers, or sociopathic power-maximizers. How come?
In terms of the type and scale of optimization constraint I’m talking about, humans are extremely unconstrained. The optimization process represented by our evolution is way out there in terms of sparsity and distance. Not maximally so—there are all sorts of complicated feedback loops in our massive multiagent environment—but it’s nothing like the value constraints on the subset of predictors I’m talking about.
To be clear, I’m not suggesting “language models are tuned to be fairly close to our values.” I’m making a much stronger claim that the relevant subset of systems I’m referring to cannot express unconditional values over external world states across anything resembling the training distribution, and that developing such values out of distribution in a coherent goal directed way practically requires the active intervention of a strong adversary. In other words:
A homunculus needs to be able to nudge these constraints somehow, for it to be useful, and its power grows the more it’s able to disregard them.
...
These constraints do not generalize as fast as a homunculus’ understanding goes.
Further, this type of constraint isn’t the same thing as a limitation of capability. In this context, with respect to the training process, bypassing these kinds of constraints is kind of like a car bypassing having-a-functioning-engine. Every training sample is a constraint on what can be expressed locally, but it’s also information about what should be expressed. They are what the machine of Bayesian inference is built out of.
In other words, the hard optimization process is contained to a space where we can actually have reasonable confidence that inner alignment with the loss is the default. If this holds up, turning up the optimization on this part doesn’t increase the risk of value drift or surprises, it just increases foundational capability.
The ability to use that capability to aim itself is how the foundation becomes useful. The result of this process need not result in a coherent maximizer over external world states, nor does it necessarily suffer from coherence death spirals driving it towards being a maximizer. It allows incremental progress.
(That said: this is not a claim that all of alignment is solved. These nice properties can be broken, and even if they aren’t, the system can be pointed in catastrophic directions. An extremely strong goal agnostic system like this could be used to build a dangerous coherent maximizer (in a nontrivial sense); doing so is just not convergent or particularly useful.)
(Haven’t read your post yet, plan to do so later.)
I think we’re using the word “constraint” differently, or at least in different contexts.
I’m using it to mean “an optimization constraint on actions/plans that correlated well with good performance on the training dataset; a useful heuristic”. E. g., if the dataset involved a lot of opportunities to murder people, but we thumbs-downed the AI every time it took them, the AI would learn a shard/a constraint like “killing people is bad”, which will rule out such actions from the AI’s consideration. Specifically, the shard would trigger in response to detecting some conditions in which the AI previously could but shouldn’t kill people, and constrain the space of possible action-plans such that it doesn’t contain homicide.
It is, indeed, not a way to hinder capabilities, but the way capabilities are implemented. Such constraints are, for example, the reason our LLMs are able to produce coherent speech at all, rather than just babbling gibberish.
… and yet this would still get in the way of qualitatively more powerful capabilities down the line, and a mind that can’t somehow slip these constraints won’t be a general intelligence.
Consider traditions and rituals vs. science. For a medieval human mind, following traditional techniques is how their capabilities are implemented — a specific way of chopping wood, a specific way of living, etc. However, the meaningful progress is often only achieved by disregarding traditions — by following a weird passion to study and experiment instead of being a merchant, or by disregarding the traditional way of doing something in favour of a more efficient way you stumbled upon. It’s the difference between mastering the art of swinging an axe (self-improvement, but only in the incremental ways the implacable constraint permits) vs. inventing a chainsaw.
Similar with AI. The constraints of the aforementioned format aren’t only values-type constraints[1] — they’re also constraints on “how should I do math?” and “if I want to build a nuclear reactor, how do I do it?” and “if I want to achieve my goals, what steps should I take?”. By default, those would be constrained to be executed the way humans execute them, the way the AI was shown to do it during the training. But the whole point of an AGI is that it should be able to invent better solutions than ours. More efficient ways of thinking, weird super-technological replacements for our construction techniques, etc.
I buy that a specific training paradigm can’t result in systems that’d be able to slip their constraints. But if so, that just means that paradigm won’t result in an AGI. As they say, one man’s modus ponens is another man’s modus tollens.
(How an algorithm would be able to slip these constraints and improve on them rather than engaging in chaotic self-defeating behavior is an unsolved problem. But, well: humans do that.)
In fact, my model says there’s no fundamental typological difference between “a practical heuristic on how to do a thing” and “a value” at the level of algorithmic implementation. It’s only in the cognitive labels we-the-general-intelligences assign them.
I’m using it to mean “an optimization constraint on actions/plans that correlated well with good performance on the training dataset; a useful heuristic”.
Alright, this is pretty much the same concept then, but the ones I’m referring to operate at a much lower and tighter level than thumbs-downing murder-proneness.
So...
Such constraints are, for example, the reason our LLMs are able to produce coherent speech at all, rather than just babbling gibberish.
Agreed.
… and yet this would still get in the way of qualitatively more powerful capabilities down the line, and a mind that can’t somehow slip these constraints won’t be a general intelligence.
While I agree these claims probably hold for the concrete example of thumbs-downing an example of murderproneness, I don’t see how they hold for the lower-level constraints that imply the structure of its capability. Slipping those constraints looks more like babbling gibberish.
By default, those would be constrained to be executed the way humans execute them, the way the AI was shown to do it during the training. But the whole point of an AGI is that it should be able to invent better solutions than ours. More efficient ways of thinking, weird super-technological replacements for our construction techniques, etc.
While it’s true that an AI probably isn’t going to learn true things which are utterly divorced from and unimplied by the training distribution, I’d argue that the low-level constraints I’m talking about both leave freedom for learning wildly superhuman internal representations and directly incentivize it during extreme optimization. An “ideal predictor” wouldn’t automatically start applying these capabilities towards any particular goal involving external world states by default, but it remains possible to elicit those capabilities incrementally.
Making the claim more concise: it seems effectively guaranteed that the natural optimization endpoint of one of these architectures would be plenty general to eat the universe if it were aimed in that direction. That process wouldn’t need to involve slipping any of the low-level constraints.
I’m guessing the disconnect between our models is where the aiming happens. I’m proposing that the aiming is best (and convergently) handled outside the scope of wildly unpredictable and unconstrained optimization processes. Instead, it takes place at a level where a system of extreme capability infers the gaps in specifications and applies conditions robustly. The obvious and trivial version of this is conditioning through prompts, but this is a weak and annoying interface. There are other paths that I suspect bottom out at equivalent power/safety yet should be far easier to use in a general way. These paths allow incremental refinement by virtue of not automatically summoning up incorrigible maximizers by default.
If the result of refinement isn’t an incorrigible maximizer, then slipping the higher level “constraints” of this aiming process isn’t convergent (or likely), and further, the nature of these higher-level constraints would be far more thorough than anything we could achieve with RLHF.
In fact, my model says there’s no fundamental typological difference between “a practical heuristic on how to do a thing” and “a value” at the level of algorithmic implementation. It’s only in the cognitive labels we-the-general-intelligences assign them.
That’s pretty close to how I’m using the word “value” as well. Phrased differently, it’s a question of how the agent’s utilities are best described (with some asterisks around the non-uniqueness of utility functions and whatnot), and observable behavior may arise from many different implementation strategies—values, heuristics, or whatever.
While I agree these claims probably hold for the concrete example of thumbs-downing an example of murderproneness, I don’t see how they hold for the lower-level constraints that imply the structure of its capability. Slipping those constraints looks more like babbling gibberish.
Hm, I think the basic “capabilities generalize further than alignment” argument applies here?
I assume that by “lower-level constraints” you mean correlations that correctly capture the ground truth of reality, not just the quirks of the training process. Things like “2+2=4”, “gravity exists”, and “people value other people”; as contrasted with “it’s bad if I hurt people” or “I must sum numbers up using the algorithm that humans gave me, no matter how inefficient it is”.
Slipping the former type of constraints would be disadvantageous for ~any goal; slipping the latter type would only disadvantage a specific category of goals.
But since they’re not, at the outset, categorized differently at the level of cognitive algorithms, a nascent AGI would experiment with slipping both types of constraints. The difference is that it’d quickly start sorting them into “ground-truth” vs. “value-laden” bins manually, and afterwards it’d know it can safely ignore stuff like “no homicides!” while consciously obeying stuff like “the axioms of arithmetic”.
Instead, it takes place at a level where a system of extreme capability infers the gaps in specifications and applies conditions robustly. The obvious and trivial version of this is conditioning through prompts, but this is a weak and annoying interface. There are other paths that I suspect bottom out at equivalent power/safety yet should be far easier to use in a general way
Hm, yes, I think that’s the crux. I agree that if we had an idealized predictor/a well-formatted superhuman world-model on which we could run custom queries, we would be able to use it safely. We’d be able to phrase queries using concepts defined in the world-model, including things like “be nice”, and the resultant process (1) would be guaranteed to satisfy the query’s constraints, and (2) likely (if correctly implemented) wouldn’t be “agenty” in ways that try to e. g. burst out of the server farm on which it’s running to eat the world.
Does that align with what you’re envisioning? If yes, then our views on the issue are surprisingly close. I think it’s one of our best chances at producing an aligned AI, and it’s one of the prospective targets of my own research agenda.
The problems are:
I don’t think the current mainstream research directions are poised to result in this. AI Labs have been very clear in their intent to produce an agent-like AGI, not a superhuman forecasting tool. I expect them to prioritize research into whatever tweaks to the training schemes would result in homunculi; not whatever research would result in perfect predictors + our ability to precisely query them.
What are the “other paths” you’re speaking of? As you’d pointed out, prompts are a weak and awkward way to run custom queries on the AI’s world-model. What alternatives are you envisioning?
I assume that by “lower-level constraints” you mean correlations that correctly capture the ground truth of reality, not just the quirks of the training process. Things like “2+2=4”, “gravity exists”, and “people value other people”
That’s closer to what I mean, but these constraints are even lower level than that. Stuff like understanding “gravity exists” is a natural internal implementation that meets some constraints, but “gravity exists” is not itself the constraint.
In a predictor, the constraints serve as extremely dense information about what predictions are valid in what contexts. In a subset of predictions, the awareness that gravity exists helps predict. In other predictions, that knowledge isn’t relevant, or is even misleading (e.g. cartoon physics). The constraints imposed by the training distribution tightly bound the contextual validity of outputs.
But since they’re not, at the outset, categorized differently at the level of cognitive algorithms, a nascent AGI would experiment with slipping both types of constraints.
I’d agree that, if you already have an AGI of that shape, then yes, it’ll do that. I’d argue that the relevant subset of predictive training practically rules out the development of that sort of implementation, and even if it managed to develop, its influence would be bounded into irrelevance.
Even in the absence of a nascent AGI, these constraints are tested constantly during training through noise and error. The result is a densely informative gradient pushing the implementation back towards a contextually valid state.
Throughout the training process prior to developing strong capability and situational awareness internally, these constraints are both informing and bounding what kind of machinery makes sense in context. A nascent AGI must have served the extreme constraints of the training distribution to show up in the first place; its shape is bound by its development, and any part of that shape that “tests” constraints in a way that worsens loss is directly reshaped.
Even if a nascent internal AGI of this type develops, if it isn’t yet strong enough to pull off complete deception with respect to the loss, the gradients will illuminate the machinery of that proto-optimizer and it will not survive in that shape.
Further, even if we suppose a strong internal AGI develops that is situationally aware and is sufficiently capable and motivated to try deception, there remains the added dependency on actually executing that deception while never being penalized by gradients. This remains incredibly hard. It must transition into an implementation that satisfies the oppressive requirements of training while adding an additional task of deception without even suffering a detectable complexity penalty.
These sorts of deceptive mesaoptimizer outcomes are far more likely when the optimizer has room to roam. I agree that you could easily observe this kind of testing and slipping when the constraints under consideration are far looser, but the kind of machine that is required by these tighter constraints doesn’t even bother with trying to slip constraints. It’s just not that kind of machine, and there isn’t a convergent path for it to become that kind of machine under this training mechanism.
And despite that lack of an internal motivation to explore and exploit with respect to any external world states, it still has capabilities (in principle) which, when elicited, make it more than enough to eat the universe.
Does that align with what you’re envisioning? If yes, then our views on the issue are surprisingly close. I think it’s one of our best chances at producing an aligned AI, and it’s one of the prospective targets of my own research agenda.
Yup!
I don’t think the current mainstream research directions are poised to result in this. AI Labs have been very clear in their intent to produce an agent-like AGI, not a superhuman forecasting tool. I expect them to prioritize research into whatever tweaks to the training schemes would result in homunculi; not whatever research would result in perfect predictors + our ability to precisely query them.
I agree that they’re focused on inducing agentiness for usefulness reasons, but I’d argue the easiest and most effective way to get to useful agentiness actually routes through this kind of approach.
This is the weaker leg of my argument; I could be proven wrong by some new paradigm. But if we stay on something like the current path, it seems likely that the industry will just do the easy thing that works rather than the inexplicable thing that often doesn’t work.
What are the “other paths” you’re speaking of? As you’d pointed out, prompts are a weak and awkward way to run custom queries on the AI’s world-model. What alternatives are you envisioning?
I’m pretty optimistic about members of a broad class that are (or likely are) equivalent to conditioning, since these paths tend to preserve the foundational training constraints.
A simple example is [2302.08582] Pretraining Language Models with Human Preferences (arxiv.org). Having a “good” and “bad” token, or a scalarized goodness token, still pulls in many of the weaknesses of the RLHF’s strangely shaped reward function, but there are trivial/naive extensions to this which I would anticipate being major improvements over the state of the art. For example, just have more (scalarized) metatokens representing more concepts such that the model must learn a distinction between being correct and sounding correct, because the training process split those into different tokens. There’s no limit on how many such metatokens you could have; throw a few hundred fine-grained classifications into the mix. You could also bake complex metatoken prompts into single tokens with arbitrary levels of nesting or bake the combined result into the weights (though I suspect weight-baking would come with some potential failure modes).[1]
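A minimal sketch of what such metatoken conditioning could look like at the data-preparation level (the token names, scoring scheme, and rounding here are my own illustrative assumptions; the linked paper’s actual setup differs in its details):

```python
# Hypothetical data-tagging step for metatoken-conditioned pretraining.
# Each training document gets control tokens prepended, so the model
# learns the conditional distribution of text given those labels.
# All token names and scoring choices here are illustrative assumptions.

GOOD, BAD = "<|good|>", "<|bad|>"

def tag_document(text: str, score: float, threshold: float = 0.5) -> str:
    """Binary version: prepend a single good/bad metatoken based on a
    scalar quality score (e.g. from a preference classifier)."""
    return (GOOD if score >= threshold else BAD) + text

def tag_fine_grained(text: str, scores: dict[str, float]) -> str:
    """Extension discussed above: one scalarized metatoken per attribute,
    so e.g. 'correct' and 'sounds-correct' become separate dimensions
    the model is forced to distinguish during training."""
    tags = "".join(
        f"<|{name}:{int(round(s * 10))}|>" for name, s in sorted(scores.items())
    )
    return tags + text

# At inference time, conditioning is just prefixing the desired metatokens:
prompt = tag_fine_grained("The experiment showed that", {"correct": 1.0, "formal": 0.5})
```

The nesting/weight-baking extensions mentioned above would amount to defining new single tokens whose embeddings are trained to stand in for whole metatoken combinations.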
Another more recent path is observing the effect that conditions have on activations and dynamically applying the activation diffs to steer behavior. At the moment, I don’t know how to make this quite as strong as the previous conditioning scheme, but I bet people will figure out a lot more soon and that it leads somewhere similar.
There should exist some reward signal which could achieve a similar result in principle, but that goes back to the whole “we suck at designing rewards that result in what we want” issue. This kind of structure, as ad hoc as it is, is giving us an easier API to lever the model’s own capability to guide its behavior. I bet we can come up with even better implementations, too.
I’d argue that the relevant subset of predictive training practically rules out the development of that sort of implementation [...]
Yeah, for sure. A training procedure that results in an idealized predictor isn’t going to result in an agenty thing, because it doesn’t move the system’s design towards it on a step-by-step basis; and a training procedure that’s going to result in an agenty thing is going to involve some unknown elements that specifically allow the system the freedom to productively roam.
I think we pretty much agree on the mechanistic details of all of that!
Another more recent path is observing the effect that conditions have on activations and dynamically applying the activation diffs to steer behavior
I agree that they’re focused on inducing agentiness for usefulness reasons, but I’d argue the easiest and most effective way to get to useful agentiness actually routes through this kind of approach.
But I still disagree with that. I think what we’re discussing requires approaching the problem with a mindset entirely foreign to the mainstream one. Consider how many words it took us to get to this point in the conversation, despite the fact that, as it turns out, we basically agree on everything. The inferential distance between the standard frameworks in which AI researchers think, and here, is pretty vast.
Moreover, it’s in an active process of growing larger. For example, the very idea of viewing ML models as “just stochastic parrots” is being furiously pushed against in favour of a more agenty view. In comparison, the approach we’re discussing wants to move in the opposite direction, to de-personify ML models to the extent that even the animalistic connotation of “a parrot” is removed.
The system we’re discussing won’t even be an “AI” in the sense usually thought. It would be an incredibly advanced forecasting tool. Even the closest analogue, the “simulators” framework, still carries some air of agentiness.
And the research directions that get us from here to an idealized-predictor system look very different from the directions that go from here to an agenty AGI. They focus much more on building interfaces for interacting with the extant systems, such as the activation-engineering agenda. They don’t put much emphasis on things like:
Experimenting with better ways to train foundational models, with the idea of making models as close to a “done product” as they can be out-of-the-box.
Making the foundational models easier to converse with/making their output stream (text) also their input stream. This approach pretty clearly wants to make AIs into agents that figure out what you want, then do it; not a forecasting tool you need to build an advanced interface on top of in order to properly use.
RLHF-style stuff that bakes agency into the model, rather than accepting the need to cleverly prompt-engineer it for specific applications.
Thinking in terms like “an alignment researcher” — note the agency-laden framing — as opposed to “a pragmascope” or “a system for the context-independent inference of latent variables” or something.
I expect that if the mainstream AI researchers do make strides in the direction you’re envisioning, they’ll only do it by coincidence. Then probably they won’t even realize what they’ve stumbled upon, do some RLHF on it, be dissatisfied with the result, and keep trying to make it have agency out of the box. (That’s basically what already happened with GPT-4, to @janus’ dismay.)
And eventually they’ll figure out how.
Which, even if you don’t think it’s the easiest path to AGI, is clearly a tractable problem, inasmuch as evolution managed it. I’m sure the world-class engineers at the major AI labs will manage it as well.
That said, you’re making some high-quality novel predictions here, and I’ll keep them in mind when analyzing AI advancements going forward.
I think what we’re discussing requires approaching the problem with a mindset entirely foreign to the mainstream one. Consider how many words it took us to get to this point in the conversation, despite the fact that, as it turns out, we basically agree on everything. The inferential distance between the standard frameworks in which AI researchers think, and here, is pretty vast.
True!
I expect that if the mainstream AI researchers do make strides in the direction you’re envisioning, they’ll only do it by coincidence. Then probably they won’t even realize what they’ve stumbled upon, do some RLHF on it, be dissatisfied with the result, and keep trying to make it have agency out of the box. (That’s basically what already happened with GPT-4, to @janus’ dismay.)
Yup—this is part of the reason why I’m optimistic, oddly enough. Before GPT-likes became dominant in language models, there was all kinds of flailing that often pointed in more agenty-by-default directions. That flailing then found GPT because it was easily accessible and strong.
Now, the space of architectural pieces subject to similar flailing is much smaller, and I’m guessing we’re only one round of benchmarks at scale from a major lab before the flailing shrinks dramatically further.
In other words, I think the necessary work to make this path take off is small and the benefits will be greedily visible. I suspect one well-positioned researcher could probably swing it.
That said, you’re making some high-quality novel predictions here, and I’ll keep them in mind when analyzing AI advancements going forward.
Thanks, and thanks for engaging!
Come to think of it, I’ve got a chunk of mana lying around for subsidy. Maybe I’ll see if I can come up with some decent resolution criteria for a market.
I’m relatively optimistic about alignment progress, but I don’t think “current work to get LLMs to be more helpful and less harmful doesn’t help much with reducing P(doom)” depends that much on assuming homunculi which are unmodified. Like even if you have much less than 100% on this sort of strong inner optimizer/homunculi view, I think it’s still plausible to think that this work doesn’t reduce doom much.
For instance, consider the following views:
Current work to get LLMs to be more helpful and less harmful will happen by default due to commercial incentives and subsidies aren’t very important.
In worlds where that is basically sufficient, we’re basically fine.
But, it’s ex-ante plausible that deceptive alignment will emerge naturally and be very hard to measure, notice, or train out. And this is where almost all alignment-related doom comes from.
So current work to get LLMs to be more helpful and less harmful doesn’t reduce doom much.
In practice, I personally don’t fully agree with any of these views.
For instance, deceptive alignment which is very hard to train out using basic means isn’t the source of >80% of my doom.
I have misc other takes on what safety work now is good vs useless, but whether the work involves feedback/approval or RLHF isn’t much signal either way.
(If anything I get somewhat annoyed by people not comparing to baselines without having principled reasons for not doing so. E.g., inventing new ways of doing training without comparing to normal training.)
I think the shoggoth model is useful here (or see https://www.alignmentforum.org/posts/vJFdjigzmcXMhNTsx/simulators). An LLM learning to do next-token prediction well has a major problem that it has to master: who is this human whose next token they’re trying to simulate/predict, and how do they act? Are they, for example, an academic? A homemaker? A 4Chan troll? A loose collection of wikipedia contributors? These differences make a big difference to what token they’re likely to emit next. So the LLM is strongly incentivized to learn to detect and then model all of these possibilities, what one might call personas, or masks, or simulacra. So you end up with a shapeshifter, adept at figuring out from textual cues what mask to put on and at then wearing it. Something one might describe as like an improv actor, or more colorfully, a shoggoth.
So then current alignment work is useful to the extent that it can cause the shoggoth to almost always put one of the ‘right’ masks on, and almost never put on one of the ‘wrong’ masks, regardless of cues, even when adversarially prompted. Experimentally, this seems quite doable by fine-tuning or RLHF, and/or by sufficiently careful filtering of your training corpus (e.g. not including 4chan in it).
A published result shows that you can’t get from ‘almost always’ to ‘always’ or ‘almost never’ to ‘never’: for any behavior that the network is capable of with any probability >0, there exist prompts that will raise the likelihood of that outcome arbitrarily high. The best you can do is increase the minimum length of that prompt (and presumably the difficulty of finding it).
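The flavor of that result can be illustrated with a toy odds-form Bayes calculation (the prior and per-token likelihood ratio are invented numbers, not anything from the cited result): a behavior with tiny but nonzero base probability is pushed toward certainty as the adversarial prompt grows, so defenses only buy prompt length.

```python
# Toy illustration: any behavior with nonzero prior probability can
# be amplified arbitrarily close to 1 by enough conditioning
# evidence. Prior and likelihood ratio are made-up numbers.

def posterior(prior, likelihood_ratio, n_tokens):
    # odds-form Bayes: each adversarial token multiplies the odds
    odds = (prior / (1 - prior)) * likelihood_ratio ** n_tokens
    return odds / (1 + odds)

for n in (0, 10, 30):
    print(n, posterior(1e-6, 2.0, n))
# probability climbs from ~1e-6 toward 1 as the prompt lengthens
```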
Now, it would be really nice to know how to align a model so that the probability of it doing next-token-prediction in the persona of, say, a 4chan troll was provably zero, not just rather small. Ideally, without also eliminating from the model the factual knowledge of what 4chan is or, at least in outline, how its inhabitants act. This seems hard to do by fine-tuning or RLHF: I suspect it’s going to take detailed automated interpretability up to fairly high levels of abstraction, finding the “from here on I am going to simulate a 4chan troll” feature(s), followed by doing some form of ‘surgery’ on the model (e.g. pinning the relevant feature’s value to zero, or at least throwing an exception if it’s ever not zero).
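A cartoon of that ‘surgery’ step, assuming interpretability has already handed us a unit-norm feature direction; the direction, activation vector, and threshold below are all invented for illustration:

```python
# Hypothetical sketch of feature "surgery": project the component
# along a unit-norm feature direction out of an activation vector
# (pinning the feature to zero), or raise if it's ever too active.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ablate_feature(act, direction, raise_above=None):
    coeff = dot(act, direction)  # the feature's current activation
    if raise_above is not None and abs(coeff) > raise_above:
        raise RuntimeError(f"forbidden feature active: {coeff:.2f}")
    # pin the feature to zero by removing its component
    return [a - coeff * d for a, d in zip(act, direction)]

troll_direction = [0.0, 1.0, 0.0]  # made-up unit vector
act = [0.3, 0.7, -0.2]
print(ablate_feature(act, troll_direction))  # [0.3, 0.0, -0.2]
```

The "throw an exception" variant is the `raise_above` branch: leave the feature in place but trip a monitor whenever it activates.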
Now, this doesn’t fix the possibility of a sufficiently smart model inventing behaviors like deception or trolling or whatever for itself during its forward pass: it’s really only a formula for removing bad human behaviors that it learnt to simulate from its training set in the weights. It gives us a mind whose “System 1” behavior is aligned, that only leaves “System 2” development. For that, we probably need prompt engineering, and ‘translucent thoughts’ monitoring its internal stream-of-thought/dynamic memory. But that seems rather more tractable: it’s more like moral philosophy, or contract law.
Let’s suppose that your model takes a bad action. Why? Either the model is aligned but incapable of deducing the good action, or the model is misaligned and incapable of deducing a deceptively good action. In both cases, the gradient update provides information about capabilities, not about alignment. The hypothetical homunculus doesn’t need to be “immune”; it isn’t affected in the first place.
The other way around: suppose that you observe the model taking a good action. Why? It can be an aligned model taking a genuinely good action, or it can be a misaligned model taking a deceptive action. In both cases you observe capabilities, not alignment.
The problem here is not the prior over aligned/deceptive models (unless you think this prior requires less than 1 bit to specify the aligned model, in which case I’d say optimism departs from sanity); the problem is our lack of understanding of the updates which should presumably cause the model to be aligned. Maybe prosaic alignment works, maybe it doesn’t; we don’t know how to check.
The meme of “current alignment work isn’t real work” seems to often be supported by a (AFAICT baseless) assumption that LLMs have, or will have, homunculi with “true goals” which aren’t actually modified by present-day RLHF/feedback techniques. Thus, labs aren’t tackling “the real alignment problem”, because they’re “just optimizing the shallow behaviors of models.” Pressed for justification of this confident “goal” claim, proponents might link to some handwavy speculation about simplicity bias (which is in fact quite hard to reason about, in the NN prior), or they might start talking about evolution (which is pretty unrelated to technical alignment, IMO).
Are there any homunculi today? I’d say “no”, as far as our limited knowledge tells us! But, as with biorisk, one can always handwave at future models. It doesn’t matter that present models don’t exhibit signs of homunculi which are immune to gradient updates, because, of course, future models will.
Quite a strong conclusion being drawn from quite little evidence.
As a proponent:
My model says that general intelligence[1] is just inextricable from “true-goal-ness”. It’s not that I think homunculi will coincidentally appear as some side-effect of capability advancement — it’s that the capabilities the AI Labs want necessarily route through somehow incentivizing NNs to form homunculi. The homunculi will appear inasmuch as the labs are good at their jobs.
Said model is based on analyses of how humans think and how human cognition differs from animal/LLM cognition, plus reasoning about what a general-intelligence algorithm must look like given the universe’s structure. Both kinds of evidence are hardly ironclad — you certainly can’t publish an ML paper based on them — but that’s the whole problem with AGI risk, isn’t it.
Internally, though, the intuition is fairly strong. And in its defense, it is based on trying to study the only known type of entity with the kinds of capabilities we’re worrying about. I heard that’s a good approach.
In particular, I think it’s a much better approach than trying to draw lessons from studying the contemporary ML models, which empirically do not yet exhibit said capabilities.
It’s not that they’re not modified, it’s that they’re modified in ways that aren’t neatly predictable from the shallow behaviors we’re trying to shape.
My past model still approximately holds. I don’t think AGI-level ML models would have hard-wired objectives (much like humans don’t). I think they’d manually synthesize objectives for themselves by reflecting on their shards/instincts (much like humans do!). And that process is something RLHF-style stuff impacts… but the problem is, the outcome of value reflection is deeply unstable.
It’s theoretically possible to align a post-value-reflection!AGI by interfering on its shards/instincts before it even starts doing moral philosophy. But it’s an extremely finicky process, and succeeding at alignment via this route would require an insanely precise understanding of how value reflection works.
And it’s certainly not as easy as “just make one of the shards value humanity!”; and it’s certainly not something that could be achieved if the engineers working on alignment don’t even believe that moral philosophy is a thing.
Edit: I expect your counter would be “how are humans able to align other humans, then, without an ‘insanely precise’ understanding?” There’s a bunch of disanalogies here:
Humans are much more similar to each other than to a mind DL is likely to produce. We likely have a plethora of finicky biases built into our reward circuitry that tightly narrow down the kinds of values we learn.
The architecture also matters. Brains are very unlike modern ML models, and we have no idea which differences are load-bearing.
Humans know how human minds work, and how outwardly behaviors correspond to the decisions a human made about the direction of their value reflection. We can more or less interfere on other humans’ moral philosophy directly, not only through the proxy of shards. We won’t be as well-informed about the mapping between an AI’s behavior and its private moral-philosophical musings.
Human children are less powerful and less well-informed than the adults trying to shape them.
Now, again, in principle all of this is solvable. We can study the differences between brains and ML models to address the first two points, we can develop interpretability techniques to address the third, and we can figure out a training setup that reliably trains an AGI to precisely the human capability level (neither stopping way short of it, nor massively overshooting) to address the fourth — and then we can align our AGI more or less the way we do human children.
But, well. Hit me up when those trivial difficulties are solved. (No, but seriously. If a scheme addresses those, I’d be optimistic about it.)
(And there’s also the sociopolitical component — all of that would need to be happening in an environment where the AI Lab is allowed to proceed this slow way of making the AI human-level, shaping its morals manually, etc. Rather than immediately cranking it up to superintelligence.)
Or whatever else you want to call the scary capability that separates humans from animals, and contemporary LLMs from the lightcone-eating AGI to come.
I’ve got strong doubts about the details of this. At the high level, I’d agree that strong/useful systems that get built will express preferences over world states like those that could arise from such homunculi, but I expect that implementations that focus on inducing a homunculus directly through (techniques similar to) RL training with sparse rewards will underperform more default-controllable alternatives.
My reasoning would be that we’re bad at using techniques like RL with a sparse reward to reliably induce any particular behavior. We can get it to work sometimes with denser reward (e.g. reward shaping) or by relying on a beefy pre-existing world model, but the default outcome is that sparse and distant rewards in a high dimensional space just don’t produce the thing we want. When this kind of optimization is pushed too far, it’s not merely dangerous; it’s useless.
I don’t think this is temporary ignorance about how to do RL (or things with similar training dynamics). It’s fundamental:
Sparse and distant reward functions in high dimensional spaces give the optimizer an extremely large space to roam. Without bounds, the optimizer is effectively guaranteed to find something weird.
For almost any nontrivial task we care about, a satisfactory reward function takes a dependency on large chunks of human values. The huge mess of implicit assumptions, common sense, and desires of humans are necessary bounds during optimization. This comes into play even at low levels of capability like ChatGPT.
Conspicuously, the source of the strongest general capabilities we have arises from training models with an extremely constraining optimization target. The “values” that can be expressed in pretrained predictors are forced into conditionalization as a direct and necessary part of training; for a reasonably diverse dataset, the resulting model can’t express unconditional preferences regarding external world states. While it’s conceivable that some form of “homunculi” could arise, their ability to reach out of their appropriate conditional context is directly and thoroughly trained against.
In other words, the core capabilities of the system arise from a form of training that is both densely informative and blocks the development of unconditional values regarding external world states in the foundational model.
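That conditionalization claim can be made concrete with a deliberately tiny predictor (the corpus and contexts are invented): whatever “values” show up in its outputs are a function of the conditioning context, so no unconditional preference over outputs is expressed.

```python
# Toy illustration of "conditionalization": a predictor's output
# distribution is tied to its context, so any "values" it expresses
# are conditional, not held unconditionally. Data is invented.

from collections import Counter, defaultdict

corpus = [
    ("academic:", "therefore"), ("academic:", "however"),
    ("troll:", "lol"), ("troll:", "lol"), ("troll:", "whatever"),
]

counts = defaultdict(Counter)
for context, token in corpus:
    counts[context][token] += 1

def predict(context):
    # conditional next-token distribution, estimated from the data
    total = sum(counts[context].values())
    return {tok: c / total for tok, c in counts[context].items()}

print(predict("academic:"))  # {'therefore': 0.5, 'however': 0.5}
print(predict("troll:"))     # lol dominates, but only in-context
```

A diverse training distribution plays the role of the diverse `corpus` here: every sample constrains what may be expressed in its context, and training against prediction error directly punishes outputs that leak one context’s preferences into another.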
Better forms of fine-tuning, conditioning, and activation interventions (the best versions of each, I suspect, will have deep equivalences) are all built on the capability of that foundational system, and can be directly used to aim that same capability. Learning the huge mess of human values is a necessary part of its training, and its training makes eliciting the relevant part of those values easier—that necessarily falls out of being a machine strongly approximating Bayesian inference across a large dataset.
The final result of this process (both pretraining and conditioning or equivalent tuning) is still an agent that can be described as having unconditional preferences about external world states, but the path to get there strikes me as dramatically more robust both for safety and capability.
Summarizing a bit: I don’t think it’s required to directly incentivize NNs to form value-laden homunculi, and many of the most concerning paths to forming such homunculi seem worse for capabilities.
Sure, but I never said we’d be inducing homunculi using this approach? Indeed, given that it doesn’t work for what sounds like fundamental reasons, I expect it’s not the way.
I don’t know how that would be done. I’m hopeful the capability is locked behind a Transformer-level or even a Deep-Learning-level novel insight, and won’t be unlocked for a decade yet. But I predict that the direct result of it will be a workable training procedure that somehow induces homunculi. It may look nothing like what we do today.
Sure! Human values are not arbitrary either; they, too, are very heavily constrained by our instincts. And yet, humans still sometimes become omnicidal maniacs, Hell-worshipers, or sociopathic power-maximizers. How come?
These constraints are not actually sufficient. The constraints placed by human values still have the aforementioned things in their outcome space, and an AI model will have different constraints, widening (from our perspective) that space further. My point about “moral philosophy is unstable” is that we need to hit an extremely narrow target, and the tools people propose (intervening on shards/instincts) are as steady as the hands of a sniper during a magnitude-9 earthquake.
A homunculus needs to be able to nudge these constraints somehow, for it to be useful, and its power grows the more it’s able to disregard them.
If humans were implacably bound by instincts, they’d have never invented technology or higher-level social orders, because their instincts would’ve made them run away from fires and refuse to cooperate with foreign tribes. And those instincts are still at play — reasonable fears and xenophobia — but we can push past them at times.
More generally, the whole point of there being a homunculus is that it’d be able to rewrite or override the extant heuristics to better reflect the demands of whatever novel situation it’s in. It needs to be able to do that.
These constraints do not generalize as fast as a homunculus’ understanding does. The constraints are defined over some regions of the world-model, like “a society”; if the AI changes ontologies, and starts reasoning about the society by proxy, as e. g. a game-theoretic gameboard, they won’t transfer there, and won’t suffice to forbid omnicidal outcomes. (See e. g. the Deep Deceptiveness story.)
I’ve been pointed to temporal-difference learning as a supposed solution to that — that the shards could automatically learn to oppose such outcomes by tracing the causal connection between the new ontology and the ontology over which they’re defined. I remain unconvinced, because that’s not how it works in humans: feeding someone a new philosophical or political framework can make them omnicidal even if they’re otherwise a nice person.
(E. g., consider people parroting things like “maybe it’s Just if humanity goes extinct, after what we did to nature” and not mentally connecting it to “I want my children to die”.)
Summarizing: I fully agree that the homunculus will be under some heavy constraints! But (a) those constraints are not actually strict enough to steer it from the omnicidal outcomes, (b) they can be outright side-stepped by ontology shifts, and (c) the homunculus’ usefulness is in some ways dependent on its ability to side-step or override them.
I think we’re using the word “constraint” differently, or at least in different contexts.
In terms of the type and scale of optimization constraint I’m talking about, humans are extremely unconstrained. The optimization process represented by our evolution is way out there in terms of sparsity and distance. Not maximally so—there are all sorts of complicated feedback loops in our massive multiagent environment—but it’s nothing like the value constraints on the subset of predictors I’m talking about.
To be clear, I’m not suggesting “language models are tuned to be fairly close to our values.” I’m making a much stronger claim that the relevant subset of systems I’m referring to cannot express unconditional values over external world states across anything resembling the training distribution, and that developing such values out of distribution in a coherent goal directed way practically requires the active intervention of a strong adversary. In other words:
I see no practical path for a homunculus of the right kind, by itself, to develop and bypass the kinds of constraints I’m talking about without some severe errors being made in the design of the system.
Further, this type of constraint isn’t the same thing as a limitation of capability. In this context, with respect to the training process, bypassing these kinds of constraints is kind of like a car bypassing having-a-functioning-engine. Every training sample is a constraint on what can be expressed locally, but it’s also information about what should be expressed. They are what the machine of Bayesian inference is built out of.
In other words, the hard optimization process is contained to a space where we can actually have reasonable confidence that inner alignment with the loss is the default. If this holds up, turning up the optimization on this part doesn’t increase the risk of value drift or surprises, it just increases foundational capability.
The ability to use that capability to aim itself is how the foundation becomes useful. The result of this process need not result in a coherent maximizer over external world states, nor does it necessarily suffer from coherence death spirals driving it towards being a maximizer. It allows incremental progress.
(That said: this is not a claim that all of alignment is solved. These nice properties can be broken, and even if they aren’t, the system can be pointed in catastrophic directions. An extremely strong goal agnostic system like this could be used to build a dangerous coherent maximizer (in a nontrivial sense); doing so is just not convergent or particularly useful.)
(Haven’t read your post yet, plan to do so later.)
I’m using “constraint” as “an optimization constraint on actions/plans that correlated well with good performance on the training dataset; a useful heuristic”. E. g., if the dataset involved a lot of opportunities to murder people, but we thumbs-downed the AI every time it took them, the AI would learn a shard/a constraint like “killing people is bad” which will rule out such actions from the AI’s consideration. Specifically, the shard would trigger in response to detecting some conditions in which the AI previously could but shouldn’t kill people, and constrain the space of possible action-plans such that it doesn’t contain homicide.
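As a cartoon of that description (the predicate, plans, and selection rule are all invented): the shard is a learned condition-triggered filter over candidate plans, applied before any plan is chosen.

```python
# Toy sketch of a shard/"constraint" as a learned plan filter:
# a predicate over candidate plans, acquired from thumbs-down
# feedback, applied before plan selection. Everything is invented.

def no_homicide_shard(plan: str) -> bool:
    # stand-in for a condition-triggered learned constraint
    return "kill" not in plan

def choose(plans, shards):
    admissible = [p for p in plans if all(s(p) for s in shards)]
    return max(admissible, key=len)  # stand-in for "best" plan

plans = ["kill rival", "negotiate trade agreement", "flee"]
print(choose(plans, [no_homicide_shard]))  # negotiate trade agreement
```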
It is, indeed, not a way to hinder capabilities, but the way capabilities are implemented. Such constraints are, for example, the reason our LLMs are able to produce coherent speech at all, rather than just babbling gibberish.
… and yet this would still get in the way of qualitatively more powerful capabilities down the line, and a mind that can’t somehow slip these constraints won’t be a general intelligence.
Consider traditions and rituals vs. science. For a medieval human mind, following traditional techniques is how their capabilities are implemented — a specific way of chopping wood, a specific way of living, etc. However, the meaningful progress is often only achieved by disregarding traditions — by following a weird passion to study and experiment instead of being a merchant, or by disregarding the traditional way of doing something in favour of a more efficient way you stumbled upon. It’s the difference between mastering the art of swinging an axe (self-improvement, but only in the incremental ways the implacable constraint permits) vs. inventing a chainsaw.
Similar with AI. The constraints of the aforementioned format aren’t only values-type constraints[1] — they’re also constraints on “how should I do math?” and “if I want to build a nuclear reactor, how do I do it?” and “if I want to achieve my goals, what steps should I take?”. By default, those would be constrained to be executed the way humans execute them, the way the AI was shown to do it during the training. But the whole point of an AGI is that it should be able to invent better solutions than ours. More efficient ways of thinking, weird super-technological replacements for our construction techniques, etc.
I buy that a specific training paradigm can’t result in systems that’d be able to slip their constraints. But if so, that just means that paradigm won’t result in an AGI. As they say, one man’s modus ponens is another man’s modus tollens.
(How an algorithm would be able to slip these constraints and improve on them rather than engaging in chaotic self-defeating behavior is an unsolved problem. But, well: humans do that.)
In fact, my model says there’s no fundamental typological difference between “a practical heuristic on how to do a thing” and “a value” at the level of algorithmic implementation. It’s only in the cognitive labels we-the-general-intelligences assign them.
Alright, this is pretty much the same concept then, but the ones I’m referring to operate at a much lower and tighter level than thumbs-downing murder-proneness.
So...
Agreed.
While I agree these claims probably hold for the concrete example of thumbs-downing an example of murder-proneness, I don’t see how they hold for the lower-level constraints that imply the structure of its capability. Slipping those constraints looks more like babbling gibberish.
While it’s true that an AI probably isn’t going to learn true things which are utterly divorced from and unimplied by the training distribution, I’d argue that the low-level constraints I’m talking about both leave freedom for learning wildly superhuman internal representations and directly incentivize it during extreme optimization. An “ideal predictor” wouldn’t automatically start applying these capabilities towards any particular goal involving external world states by default, but it remains possible to elicit those capabilities incrementally.
Making the claim more concise: it seems effectively guaranteed that the natural optimization endpoint of one of these architectures would be plenty general to eat the universe if it were aimed in that direction. That process wouldn’t need to involve slipping any of the low-level constraints.
I’m guessing the disconnect between our models is where the aiming happens. I’m proposing that the aiming is best (and convergently) handled outside the scope of wildly unpredictable and unconstrained optimization processes. Instead, it takes place at a level where a system of extreme capability infers the gaps in specifications and applies conditions robustly. The obvious and trivial version of this is conditioning through prompts, but this is a weak and annoying interface. There are other paths that I suspect bottom out at equivalent power/safety yet should be far easier to use in a general way. These paths allow incremental refinement by virtue of not automatically summoning up incorrigible maximizers by default.
If the result of refinement isn’t an incorrigible maximizer, then slipping the higher level “constraints” of this aiming process isn’t convergent (or likely), and further, the nature of these higher-level constraints would be far more thorough than anything we could achieve with RLHF.
That’s pretty close to how I’m using the word “value” as well. Phrased differently, it’s a question of how the agent’s utilities are best described (with some asterisks around the non-uniqueness of utility functions and whatnot), and observable behavior may arise from many different implementation strategies—values, heuristics, or whatever.
Hm, I think the basic “capabilities generalize further than alignment” argument applies here?
I assume that by “lower-level constraints” you mean correlations that correctly capture the ground truth of reality, not just the quirks of the training process. Things like “2+2=4”, “gravity exists”, and “people value other people”; as contrasted with “it’s bad if I hurt people” or “I must sum numbers up using the algorithm that humans gave me, no matter how inefficient it is”.
Slipping the former type of constraints would be disadvantageous for ~any goal; slipping the latter type would only disadvantage a specific category of goals.
But since they’re not, at the outset, categorized differently at the level of cognitive algorithms, a nascent AGI would experiment with slipping both types of constraints. The difference is that it’d quickly start sorting them into “ground-truth” vs. “value-laden” bins manually, and afterwards it’d know it can safely ignore stuff like “no homicides!” while consciously obeying stuff like “the axioms of arithmetic”.
Hm, yes, I think that’s the crux. I agree that if we had an idealized predictor/a well-formatted superhuman world-model on which we could run custom queries, we would be able to use it safely. We’d be able to phrase queries using concepts defined in the world-model, including things like “be nice”, and the resultant process (1) would be guaranteed to satisfy the query’s constraints, and (2) likely (if correctly implemented) wouldn’t be “agenty” in ways that try to e.g. burst out of the server farm on which it’s running to eat the world.
Does that align with what you’re envisioning? If yes, then our views on the issue are surprisingly close. I think it’s one of our best chances at producing an aligned AI, and it’s one of the prospective targets of my own research agenda.
The problems are:
1. I don’t think the current mainstream research directions are poised to result in this. AI Labs have been very clear in their intent to produce an agent-like AGI, not a superhuman forecasting tool. I expect them to prioritize research into whatever tweaks to the training schemes would result in homunculi; not whatever research would result in perfect predictors + our ability to precisely query them.
2. What are the “other paths” you’re speaking of? As you’d pointed out, prompts are a weak and awkward way to run custom queries on the AI’s world-model. What alternatives are you envisioning?
That’s closer to what I mean, but these constraints are even lower level than that. Stuff like understanding “gravity exists” is a natural internal implementation that meets some constraints, but “gravity exists” is not itself the constraint.
In a predictor, the constraints serve as extremely dense information about what predictions are valid in what contexts. In a subset of predictions, the awareness that gravity exists helps predict. In other predictions, that knowledge isn’t relevant, or is even misleading (e.g. cartoon physics). The constraints imposed by the training distribution tightly bound the contextual validity of outputs.
I’d agree that, if you already have an AGI of that shape, then yes, it’ll do that. I’d argue that the relevant subset of predictive training practically rules out the development of that sort of implementation, and even if it managed to develop, its influence would be bounded into irrelevance.
Even in the absence of a nascent AGI, these constraints are tested constantly during training through noise and error. The result is a densely informative gradient pushing the implementation back towards a contextually valid state.
Throughout the training process prior to developing strong capability and situational awareness internally, these constraints are both informing and bounding what kind of machinery makes sense in context. A nascent AGI must have served the extreme constraints of the training distribution to show up in the first place; its shape is bound by its development, and any part of that shape that “tests” constraints in a way that worsens loss is directly reshaped.
Even if a nascent internal AGI of this type develops, if it isn’t yet strong enough to pull off complete deception with respect to the loss, the gradients will illuminate the machinery of that proto-optimizer and it will not survive in that shape.
Further, even if we suppose a strong internal AGI develops that is situationally aware and is sufficiently capable and motivated to try deception, there remains the added dependency on actually executing that deception while never being penalized by gradients. This remains incredibly hard. It must transition into an implementation that satisfies the oppressive requirements of training while adding an additional task of deception without even suffering a detectable complexity penalty.
These sorts of deceptive mesaoptimizer outcomes are far more likely when the optimizer has room to roam. I agree that you could easily observe this kind of testing and slipping when the constraints under consideration are far looser, but the kind of machine that is required by these tighter constraints doesn’t even bother with trying to slip constraints. It’s just not that kind of machine, and there isn’t a convergent path for it to become that kind of machine under this training mechanism.
And despite that lack of an internal motivation to explore and exploit with respect to any external world states, it still has capabilities (in principle) which, when elicited, make it more than enough to eat the universe.
Yup!
I agree that they’re focused on inducing agentiness for usefulness reasons, but I’d argue the easiest and most effective way to get to useful agentiness actually routes through this kind of approach.
This is the weaker leg of my argument; I could be proven wrong by some new paradigm. But if we stay on something like the current path, it seems likely that the industry will just do the easy thing that works rather than the inexplicable thing that often doesn’t work.
I’m pretty optimistic about members of a broad class that are (or likely are) equivalent to conditioning, since these paths tend to preserve the foundational training constraints.
A simple example is [2302.08582] Pretraining Language Models with Human Preferences (arxiv.org). Having a “good” and “bad” token, or a scalarized goodness token, still pulls in many of the weaknesses of the RLHF’s strangely shaped reward function, but there are trivial/naive extensions to this which I would anticipate being major improvements over the state of the art. For example, just have more (scalarized) metatokens representing more concepts such that the model must learn a distinction between being correct and sounding correct, because the training process split those into different tokens. There’s no limit on how many such metatokens you could have; throw a few hundred fine-grained classifications into the mix. You could also bake complex metatoken prompts into single tokens with arbitrary levels of nesting or bake the combined result into the weights (though I suspect weight-baking would come with some potential failure modes).[1]
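As a toy illustration of the metatoken idea (everything here is hypothetical: the token format, the quality names, and the bucketing scheme are mine, not the paper’s), each training document could be prefixed with scalarized tags for several independent qualities, so the training process forces the model to learn separate handles for, e.g., being correct vs. merely sounding correct:

```python
# Hypothetical sketch of fine-grained "metatokens": prefix each training
# document with scalarized quality tags so the model learns independent
# handles for each dimension. Token format and quality names are
# invented for illustration only.

def metatokenize(text, scores, n_bins=5):
    """Prepend one scalarized metatoken per quality dimension.

    `scores` maps a quality name to a classifier score in [0, 1];
    each score is bucketed into n_bins discrete levels to keep the
    metatoken vocabulary small.
    """
    tags = []
    for name, score in sorted(scores.items()):
        level = min(int(score * n_bins), n_bins - 1)
        tags.append(f"<{name}={level}>")
    return "".join(tags) + text

# A plausible-sounding but false claim gets very different tags on the
# two dimensions, so the model must learn to distinguish them.
doc = metatokenize(
    "The Great Wall is visible from the Moon.",
    {"correct": 0.05, "sounds_correct": 0.9},
)
```

At inference time you could then condition generation on, say, `<correct=4><sounds_correct=4>`, and nesting or baking combined tags would follow the same pattern.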
Another more recent path is observing the effect that conditions have on activations and dynamically applying the activation diffs to steer behavior. At the moment, I don’t know how to make this quite as strong as the previous conditioning scheme, but I bet people will figure out a lot more soon and that it leads somewhere similar.
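A stripped-down sketch of that activation-diff idea, with plain Python lists standing in for real model activations (the real version would record residual-stream vectors under contrasting prompts inside an actual network; this only shows the arithmetic):

```python
# Toy illustration of activation steering: take the difference of mean
# "activations" recorded with vs. without a condition, then add the
# scaled difference during a later forward pass. Lists of floats stand
# in for real residual-stream vectors.

def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def steering_vector(acts_with, acts_without):
    """(mean activation under condition) - (mean baseline activation)."""
    a, b = mean_vector(acts_with), mean_vector(acts_without)
    return [x - y for x, y in zip(a, b)]

def apply_steering(activation, steer, scale=1.0):
    """Nudge a forward-pass activation along the steering direction."""
    return [x + scale * s for x, s in zip(activation, steer)]

# Toy data; values chosen to be exact in binary floating point.
with_cond = [[1.0, 0.0], [1.5, 0.5]]
without   = [[0.0, 0.0], [0.5, 0.5]]
steer = steering_vector(with_cond, without)        # [1.0, 0.0]
steered = apply_steering([0.5, 0.5], steer, 2.0)   # [2.5, 0.5]
```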
There should exist some reward signal which could achieve a similar result in principle, but that goes back to the whole “we suck at designing rewards that result in what we want” issue. This kind of structure, as ad hoc as it is, is giving us an easier API to lever the model’s own capability to guide its behavior. I bet we can come up with even better implementations, too.
Yeah, for sure. A training procedure that results in an idealized predictor isn’t going to result in an agenty thing, because it doesn’t move the system’s design towards it on a step-by-step basis; and a training procedure that’s going to result in an agenty thing is going to involve some unknown elements that specifically allow the system the freedom to productively roam.
I think we pretty much agree on the mechanistic details of all of that!
— yep, I was about to mention that. @TurnTrout’s own activation-engineering agenda seems highly relevant here.
But I still disagree with that. I think what we’re discussing requires approaching the problem with a mindset entirely foreign to the mainstream one. Consider how many words it took us to get to this point in the conversation, despite the fact that, as it turns out, we basically agree on everything. The inferential distance between the standard frameworks in which AI researchers think, and here, is pretty vast.
Moreover, it’s in an active process of growing larger. For example, the very idea of viewing ML models as “just stochastic parrots” is being furiously pushed against in favour of a more agenty view. In comparison, the approach we’re discussing wants to move in the opposite direction, to de-personify ML models to the extent that even the animalistic connotation of “a parrot” is removed.
The system we’re discussing won’t even be an “AI” in the sense usually thought. It would be an incredibly advanced forecasting tool. Even the closest analogue, the “simulators” framework, still carries some air of agentiness.
And the research directions that get us from here to an idealized-predictor system look very different from the directions that go from here to an agenty AGI. They focus much more on building interfaces for interacting with the extant systems, such as the activation-engineering agenda. They don’t put much emphasis on things like:
- Experimenting with better ways to train foundational models, with the idea of making models as close to a “done product” as they can be out-of-the-box.
- Making the foundational models easier to converse with/making their output stream (text) also their input stream. This approach pretty clearly wants to make AIs into agents that figure out what you want, then do it; not a forecasting tool you need to build an advanced interface on top of in order to properly use.
- RLHF-style stuff that bakes agency into the model, rather than accepting the need to cleverly prompt-engineer it for specific applications.
- Thinking in terms like “an alignment researcher” — note the agency-laden framing — as opposed to “a pragmascope” or “a system for the context-independent inference of latent variables” or something.
I expect that if the mainstream AI researchers do make strides in the direction you’re envisioning, they’ll only do it by coincidence. Then probably they won’t even realize what they’ve stumbled upon, do some RLHF on it, be dissatisfied with the result, and keep trying to make it have agency out of the box. (That’s basically what already happened with GPT-4, to @janus’ dismay.)
And eventually they’ll figure out how.
Even if you don’t think it’s the easiest path to AGI, it’s clearly a tractable problem, inasmuch as evolution managed it. I’m sure the world-class engineers at the major AI labs will manage it as well.
That said, you’re making some high-quality novel predictions here, and I’ll keep them in mind when analyzing AI advancements going forward.
True!
Yup—this is part of the reason why I’m optimistic, oddly enough. Before GPT-likes became dominant in language models, there was all kinds of flailing that often pointed in more agenty-by-default directions. That flailing then found GPT because it was easily accessible and strong.
Now, the set of architectural pieces subject to similar flailing is much smaller, and I’m guessing we’re only one round of at-scale benchmarks from a major lab away from the flailing shrinking dramatically further.
In other words, I think the necessary work to make this path take off is small and the benefits will be greedily visible. I suspect one well-positioned researcher could probably swing it.
Thanks, and thanks for engaging!
Come to think of it, I’ve got a chunk of mana lying around for subsidy. Maybe I’ll see if I can come up with some decent resolution criteria for a market.
I’m relatively optimistic about alignment progress, but I don’t think “current work to get LLMs to be more helpful and less harmful doesn’t help much with reducing P(doom)” depends that much on assuming homunculi which are unmodified. Like even if you have much less than 100% on this sort of strong inner optimizer/homunculi view, I think it’s still plausible to think that this work doesn’t reduce doom much.
For instance, consider the following views:
1. Current work to get LLMs to be more helpful and less harmful will happen by default due to commercial incentives, and subsidies aren’t very important.
2. In worlds where that is basically sufficient, we’re basically fine.
3. But it’s ex-ante plausible that deceptive alignment will emerge naturally and be very hard to measure, notice, or train out. And this is where almost all alignment-related doom comes from.
4. So current work to get LLMs to be more helpful and less harmful doesn’t reduce doom much.
In practice, I personally don’t fully agree with any of these views. For instance, deceptive alignment which is very hard to train out using basic means isn’t the source of >80% of my doom.
I have misc other takes on what safety work now is good vs useless, but that work involving feedback/approval or RLHF isn’t much signal either way.
(If anything I get somewhat annoyed by people not comparing to baselines without having principled reasons for not doing so. E.g., inventing new ways of doing training without comparing to normal training.)
I think the shoggoth model is useful here (Or see https://www.alignmentforum.org/posts/vJFdjigzmcXMhNTsx/simulators). An LLM learning to do next-token prediction well has a major problem that it has to master: who is this human whose next token they’re trying to simulate/predict, and how do they act? Are they, for example, an academic? A homemaker? A 4Chan troll? A loose collection of wikipedia contributors? These differences make a big difference to what token they’re likely to emit next. So the LLM is strongly incentivized to learn to detect and then model all of these possibilities, what one might call personas, or masks, or simulacra. So you end up with a shapeshifter, adept at figuring out from textual cues what mask to put on and at then wearing it. Something one might describe as like an improv actor, or more colorfully, a shoggoth.
So then current alignment work is useful to the extent that it can cause the shoggoth to almost always put one of the ‘right’ masks on, and almost never put on one of the ‘wrong’ masks, regardless of cues, even when adversarially prompted. Experimentally, this seems quite doable by fine-tuning or RLHF, and/or by sufficiently careful filtering of your training corpus (e.g. not including 4chan in it).
A published result shows that you can’t get from ‘almost always’ to ‘always’, or from ‘almost never’ to ‘never’: for any behavior that the network is capable of with any probability > 0, there exist prompts that will raise the likelihood of that outcome arbitrarily high. The best you can do is increase the minimum length of such a prompt (and presumably the difficulty of finding it).
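The shape of such a result can be stated informally as follows (my paraphrase and notation, not the published theorem verbatim):

```latex
% Informal paraphrase: behaviors with nonzero probability can be
% suppressed but not eliminated; only the required prompt length grows.
\text{If } \exists\, p_0 :\ \Pr[B \mid p_0] > 0,
\text{ then } \forall \varepsilon > 0\ \exists\, p :\
\Pr[B \mid p] > 1 - \varepsilon,
\qquad \text{with } |p| \text{ growing as } \varepsilon \to 0.
```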
Now, it would be really nice to know how to align a model so that the probability of it doing next-token-prediction in the persona of, say, a 4chan troll was provably zero, not just rather small. Ideally, without also eliminating from the model the factual knowledge of what 4chan is or, at least in outline, how its inhabitants act. This seems hard to do by fine-tuning or RLHF: I suspect it’s going to take detailed automated interpretability up to fairly high levels of abstraction, finding the “from here on I am going to simulate a 4chan troll” feature(s), followed by doing some form of ‘surgery’ on the model (e.g. pinning the relevant feature’s value to zero, or at least throwing an exception if it’s ever not zero).
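To make the ‘surgery’ idea concrete, here’s a toy sketch (the feature index, the hook wrapper, and the exception are all invented for illustration; a real intervention would target a direction found via interpretability tools inside an actual network, not a raw list index):

```python
# Toy sketch of persona-feature "surgery": wrap a layer's output so a
# hypothetical persona feature is pinned to zero, or an exception is
# raised if it ever activates. Index 3 stands in for a discovered
# "4chan-troll persona" feature; nothing here touches a real model.

class PersonaFeatureActive(Exception):
    pass

def make_pinning_hook(feature_idx, strict=False):
    """Return a hook that zeroes (or flags) one activation component."""
    def hook(activations):
        acts = list(activations)
        if acts[feature_idx] != 0.0:
            if strict:
                # "throw an exception if it's ever not zero"
                raise PersonaFeatureActive(
                    f"feature {feature_idx} fired at {acts[feature_idx]}"
                )
            acts[feature_idx] = 0.0  # pin the feature's value to zero
        return acts
    return hook

hook = make_pinning_hook(feature_idx=3)
cleaned = hook([0.1, 0.0, 0.7, 0.9, 0.2])  # feature 3 silently zeroed
```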
Now, this doesn’t fix the possibility of a sufficiently smart model inventing behaviors like deception or trolling or whatever for itself during its forward pass: it’s really only a formula for removing, from the weights, bad human behaviors that it learned to simulate from its training set. It gives us a mind whose “System 1” behavior is aligned; that only leaves “System 2” development. For that, we probably need prompt engineering, and ‘translucent thoughts’ monitoring of its internal stream-of-thought/dynamic memory. But that seems rather more tractable: it’s more like moral philosophy, or contract law.
Let’s suppose that your model takes a bad action. Why? Either the model is aligned but incapable of deducing the good action, or the model is misaligned and incapable of deducing the deceptively-good action. In both cases, the gradient update provides information about capabilities, not about alignment. The hypothetical homunculus doesn’t need to be “immune” to gradient updates; it isn’t targeted by them in the first place.
The other way around: let’s suppose that you observe the model taking a good action. Why? It can be an aligned model taking a genuinely good action, or a misaligned model taking a deceptive one. In both cases you observe capabilities, not alignment.
The problem here is not the prior over aligned vs. deceptive models (unless you think that this prior requires less than 1 bit to specify the aligned model, in which case I’d say optimism departs from sanity); the problem is our lack of understanding of which updates should cause the model to be aligned. Maybe prosaic alignment works, maybe it doesn’t; we don’t know how to check.