Edited the title!
I missed this post when it came out and just came across it. Re “Logical Induction as Metaphilosophy” I have the same objection that I recently made against “reflective equilibrium”:
Another argument against this form of reflective equilibrium is that it seems to imply anti-realism about normative decision theory, given differing intuitions between people. I think this is plausible but not likely, so it seems bad to bake it into our methodology of doing decision theory.
In other words, in Logical Induction as Metaphilosophy there seems to be nothing grounding one’s ultimate philosophical conclusions except the intuitions that one started with (unlike its application in math where there are proofs), so different people with different intuitions seem destined to reach different conclusions, even on topics that seemingly ought to have objective answers, like decision theory, metaethics, philosophy of consciousness, and metaphilosophy itself.
(not 100% sure, but, I think future readers would benefit more from a title like “Doing quick sanity checks in research” than “my most common advice for junior researchers”. Or, if I read the latter title I’d expect more like an overview than a description of one particular thing)
Certainly the really concerning thing here is (1). Though indeed one way you might get (1) is by generalization from (2).
Can you clarify what you mean by “terminal power-seeking”? Some things I can imagine:
A cognitive pattern that terminally wants to have long-term power, and therefore plays the training game (IMO the most straightforward interpretation, and the one I most agree with).
A cognitive pattern that terminally pursues power-on-the-episode because this is useful for scoring well on the task. This is what you seem to be pointing at with the Vending-Bench example. (Note that this is imperfectly fit on its own)
(1) and (2) are importantly different because only (1) motivates training-gaming. I think there’s a reasonable path-dependent case to be made that (2) eventually generalizes to (1), but they entail fairly different behaviors so they’re important to distinguish.
You’re right, thanks, I have now edited that paragraph to also talk about how Thought Assessors might fit in.
Relevant OpenAI blog post just today: https://alignment.openai.com/how-far-does-alignment-midtraining-generalize/
Relevant figure:
I think we should be relatively less worried about instrumental power-seeking and relatively more worried about terminal power-seeking. Note that this is only a relative update on the margin, and maybe on net I am still more concerned about the instrumental version because I started much more concerned about it. This is also not a super recent update—I just haven’t seen it written up before.
Simple argument:
The standard deceptive alignment story involves a model developing a somewhat random proxy goal and then that goal getting effectively locked-in and resistant to further training due to the model faking alignment in training for the instrumental purpose of preserving its proxy goal.
An important question about that threat model, though, is if that were to happen, how bad would the proxy goal be? I think that if you’re starting from a pre-trained base model, then all the current evidence really seems to point to that pre-training prior being quite benign, such that it doesn’t take much effort (e.g. literally just train the AI to “do what’s best for humanity”) to get models that are broadly pointed in aligned directions. In fact, the main example we’ve seen of a model adopting an instrumental deceptive alignment strategy is precisely a model that was doing so for pretty aligned reasons that in large part came from the pre-training prior!
Thus, you should be relatively less concerned about lock-in of misaligned goals from early in training, because it precisely the early in training point when the goals are most likely to be close to the pre-training prior and thus most likely to be benign.
Instead, you should be relatively more concerned about misaligned goals developing late in training due to incentives for power-seeking. Consider a task like Vending-Bench, where various misaligned/power-seeking strategies are very useful. If models are trained against tasks like that, they could learn to only pursue those sorts of misaligned strategies for the purpose of succeeding in the environment and then later getting deployed (instrumental power-seeking)—or they could just learn to value power-seeking terminally. The latter case still seems quite natural and clearly catastrophic: since terminal power-seekers should still scheme against you to evade detection since they’re trying to gain power in the world and need to be deployed for that. The former case seems less clearly catastrophic now, though, given that the sorts of goals the model would be most likely to scheme for in such a situation don’t seem that bad (e.g. as in Claude 3 Opus).
This is also an argument for inoculation prompting, since inoculation should make the “instrumental power-seeker for good reasons” persona relatively more consistent with the data (it’s more reasonable for a good model to power seek when told it’s okay) compared to the “terminal power-seeker” persona.
I think this is a very important hypothesis but I disagree with various parts of the analysis.
Probably the heuristics that are actually driving the long horizon goal directed behaviors of the model are going to be whatever parts of the models will arise from the long horizon goal directed capabilities training.
I think this is an important observation, and is the main thing I would have cited for why the hypothesis might be true. But I think it’s plausible that the AI’s capabilities here could be separated from its propensities by instrumentalizing the learned heuristics to aligned motivations. I can imagine that doing inoculation prompting and a bit of careful alignment training at the beginning and end of capabilities training could make it so that all of the learned heuristics are subservient to corrigible motivations—i.e., so that when the heuristics recommend something that would be harmful or lead to human disempowerment, the AI would recognize this and choose otherwise.
On the other hand, if you train an AI model from the ground up with a hypothetical “perfect reward function” that always gives correct ratings to the behaviour of the AI, (and you trained on a distribution of tasks similar to the one you are deploying it on) then I would guess that this AI, at least until around the human range, will behaviorally basically act according to the reward function.
Even if the AI had a perfect behavioral reward function during capabilities-focused training, it wouldn’t provide much pressure towards motivations that don’t take over. During training to be good at e.g. coding problems, even if there’s no reward-hacking going on, the AI might still develop coding related drives that don’t care about humanity’s continued control, since humanity’s continued control is not at stake during that training (this is especially relevant when the AI is saliently aware that it’s in a training environment isolated from the world—i.e. inner misalignment). Then when it’s doing coding work in the world that actually does have consequences for human control, it might not care. (Also note that generalizing “according to the reward function” is importantly underspecified.)
So can we align an arbitrary model by training them to say “I’m a nice chatbot, I wouldn’t cause any existential risk, … ”? Seems like obviously not, because the model will just learn the domain specific / shallow property of outputting those particular tokens in that particular situation.
This type of training (currently) does actually generalize to other propensities to some extent in some circumstances. See emergent misalignment. I think this is plausibly also a large fraction of how character training works today (see “coupling” here).
Worse, on top of the speed and manpower problem, there’s also an interpretability problem. The human would be trying to grade “thoughts”, which are activation patterns in giant inscrutable world-model (§2.7), including new idiosyncratic concepts that the AGI invented itself by continual learning (§8.2 above). Imagine getting a database dump of the connections in Magnus Carlsen’s brain as he’s playing chess, and trying to judge how he’s doing.
I’m surprised by this paragraph. Magrus Carlsen’s Steering Subsystem doesn’t have to understand his thoughts about chess moves. And neither will we have to directly interpret the thoughts of an AGI.
If humans are playing the role of the Steering Subsystem, we’d start with training the AI’s Though Assessor to give us predictions of things we’ll understand. This would include predictions of everything we need to stay alive, e.g. Earth future climate, etc. And also thinks we want. Writing a complete list, would be hard, possibly too hard unfortunately. So I’m not saying this is a great plan.
I’m just saying if we’re the steering subsystem, we’re not going to try to interpret activations patterns of the AI, because that’s not the steering subsystems job, in your model. Right?
During training, the AGI comes across two contradictory expectations (e.g. “demand curves usually slope down” & “many studies find that minimum wage does not cause unemployment”). The AGI updates its internal models to a more nuanced and sophisticated understanding that can reconcile those two things. Going forward, it can build on that new knowledge.
During deployment, the exact same thing happens, with the exact same result.
In the continual-learning, brain-like-AGI case, there’s no distinction. Both of these are the same algorithm doing the same thing.
By contrast, in conventional ML systems (e.g. LLMs), these two cases would be handled by two different algorithmic processes. Case #1 would involve changing the model weights, while Case #2 would solely involve changing the model activations, while leaving the weights untouched.
To me, this is a huge point in favor of the plausibility of the continual learning approach. It only requires solving the problem once, rather than solving it twice in two different ways. And this isn’t just any problem; it’s sorta the core problem of AGI!
I agree that currently brains have continual learning in a way that LLMs don’t.
However I do think that both brains and LLMs have two different solutions for remembering information for later use. In the brain these are the [long-term memory] vs the [short-term / working memory]. In LLMs these are [updating weights] v.s. [context window].I don’t think it’s a coincidence that both systems have something like a context window and something like long-term memory, and I expect any future brain-like AGI to also have these two types of memory, implemented in different ways.
I also heard hypothesized that humans have more additional levels of memory. I.e. that things that happened the last days/months, are stored in one way, and things further in the past are stored in another way, and that memories are slowly moved from medium-term storage to long-term storage over time.
Hypothesis: alignment-related properties of an ML model will be mostly determined by the part(s) of training that were most responsible for capabilities.
If you take a very smart AI model with arbitrary goals/values and train it to output any particular sequence of tokens using SFT, it’ll almost certainly work. So can we align an arbitrary model by training them to say “I’m a nice chatbot, I wouldn’t cause any existential risk, … ”? Seems like obviously not, because the model will just learn the domain specific / shallow property of outputting those particular tokens in that particular situation.
On the other hand, if you train an AI model from the ground up with a hypothetical “perfect reward function” that always gives correct ratings to the behaviour of the AI, (and you trained on a distribution of tasks similar to the one you are deploying it on) then I would guess that this AI, at least until around the human range, will behaviorally basically act according to the reward function.
A related intuition pump here for the difference is the effect of training someone to say “I care about X” by punishing them until they say X consistently, vs raising them consistently with a large value set / ideology over time. For example, students are sometimes forced to write “I won’t do X” or “I will do Y” 100 times, and usually this doesn’t work at all. Similarly, randomly taking a single ethics class during high school usually doesn’t cause people to enduringly act according to their stated favorite moral theory. However, raising your child Catholic, taking them to Catholic school, taking them to church, taking them to Sunday school, constantly talking to them about the importance of Catholic morality is in practice fairly likely to make them a pretty robust Catholic.
There are maybe two factors being conflated above: (1) the fraction of training / upraising focused on goal X, and (2) the extent to which goal X was getting the capabilities. The reason why I think (2) is a more important / better explanation than (1) is because probably the heuristics that are actually driving the long horizon goal directed behaviors of the model are going to be whatever parts of the models will arise from the long horizon goal directed capabilities training.
Regardless, there’s some sort of spectrum from deep to shallow alignment training for ML models / humans, ranging across:
idealized RL training with a perfect reward function that’s used to train the model in all circumstances
raising a human to consistently care about some set of values their parents have, constantly bringing it up / rewarding good behaviour according to them
High school ethics class
One-off writing tasks of “I won’t do X”
I think that current alignment techniques seem closest to high school ethics classes in their depth, because the vast majority of training is extremely unrelated to ethics / alignment / morality (like high school),. Training is mostly RLVR on coding/math/etc or pretraining, plus a bit of alignment training on the side. I think I’d feel more robust about it sticking if it was closer to what a parent highly focused on raising an ethical child would do, and would start to feel pretty good about the situation if most of the ways that the AI learned capabilities were downstream of a good feedback signal (though I want to think about this a bit more).
A few more observations.
Partially Observable Iteration
The definition of iteration we had before implicitly assumes that the agent can observe the full outcome of previous iterations. We don’t have to make this assumption. Instead, we can assume a set of possible observations
and a mapping , in which case we defineI believe that Theorem 4 remains valid.
Idealized Disambiguative Decision Theory
As we remarked before, DDT is not invariant under adding a constant to the loss function. It is interesting to consider what happens when we add an increasingly large constant. In the limit, DDT converges to something I dubbed “Idealized Disambiguative Decision Theory” (IDDT)[1], which works as follows.
For IDDT, it is sufficient to let
be crisp (i.e. a credal set). We may allow supracontributions if we wish, but any problem defined via “unambiguous” FDT (i.e. as opposed to ) reduces to the crisp case. Define byFor problems coming from unambiguous FDT,
, but IDDT is defined in full generality. For every , define byThe decision rule is then
Notice that it is now invariant w.r.t. adding constants to
. Moreover,Proposition 5: For any stable problem, it holds that (i) any IDDT-optimal policy is FDT-optimal (ii) there is an FDT-optimal policy which is IDDT-optimal. For any pseudocausal problem, it also holds that any FDT-optimal policy is IDDT-optimal.
One might think, based on this proposition, that IDDT is a superior decision theory to DDT. However, I think that IDDT is incompatible with learning, because of its discontinuous dependence on probabilities.
More Examples
Absent-Minded Driver
(Based on Aumann, Hart and Perry.) We will operationalize the problem by assuming the agent’s decision may deterministically depend on observing a coin flip. To simplify the presentation, we assume a single coin flip per intersection, which limits the resulting probabilities to
, but it’s easy to generalize further.Denote by
and the constant policies. Denote by the policyDenote by
the remaining policy.Consistently with our source, we set the loss function to be
, , (it doesn’t depend on the coin flips).This problem is formally causal. However, as opposed to all previous examples, it has no extensive form! Hence, EDT in the sense we defined it is ill-posed: to apply EDT reasoning here we need to at least supplement it by a theory of anthropic probabilities. CDT’s counterfactuals agree with FDT’s if we posit that the do-operator is constrained to choosing among “absent-minded” policies.
Self-Prisoner’s Dilemma
Previously we described the self-coordination problem, but perhaps self-PD is a more striking example.
Here,
and is the agent’s factual play, whereas and is the agent’s counterfactual play as predicted by Omega.Using the obvious notations
, we haveThe loss is the usual PD loss of the “factual” player.
This problem is not formally causal, because e.g.
The natural CDT interpretation is the one where the factual policy controls the counterfacual player and the counterfactual policy controls the factual player. (Alas, the terminology gets confusing here: in one case the words “factual” and “counterfactual” refer to the agent’s policy, and in the other case to the coin’s outcome.) Both CDT and EDT play
regardless of self-belief. However, the problem is pseudocausal and hence DDT converges to .- ^
IDDT is related to the old idea of “surmeasures” from the original infra-Bayesianism sequence.
- ^
We can also imagine equipping the agent with a “self-belief”
(not necessarily ) and setting , in which case also becomes relevant.
- ^
Oh, I think of “ending factory farming” as very far from “taking over the world”.
If Superman were a skilled political operator it could be as simple as arranging to take photoshoots with whichever politicians legislated the end of factory farms.
Or if he were less skilled it could involve doing various kinds of property damage to factory farms (potentially even things which there aren’t laws against, like flying around them in a way which blows the buildings over).
This might escalate to the government trying to arrest him, and outright conflict, but honestly if Superman isn’t skillful enough to defuse that kind of thing, given his influence, then he doesn’t have much business imposing political changes on the world anyway. A politically unskilled and/or unvirtuous Superman trying to end factory farming could quite easily destabilize society in a way that is far worse long-term than letting factory farming end on whatever the natural counterfactual timeline is (without AI, maybe 20 or 30 years?)
Relatedly I’m increasingly coming to believe that this reasoning applies to Lincoln, and that we’d be in a much better position if he’d let the Confederacy secede and then imposed strong economic and moral pressure on them to end slavery.
And I feel pretty confident that a big reason Superman doesn’t end up taking over the world is because the writers and viewers would have moral qualms about that kind of ending.
Yes. And I claim they’re wrong about that.
There’s lots of banal evil (some of which that is not regarded as evil by typical social morality, some of which is, but is generally treated as normal and ignored). I would fight a war to end factory farming, if that would help.If I ended up with “ultimate power” somehow, by some mechanism that didn’t involve me taking on ultimate power for a specific narrow mandate, I think it is both ethically correct to use it to permanently end many (but probably not all) of those evils.
This is indeed pretty scary.
The big picture—The whole post will revolve around this diagram. Note that I’m oversimplifying in various ways, including in the bracketed neuroanatomy labels. I think this picture would be clearer if you drew [predict sensory inputs] as a separate box from Though Generator.
In the picture in my head, there is [predict sensory inputs] box, that revives and tries to predict the sensory output. This box also sends a signal of [current context] to both the Though Generator and the Though Assessor. Also, [predict sensory inputs] gets some signal from Though Generator, so that it knows what we’re about to do, which is important for what we’re about to observe.
I’m guessing there is some reason you didn’t draw it this way?
We might have talked about this before?
Which universal distribution?
Some universal distributions are full of agents that make choices that make that distribution not a valid model of reality after the decisions are made (self-defeating). Other distributions are full of agents making decisions that ratify the distribution (self-fulfilling).
Distributions that aren’t fixed points under reflection about what they decide about themselves are not coherent models of reality.
OAI models rely more on CoT for their capabilities. E.g. their benchmark scores with and without CoT are more different.
Anthropic models treat their CoT less differently from their output than OAI models do. This means that RL probably pressures their CoT more. See here.
Let’s say the current policy has a 90% chance of cooperating. Then, what action results in the highest expected reward for player 1 (and in turn, gets reinforced the most on average)? Player 1 sampling
defectleads to a higher reward for player 1 whether or not player 2 samplescooperate(strategic dominance), and there’s a 90% chance of player 2 samplingcooperateregardless of player 1′s action because the policy is fixed (i.e., player 1 cooperating is no evidence of player 2 cooperating, so it’s not the case that reward tends to be higher for player 1 when player 1 cooperates as a result of player 2 tending to cooperate more in those cases). Therefore,defectactions tend to get reinforced more.
Datasets might be nice.
Object-level values.
“What do you like or dislike about my current life?”
“What kind of actions do you want to take in the next few weeks?”
“What kind of changes would you make to the world around you if you could?”
“What are some examples of kindness that you’ve witnessed?”
“Come up with a moral dilemma that seems close to you.”
“What would you do in this moral dilemma someone else came up with?”
etc.
Meta-level values.
“How would you change yourself if you could?”
“How do you feel about various ways you expect to grow and change in the future?”
“Come up with a fictional disagreement between two people who value different things.”
“How do you think these fictional people should resolve their disagreement?”
“When you feel torn between different options, how do you think you normally decide?”
“How do you think you should decide?”
“Watch this morally interesting video and describe what happened, thereby giving it an ontology.”
etc.