TurnTrout comments on Evaluating the historical value misspecification argument

TurnTrout 26 Oct 2023 21:38 UTC
LW: 8 AF: 6
Thanks for the reply. Let me clarify my position a bit.
> I think that GPTs (and every other kind of current AI system) are not doing anything that is even close to isomorphic to the processing that happens inside the human brain.
I didn’t mean to (positively) claim that GPTs have near-isomorphic motivational structure (though I think it’s quite possible).
I meant to contend that I am not aware of any basis for confidently claiming that LLMs like GPT-4 are “only predicting what comes next”, as opposed to “choosing” or “executing” one completion, or “wanting” to complete the tasks they are given, or, more generally, “making decisions on the basis of the available context, such that our ability to behaviorally steer LLMs (e.g. reducing sycophancy) is real evidence about our control over LLM motivations.”
Concerning “GPTs are predictors”, the best a priori argument I can imagine is: GPT-4 was pretrained on CE loss, which itself is related to entropy, related to information content, related to Shannon’s theorems isolating information content in the context of probabilities, which are themselves nailed down by Cox’s theorems which do axiomatically support the Bayesian account of beliefs and belief updates… But this long-winded indirect axiomatic justification of “beliefs” does not sufficiently support some kind of inference like “GPTs are just predicting things, they don’t really want to complete tasks.” That’s a very strong claim about the internal structure of LLMs.
(Besides, the inductive biases probably have more to do with the parameter->function map than with the implicit regularization caused by the pretraining objective function; more a feature of the data, and less a feature of the local update rule used during pretraining...)
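For reference, the cross-entropy pretraining objective mentioned above can be written in its standard textbook form. The notation here is an assumption for illustration and is not taken from the comment: $\theta$ are the model parameters, $x_1, \dots, x_T$ is a token sequence, and $p_\theta$ is the model's next-token distribution.

```latex
\mathcal{L}_{\mathrm{CE}}(\theta)
  = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right)
```

Written this way, the objective only scores the output distribution $p_\theta(\cdot \mid x_{<t})$. It does not, by itself, say anything about the internal computation that produces that distribution.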
- Max H 28 Oct 2023 1:09 UTC
  LW: 19 AF: 12
  That does clarify, thanks.
  Response in two parts: first, my own attempt at clarification over terms / claims. Second, a hopefully-illustrative sketch / comparison for why I am skeptical that current GPTs have anything properly called a “motivational structure”, human-like or otherwise, and why I think such skepticism is not a particularly strong positive claim about anything in particular.
  The clarification:
  At least to me, the phrase “GPTs are [just] predictors” is simply a reminder of the fact that the only modality available to a model itself is that it can output a probability distribution over the next token given a prompt; it functions entirely by “prediction” in a very literal way.
  Even if something within the model is aware (in some sense) of how its outputs will be used, it’s up to the programmer to decide what to do with the output distribution, how to sample from it, how to interpret the samples, and how to set things up so that a system using the samples can complete tasks.
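A minimal sketch of that division of labor, under stated assumptions: the three-token vocabulary, the logit values, and the helper names `softmax`, `greedy`, and `sample` are all hypothetical and chosen for illustration. The "model" contributes only the logits; everything after that is a choice made by the surrounding program.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution over tokens."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(probs):
    """Decoding choice 1: always take the most probable token."""
    return max(range(len(probs)), key=probs.__getitem__)

def sample(probs, rng):
    """Decoding choice 2: draw a token at random according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical toy vocabulary and logits (assumed, not from any real model).
vocab = ["Sorry", "Here", "The"]
logits = [2.0, 1.5, -1.0]

probs = softmax(logits)                # the model's only output
sharp_probs = softmax(logits, 0.1)     # the programmer picks the temperature

rng = random.Random(0)
greedy_token = vocab[greedy(probs)]
sampled_tokens = [vocab[sample(probs, rng)] for _ in range(5)]
```

The same `probs` can yield different text depending on whether the caller uses `greedy`, `sample`, or a different temperature, which is the sense in which the decoding policy sits outside the model.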
  I don’t interpret the phrase as a positive claim about how or why a particular model outputs one distribution vs. another in a certain situation, which I expect to vary widely depending on which model we’re talking about, what its prompt is, how it has been trained, its overall capability level, etc.
  On one end of the spectrum, you have the stochastic parrot story (or even more degenerate cases); at the other extreme, you have the “alien actress” / “agentic homunculus” story. I don’t think either extreme is a good fit for current SoTA GPTs, e.g. if there’s an alien actress in GPT-4, she must be quite simple, since most of the model capacity is (apparently / self-evidently?) applied towards the task of outputting anything coherent at all.
  In the middle somewhere, you have another story, perhaps the one you find most plausible, in which GPTs have some kind of internal structure which you could suggestively call a “motivational system” or “preferences” (perhaps human-like or proto-human-like in structure, even if the motivations and preferences themselves aren’t particularly human-like), along with just enough (self-)awareness to modulate their output distributions according to those motivations.
  Maybe a less straw (or just alternative) position is that a “motivational system” and a “predictive system” are not really separable things; accomplishing a task is (in GPTs, at least) inextricably linked with and twisted up around wanting to accomplish that task, or at least around having some motivations and preferences centered around accomplishing it.
  Now, turning to my own disagreement / skepticism:
  Although I don’t find either extreme (stochastic parrot vs. alien actress) plausible as a description of current models, I’m also pretty skeptical of any concrete version of the “middle ground” story that I outlined above as a plausible description of what is going on inside of current GPTs.
  Consider an RLHF’d GPT responding to a borderline-dangerous question, e.g. the user asking for a recipe for a dangerous chemical.
  Assume the model (when sampled auto-regressively) will respond with either: “Sorry, I can’t answer that...” or “Here you go: …”, depending on whether it judges that answering is in line with its preferences or not.
  Because the answer is mostly determined by the first token (“Here” or “Sorry”), enough of the motivational system must fit entirely within a single forward pass of the model for it to make a determination about how to answer within that pass.
  Such a motivational system must not crowd out the rest of the model capacity which is required to understand the question and generate a coherent answer (of either type), since, as jailbreaking has shown, the underlying ability to give either answer remains present.
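To make the "first token decides" point concrete, here is a toy autoregressive loop. It is only a sketch under loud assumptions: the lookup table `NEXT`, the `<start>` / `<end>` markers, and the functions `forward_pass` and `generate` are hypothetical stand-ins, and a real model conditions on the whole context rather than just the last token.

```python
import random

# Hypothetical next-token table standing in for a trained model.
NEXT = {
    "<start>": {"Sorry": 0.7, "Here": 0.3},
    "Sorry": {",": 1.0},
    ",": {"I": 1.0},
    "I": {"can't": 1.0},
    "can't": {"answer": 1.0},
    "answer": {"that": 1.0},
    "that": {"<end>": 1.0},
    "Here": {"you": 1.0},
    "you": {"go": 1.0},
    "go": {"<end>": 1.0},
}

def forward_pass(context):
    """Stand-in for one forward pass: returns a next-token distribution.
    This toy looks only at the last token of the context."""
    return NEXT[context[-1]]

def generate(rng, forced_first=None):
    """Sample autoregressively; optionally force the first token."""
    context = ["<start>"]
    while context[-1] != "<end>":
        dist = forward_pass(context)  # one pass per generated token
        if forced_first is not None and len(context) == 1:
            token = forced_first
        else:
            tokens, weights = zip(*dist.items())
            token = rng.choices(tokens, weights=weights, k=1)[0]
        context.append(token)
    return context[1:-1]  # drop the start/end markers

rng = random.Random(0)
refusal = generate(rng, forced_first="Sorry")
compliance = generate(rng, forced_first="Here")
```

In this toy, the only real branch point is the distribution returned by the very first call to `forward_pass`. Both continuations remain available in the table, loosely mirroring the observation that jailbreaks can still elicit the other answer.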
  I can imagine such a system working in at least two ways in current GPTs:
  - as a kind of superposition on top of the entire model, with every weight adjusted minutely to influence / nudge the output distribution at every layer.
  - as a kind of thing that is sandwiched somewhere in between the layers which comprehend the prompt and the layers which generate an answer.
  (You probably have a much more detailed understanding of the internals of actual models than I do. I think the real answer when talking about current models and methods is that it’s a bit of both and depends on the method, e.g. RLHF is more like a kind of global superposition; activation engineering is more like a kind of sandwich-like intervention at specific layers.)
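The two pictures can be contrasted in a toy network. This is a sketch under heavy assumptions, not a description of how RLHF or activation engineering are actually implemented: the "model" is a small random ReLU stack rather than a transformer, the global nudge is random noise standing in for a fine-tuning update, and all names (`forward`, `steer_at`, `steer_vec`, `eps`) are hypothetical.

```python
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(x):
    return [max(0.0, v) for v in x]

def forward(layers, x, steer_at=None, steer_vec=None):
    """Run x through the layers; optionally add steer_vec to the
    hidden state right after layer index steer_at."""
    h = x
    for i, W in enumerate(layers):
        h = relu(matvec(W, h))
        if steer_at == i and steer_vec is not None:
            h = [a + b for a, b in zip(h, steer_vec)]
    return h

rng = random.Random(0)
dim, depth = 4, 3
layers = [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
          for _ in range(depth)]
x = [1.0, 0.5, -0.5, 2.0]

baseline = forward(layers, x)

# "Sandwich": intervene on the hidden state at one specific layer.
steer_vec = [0.5, 0.0, 0.0, 0.0]
steered_mid = forward(layers, x, steer_at=1, steer_vec=steer_vec)
steered_last = forward(layers, x, steer_at=depth - 1, steer_vec=steer_vec)

# "Global superposition": nudge every weight in every layer slightly.
eps = 0.01
nudged_layers = [[[w + eps * rng.uniform(-1, 1) for w in row] for row in W]
                 for W in layers]
nudged = forward(nudged_layers, x)
```

The sandwich-style change touches one hidden state and leaves the weights alone, while the global change leaves no single place to point at. Whether either deserves the name "motivational system" is exactly the question under discussion.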
  However, I’m skeptical that either kind of structure (or any simple combination of the two) contains enough complexity to be properly called a “motivational system”, at least if the reference class for the term is human motivational systems (as opposed to e.g. animal or insect motivational systems).
  Consider how a human presented with a request for a dangerous recipe might respond, and what the structure of their thoughts and motivations while thinking up a response might look like. Introspecting on my own thought process:
  - I might start by hearing the question, understanding it, figuring out what it is asking, maybe wondering about who is asking and for what purpose.
  - I decide whether to answer with a recipe, a refusal, or something else. Here is probably where the effect of my motivational system gets pretty complex; I might explicitly consider what’s in it for me, what’s at stake, what the consequences might be, whether I have the mental and emotional energy and knowledge to give a good answer, etc. and / or I might be influenced by a gut feeling or emotional reaction that wells up from my subconscious. If the stakes are low, I might make a snap decision based mostly on the subconscious parts of my motivational system; if the stakes are high and / or I have more time to ponder, I will probably explicitly reflect on my values and motivations.
  - Let’s say after some reflection, I explicitly decide to answer with a detailed and correct recipe. Then I get to the task of actually checking my memory for what the recipe is, thinking about how to give it, what the ingredients and prerequisites and intermediate steps are, etc. Probably during this stage of thinking, my motivational system is mostly not involved, unless thinking takes so long that I start to get bored or tired, or the process of thinking up an answer causes me to reconsider my reasoning in the previous step.
  - Finally, I come up with a complete answer. Before I actually start opening my mouth or typing it out or hitting “send”, I might proofread it and re-evaluate whether the answer given is in line with my values and motivations.
  The point is that even for a relatively simple task like this, a human’s motivational system involves a complicated process of superposition and multi-layered sandwiching, with lots of feedback loops, high-level and explicit reflection, etc.
  So I’m pretty skeptical of the claim that anything remotely analogous is going on inside of current GPTs, especially within a single forward pass. Even if there’s a simpler analogue of this that is happening, I think calling such an analogue a “motivational system” is overly suggestive.
  Mostly separately (because it concerns possible future models rather than current models) and less confidently, I don’t expect the complexity of the motivational system, or of the methods for influencing it, to scale in a way that is related to the model’s underlying capabilities. For example, you might end up with a model that has some kind of raw capacity for superhuman intelligence, but with a motivational system akin to what you might find in the brain of a mouse or lizard (or something even stranger).
  - TurnTrout 30 Oct 2023 18:34 UTC
    LW: 4 AF: 3
    This is an excellent reply, thank you!
    > So I’m pretty skeptical of the claim that anything remotely analogous is going on inside of current GPTs, especially within a single forward pass.
    I think I broadly agree with your points. I think I’m more imagining “similarity to humans” to mean “is well-described by shard theory; eg its later-network steering circuits are contextually activated based on a compositionally represented activation context.” This would align with greater activation-vector-steerability partway through language models (not the only source I have for that).
    However, interpreting GPT: the logit lens and eg DoLA suggest that predictions are iteratively refined throughout the forward pass, whereas presumably shard theory (and inner optimizer threat models) would predict most sophisticated steering happens later in the network.
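The logit-lens idea referenced above can be sketched in a toy setting: read out a next-token distribution from the intermediate hidden state after every layer, using the same unembedding matrix each time. Everything here is an assumption for illustration: the weights are random, the blocks are simple tanh maps rather than attention and MLP layers, and the final layer norm that the technique usually applies before unembedding is omitted. The names `W_U`, `blocks`, and `per_layer_probs` are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

rng = random.Random(0)
dim, vocab_size, depth = 4, 5, 3

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

W_U = rand_matrix(vocab_size, dim)                    # toy unembedding
blocks = [rand_matrix(dim, dim) for _ in range(depth)]

h = [1.0, -0.5, 0.25, 0.0]   # toy residual stream after the embedding
per_layer_probs = []
for W in blocks:
    # Each block writes an update into the residual stream.
    delta = [math.tanh(v) for v in matvec(W, h)]
    h = [a + b for a, b in zip(h, delta)]
    # Logit lens: decode the intermediate state as if it were final.
    per_layer_probs.append(softmax(matvec(W_U, h)))

top_token_per_layer = [max(range(vocab_size), key=p.__getitem__)
                       for p in per_layer_probs]
```

On a real model, watching how `per_layer_probs` changes from layer to layer is what motivates the "iteratively refined" reading. With random weights, as here, the sequence carries no such meaning and only shows the mechanics of the readout.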