porby comments on TurnTrout’s shortform feed

porby 7 Nov 2023 19:28 UTC
4 points
0
I’m using as a “an optimization constraint on actions/plans that correlated well with good performance on the training dataset; a useful heuristic”.
Alright, this is pretty much the same concept then, but the ones I’m referring to operate at a much lower and tighter level than thumbs-downing murder-proneness.
So...
Such constraints are, for example, the reason our LLMs are able to produce coherent speech at all, rather than just babbling gibberish.
Agreed.
… and yet this would still get in the way of qualitatively more powerful capabilities down the line, and a mind that can’t somehow slip these constraints won’t be a general intelligence.
While I agree these claims probably hold for the concrete example of thumbs-downing an example of murderproneness, I don’t see how they hold for the lower-level constraints that imply the structure of its capability. Slipping those constraints looks more like babbling gibberish.
By default, those would be constrained to be executed the way humans execute them, the way the AI was shown to do it during the training. But the whole point of an AGI is that it should be able to invent better solutions than ours. More efficient ways of thinking, weird super-technological replacements for our construction techniques, etc.
While it’s true that an AI probably isn’t going to learn true things which are utterly divorced from and unimplied by the training distribution, I’d argue that the low-level constraints I’m talking about both leave freedom for learning wildly superhuman internal representations and directly incentivize it during extreme optimization. An “ideal predictor” wouldn’t automatically start applying these capabilities towards any particular goal involving external world states by default, but it remains possible to elicit those capabilities incrementally.
Making the claim more concise: it seems effectively guaranteed that the natural optimization endpoint of one of these architectures would be plenty general to eat the universe if it were aimed in that direction. That process wouldn’t need to involve slipping any of the low-level constraints.
I’m guessing the disconnect between our models is where the aiming happens. I’m proposing that the aiming is best (and convergently) handled outside the scope of wildly unpredictable and unconstrained optimization processes. Instead, it takes place at a level where a system of extreme capability infers the gaps in specifications and applies conditions robustly. The obvious and trivial version of this is conditioning through prompts, but this is a weak and annoying interface. There are other paths that I suspect bottom out at equivalent power/safety yet should be far easier to use in a general way. These paths allow incremental refinement by virtue of not summoning up incorrigible maximizers by default.
If the result of refinement isn’t an incorrigible maximizer, then slipping the higher level “constraints” of this aiming process isn’t convergent (or likely), and further, the nature of these higher-level constraints would be far more thorough than anything we could achieve with RLHF.
In fact, my model says there’s no fundamental typological difference between “a practical heuristic on how to do a thing” and “a value” at the level of algorithmic implementation. It’s only in the cognitive labels we-the-general-intelligences assign them.
That’s pretty close to how I’m using the word “value” as well. Phrased differently, it’s a question of how the agent’s utilities are best described (with some asterisks around the non-uniqueness of utility functions and whatnot), and observable behavior may arise from many different implementation strategies—values, heuristics, or whatever.
- Thane Ruthenis 8 Nov 2023 3:39 UTC
  4 points
  0
  While I agree these claims probably hold for the concrete example of thumbs-downing an example of murderproneness, I don’t see how they hold for the lower-level constraints that imply the structure of its capability. Slipping those constraints looks more like babbling gibberish.
  Hm, I think the basic “capabilities generalize further than alignment” argument applies here?
  I assume that by “lower-level constraints” you mean correlations that correctly capture the ground truth of reality, not just the quirks of the training process. Things like “2+2=4”, “gravity exists”, and “people value other people”; as contrasted with “it’s bad if I hurt people” or “I must sum numbers up using the algorithm that humans gave me, no matter how inefficient it is”.
  Slipping the former type of constraints would be disadvantageous for ~any goal; slipping the latter type would only disadvantage a specific category of goals.
  But since they’re not, at the onset, categorized differently at the level of cognitive algorithms, a nascent AGI would experiment with slipping both types of constraints. The difference is that it’d quickly start sorting them in “ground-truth” vs. “value-laden” bins manually, and afterwards it’d know it can safely ignore stuff like “no homicides!” while consciously obeying stuff like “the axioms of arithmetic”.
  Instead, it takes place at a level where a system of extreme capability infers the gaps in specifications and applies conditions robustly. The obvious and trivial version of this is conditioning through prompts, but this is a weak and annoying interface. There are other paths that I suspect bottom out at equivalent power/safety yet should be far easier to use in a general way
  Hm, yes, I think that’s the crux. I agree that if we had an idealized predictor/a well-formatted superhuman world-model on which we could run custom queries, we would be able to use it safely. We’d be able to phrase queries using concepts defined in the world-model, including things like “be nice”, and the resultant process (1) would be guaranteed to satisfy the query’s constraints, and (2) likely (if correctly implemented) wouldn’t be “agenty” in ways that try to e.g. burst out of the server farm on which it’s running to eat the world.
  Does that align with what you’re envisioning? If yes, then our views on the issue are surprisingly close. I think it’s one of our best chances at producing an aligned AI, and it’s one of the prospective targets of my own research agenda.
  The problems are:
  - I don’t think the current mainstream research directions are poised to result in this. AI Labs have been very clear in their intent to produce an agent-like AGI, not a superhuman forecasting tool. I expect them to prioritize research into whatever tweaks to the training schemes would result in homunculi; not whatever research would result in perfect predictors + our ability to precisely query them.
  - What are the “other paths” you’re speaking of? As you’d pointed out, prompts are a weak and awkward way to run custom queries on the AI’s world-model. What alternatives are you envisioning?
  - porby 8 Nov 2023 21:54 UTC
    4 points
    0
    I assume that by “lower-level constraints” you mean correlations that correctly capture the ground truth of reality, not just the quirks of the training process. Things like “2+2=4”, “gravity exists”, and “people value other people”
    That’s closer to what I mean, but these constraints are even lower level than that. Stuff like understanding “gravity exists” is a natural internal implementation that meets some constraints, but “gravity exists” is not itself the constraint.
    In a predictor, the constraints serve as extremely dense information about what predictions are valid in what contexts. In a subset of predictions, the awareness that gravity exists helps predict. In other predictions, that knowledge isn’t relevant, or is even misleading (e.g. cartoon physics). The constraints imposed by the training distribution tightly bound the contextual validity of outputs.
    But since they’re not, at the onset, categorized differently at the level of cognitive algorithms, a nascent AGI would experiment with slipping both types of constraints.
    I’d agree that, if you already have an AGI of that shape, then yes, it’ll do that. I’d argue that the relevant subset of predictive training practically rules out the development of that sort of implementation, and even if it managed to develop, its influence would be bounded into irrelevance.
    Even in the absence of a nascent AGI, these constraints are tested constantly during training through noise and error. The result is a densely informative gradient pushing the implementation back towards a contextually valid state.
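    One way to make “densely informative gradient” concrete, as a minimal sketch: assume the predictor is trained with ordinary next-token cross-entropy loss (an assumption; the loss isn’t named above). The gradient with respect to the logits is then softmax(logits) minus the one-hot target, so every logit gets a correction at every position. The function names and the toy logits below are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_grad(logits: List[float], target: int) -> List[float]:
    """Gradient of -log softmax(logits)[target] with respect to the logits.

    The closed form is softmax(logits) - one_hot(target), so every logit
    receives a signal, not only the one for the observed token.
    """
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]


# Toy 4-token vocabulary; the numbers are made up for illustration.
grad = cross_entropy_grad([2.0, 0.5, 0.5, -1.0], target=1)
```

    In this toy case the observed token’s logit is pushed up (negative gradient) and every other token’s logit is pushed down in proportion to the probability mass wrongly placed on it, which is the sense in which any drift away from contextually valid output gets corrected.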
    Throughout the training process prior to developing strong capability and situational awareness internally, these constraints are both informing and bounding what kind of machinery makes sense in context. A nascent AGI must have served the extreme constraints of the training distribution to show up in the first place; its shape is bound by its development, and any part of that shape that “tests” constraints in a way that worsens loss is directly reshaped.
    Even if a nascent internal AGI of this type develops, if it isn’t yet strong enough to pull off complete deception with respect to the loss, the gradients will illuminate the machinery of that proto-optimizer and it will not survive in that shape.
    Further, even if we suppose a strong internal AGI develops that is situationally aware and is sufficiently capable and motivated to try deception, there remains the added dependency on actually executing that deception while never being penalized by gradients. This remains incredibly hard. It must transition into an implementation that satisfies the oppressive requirements of training while taking on the additional task of deception, all without suffering a detectable complexity penalty.
    These sorts of deceptive mesaoptimizer outcomes are far more likely when the optimizer has room to roam. I agree that you could easily observe this kind of testing and slipping when the constraints under consideration are far looser, but the kind of machine that is required by these tighter constraints doesn’t even bother with trying to slip constraints. It’s just not that kind of machine, and there isn’t a convergent path for it to become that kind of machine under this training mechanism.
    And despite that lack of an internal motivation to explore and exploit with respect to any external world states, it still has capabilities (in principle) which, when elicited, make it more than enough to eat the universe.
    Does that align with what you’re envisioning? If yes, then our views on the issue are surprisingly close. I think it’s one of our best chances at producing an aligned AI, and it’s one of the prospective targets of my own research agenda.
    Yup!
    I don’t think the current mainstream research directions are poised to result in this. AI Labs have been very clear in their intent to produce an agent-like AGI, not a superhuman forecasting tool. I expect them to prioritize research into whatever tweaks to the training schemes would result in homunculi; not whatever research would result in perfect predictors + our ability to precisely query them.
    I agree that they’re focused on inducing agentiness for usefulness reasons, but I’d argue the easiest and most effective way to get to useful agentiness actually routes through this kind of approach.
    This is the weaker leg of my argument; I could be proven wrong by some new paradigm. But if we stay on something like the current path, it seems likely that the industry will just do the easy thing that works rather than the inexplicable thing that often doesn’t work.
    What are the “other paths” you’re speaking of? As you’d pointed out, prompts are a weak and awkward way to run custom queries on the AI’s world-model. What alternatives are you envisioning?
    I’m pretty optimistic about members of a broad class that are (or likely are) equivalent to conditioning, since these paths tend to preserve the foundational training constraints.
    A simple example is [2302.08582] Pretraining Language Models with Human Preferences (arxiv.org). Having a “good” and “bad” token, or a scalarized goodness token, still pulls in many of the weaknesses of RLHF’s strangely shaped reward function, but there are trivial/naive extensions to this which I would anticipate being major improvements over the state of the art. For example, just have more (scalarized) metatokens representing more concepts such that the model must learn a distinction between being correct and sounding correct, because the training process split those into different tokens. There’s no limit on how many such metatokens you could have; throw a few hundred fine-grained classifications into the mix. You could also bake complex metatoken prompts into single tokens with arbitrary levels of nesting or bake the combined result into the weights (though I suspect weight-baking would come with some potential failure modes).^[1]
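    A minimal sketch of what the multi-metatoken data format could look like. Everything here is hypothetical: the attribute names, the ten-bucket scalarization, and the `<|name=bucket|>` token syntax are illustrative choices, not taken from the paper. It covers only the text formatting step; where the per-attribute scores come from (classifiers, labels) is out of scope.

```python
from typing import Dict, List

# Hypothetical attribute names and bucket count, chosen for illustration only.
ATTRIBUTES: List[str] = ["correct", "sounds_correct", "polite"]
NUM_BUCKETS = 10


def bucketize(score: float, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a score in [0, 1] to an integer bucket in [0, num_buckets - 1]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    return min(int(score * num_buckets), num_buckets - 1)


def metatoken(attribute: str, bucket: int) -> str:
    """Render one scalarized metatoken as a single special-token string."""
    return f"<|{attribute}={bucket}|>"


def condition_prefix(scores: Dict[str, float]) -> str:
    """Concatenate one metatoken per attribute, in a fixed order."""
    return "".join(metatoken(name, bucketize(scores[name])) for name in ATTRIBUTES)


def make_training_example(text: str, scores: Dict[str, float]) -> str:
    """Training time: prefix the text with metatokens from its measured scores."""
    return condition_prefix(scores) + text


def make_inference_prompt(prompt: str, targets: Dict[str, float]) -> str:
    """Inference time: prefix the prompt with metatokens for the desired scores."""
    return condition_prefix(targets) + prompt


# A fluent but wrong passage is labeled differently on the two axes.
example = make_training_example(
    "The capital of Australia is Sydney.",
    {"correct": 0.2, "sounds_correct": 0.95, "polite": 1.0},
)
```

    Because “correct” and “sounds_correct” are separate tokens with separate values, a training set containing examples like the one above gives the model a reason to represent the difference between them, and at inference one would ask for high values on both.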
    Another more recent path is observing the effect that conditions have on activations and dynamically applying the activation diffs to steer behavior. At the moment, I don’t know how to make this quite as strong as the previous conditioning scheme, but I bet people will figure out a lot more soon and that it leads somewhere similar.
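    A toy sketch of the activation-diff idea, under heavy assumptions: plain lists of floats stand in for a real model’s hidden activations at some layer, and the helper names are hypothetical. It shows only the arithmetic (mean activation with the condition minus mean activation without it, added back at run time with a scale), not how to hook into an actual network.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of a non-empty collection of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def steering_vector(conditioned: Sequence[Vector], unconditioned: Sequence[Vector]) -> Vector:
    """Activation diff: mean activation with the condition minus mean without it."""
    with_cond = mean_vector(conditioned)
    without_cond = mean_vector(unconditioned)
    return [c - u for c, u in zip(with_cond, without_cond)]


def apply_steering(activation: Vector, diff: Vector, scale: float = 1.0) -> Vector:
    """Add the scaled diff to an activation recorded without the condition."""
    return [a + scale * d for a, d in zip(activation, diff)]


# Made-up 2-dimensional activations, standing in for real hidden states.
diff = steering_vector(
    conditioned=[[1.0, 2.0], [3.0, 4.0]],
    unconditioned=[[0.0, 1.0], [2.0, 1.0]],
)
steered = apply_steering([0.5, 0.5], diff, scale=2.0)
```

    The `scale` parameter is the knob that has no direct analogue in prompt-based conditioning: the same diff can be applied weakly, strongly, or negated.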
    ^[1] There should exist some reward signal which could achieve a similar result in principle, but that goes back to the whole “we suck at designing rewards that result in what we want” issue. This kind of structure, as ad hoc as it is, is giving us an easier API to lever the model’s own capability to guide its behavior. I bet we can come up with even better implementations, too.
    - Thane Ruthenis 9 Nov 2023 8:50 UTC
      5 points
      1
      I’d argue that the relevant subset of predictive training practically rules out the development of that sort of implementation [...]
      Yeah, for sure. A training procedure that results in an idealized predictor isn’t going to result in an agenty thing, because it doesn’t move the system’s design towards it on a step-by-step basis; and a training procedure that’s going to result in an agenty thing is going to involve some unknown elements that specifically allow the system the freedom to productively roam.
      I think we pretty much agree on the mechanistic details of all of that!
      Another more recent path is observing the effect that conditions have on activations and dynamically applying the activation diffs to steer behavior
      — yep, I was about to mention that. @TurnTrout’s own activation-engineering agenda seems highly relevant here.
      I agree that they’re focused on inducing agentiness for usefulness reasons, but I’d argue the easiest and most effective way to get to useful agentiness actually routes through this kind of approach.
      But I still disagree with that. I think what we’re discussing requires approaching the problem with a mindset entirely foreign to the mainstream one. Consider how many words it took us to get to this point in the conversation, despite the fact that, as it turns out, we basically agree on everything. The inferential distance between the standard frameworks in which AI researchers think, and here, is pretty vast.
      Moreover, it’s in an active process of growing larger. For example, the very idea of viewing ML models as “just stochastic parrots” is being furiously pushed against in favour of a more agenty view. In comparison, the approach we’re discussing wants to move in the opposite direction, to de-personify ML models to the extent that even the animalistic connotation of “a parrot” is removed.
      The system we’re discussing won’t even be an “AI” in the sense usually thought. It would be an incredibly advanced forecasting tool. Even the closest analogue, the “simulators” framework, still carries some air of agentiness.
      And the research directions that get us from here to an idealized-predictor system look very different from the directions that go from here to an agenty AGI. They focus much more on building interfaces for interacting with the extant systems, such as the activation-engineering agenda. They don’t put much emphasis on things like:
      - Experimenting with better ways to train foundational models, with the idea of making models as close to a “done product” as they can be out-of-the-box.
      - Making the foundational models easier to converse with/making their output stream (text) also their input stream. This approach pretty clearly wants to make AIs into agents that figure out what you want, then do it; not a forecasting tool you need to build an advanced interface on top of in order to properly use.
      - RLHF-style stuff that bakes agency into the model, rather than accepting the need to cleverly prompt-engineer it for specific applications.
      - Thinking in terms like “an alignment researcher” (note the agency-laden framing) as opposed to “a pragmascope” or “a system for the context-independent inference of latent variables” or something.
      I expect that if the mainstream AI researchers do make strides in the direction you’re envisioning, they’ll only do it by coincidence. Then probably they won’t even realize what they’ve stumbled upon, do some RLHF on it, be dissatisfied with the result, and keep trying to make it have agency out of the box. (That’s basically what already happened with GPT-4, to @janus’ dismay.)
      And eventually they’ll figure out how.
      Even if you don’t think it’s the easiest path to AGI, it’s clearly a tractable problem, inasmuch as evolution managed it. I’m sure the world-class engineers at the major AI labs will manage it as well.
      That said, you’re making some high-quality novel predictions here, and I’ll keep them in mind when analyzing AI advancements going forward.
      - porby 9 Nov 2023 23:43 UTC
        4 points
        2
        I think what we’re discussing requires approaching the problem with a mindset entirely foreign to the mainstream one. Consider how many words it took us to get to this point in the conversation, despite the fact that, as it turns out, we basically agree on everything. The inferential distance between the standard frameworks in which AI researchers think, and here, is pretty vast.
        True!
        I expect that if the mainstream AI researchers do make strides in the direction you’re envisioning, they’ll only do it by coincidence. Then probably they won’t even realize what they’ve stumbled upon, do some RLHF on it, be dissatisfied with the result, and keep trying to make it have agency out of the box. (That’s basically what already happened with GPT-4, to @janus’ dismay.)
        Yup—this is part of the reason why I’m optimistic, oddly enough. Before GPT-likes became dominant in language models, there was all kinds of flailing that often pointed in more agenty-by-default directions. That flailing then found GPT because it was easily accessible and strong.
        Now, the set of architectural pieces subject to similar flailing is much smaller, and I’m guessing we’re only one round of at-scale benchmarks from a major lab away from the flailing shrinking dramatically further.
        In other words, I think the necessary work to make this path take off is small and the benefits will be greedily visible. I suspect one well-positioned researcher could probably swing it.
        That said, you’re making some high-quality novel predictions here, and I’ll keep them in mind when analyzing AI advancements going forward.
        Thanks, and thanks for engaging!
        Come to think of it, I’ve got a chunk of mana lying around for subsidy. Maybe I’ll see if I can come up with some decent resolution criteria for a market.