johnswentworth answers Seriously, what goes wrong with “reward the agent when it makes you smile”?

johnswentworth 12 Aug 2022 0:42 UTC
LW: 48 AF: 21
10
AF
I think the main concept missing here is compression: trained systems favor more compact policies/models/heuristics/algorithms/etc. The fewer parameters needed to implement the inner agent, the more parameters are free to vary, and therefore the more parameter-space-volume the agent takes up and the more likely it is to be found. (This is also the main argument for why overparameterized ML systems are able to generalize at all.)
The outer training loop doesn’t just select for high reward, it also implicitly selects for compactness. We expect it to find, not just policies which achieve high reward, but policies which are very compactly represented.
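A toy counting sketch of the volume argument above, under strong simplifying assumptions that are not in the original comment: parameters are independent bits, and a policy is "implemented" whenever a particular set of `k_pinned` bits takes the required values while every other bit is free. The names `volume_fraction` and `relative_volume` are hypothetical, and real networks have continuous, interacting parameters, so this only illustrates the direction of the effect.

```python
from fractions import Fraction


def volume_fraction(n_params: int, k_pinned: int) -> Fraction:
    """Fraction of all 2**n_params bit-settings that implement a policy
    which pins k_pinned bits and leaves the rest free (toy assumption)."""
    if not 0 <= k_pinned <= n_params:
        raise ValueError("k_pinned must be between 0 and n_params")
    free_settings = 2 ** (n_params - k_pinned)
    return Fraction(free_settings, 2 ** n_params)


def relative_volume(k_compact: int, k_bulky: int) -> int:
    """How many times more settings implement the compact policy.
    Assumes k_compact <= k_bulky."""
    return 2 ** (k_bulky - k_compact)


compact = volume_fraction(n_params=100, k_pinned=10)
bulky = volume_fraction(n_params=100, k_pinned=30)
print(compact, bulky, compact / bulky)
```

In this toy model, a policy that needs 20 fewer pinned bits occupies 2^20 times as much of parameter space, regardless of the total parameter count.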
Compression is the main reason we expect inner search processes to appear. Here’s the relevant argument from Risks From Learned Optimization:
In some tasks, good performance requires a very complex policy. At the same time, base optimizers are generally biased in favor of selecting learned algorithms with lower complexity. Thus, all else being equal, the base optimizer will generally be incentivized to look for a highly compressed policy.
One way to find a compressed policy is to search for one that is able to use general features of the task structure to produce good behavior, rather than simply memorizing the correct output for each input. A mesa-optimizer is an example of such a policy. From the perspective of the base optimizer, a mesa-optimizer is a highly-compressed version of whatever policy it ends up implementing: instead of explicitly encoding the details of that policy in the learned algorithm, the base optimizer simply needs to encode how to search for such a policy. Furthermore, if a mesa-optimizer can determine the important features of its environment at runtime, it does not need to be given as much prior information as to what those important features are, and can thus be much simpler.
The same argument applies to the terminal objectives/heuristics/proxies instilled in an RL-trained system: it may not terminally value the reward button being pushed or the human smiling or whatever, but its values should be generated from a relatively small, relatively simple set of things. For instance, a plausible Fermi estimate for humans is that our values are ultimately generated from ~tens of simple proxies. (And I would guess that modern ML training would probably result in even fewer, relative to human evolution.)
Furthermore, whatever terminal values are instilled in the RL-trained system, they do need to at least induce near-perfect optimization of the feedback signal on the training set; otherwise the outer training loop would select some other parameters. The outer training loop is still an optimization process, after all, so whatever policy the trained system ends up with should still be roughly-optimal. (There’s some potential wiggle room here insofar as the AI which takes off will be the first one to pass the threshold, and that may happen during a training run before convergence, but I think that’s probably not central to discussion here?)
Putting that all together: we don’t know that the AI will necessarily end up optimizing reward-button-pushes or smiles; there may be other similarly-compact proxies which correlate near-perfectly with reward in the training process. We can probably rule out “a spread of situationally-activated computations which steer its actions towards historical reward-correlates”, insofar as that spread is a much less compact policy-encoding than an explicit search process + simple objective(s).
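As a rough illustration of why a lookup table is a less compact encoding than a search process plus a simple objective, here is a minimal sketch on a hypothetical 5x5 open grid. The grid, the move names, and the functions `search_policy` and `lookup_policy` are all assumptions made for illustration; nothing here claims to show what a trained network actually learns.

```python
from collections import deque
from itertools import product

SIZE = 5  # hypothetical 5x5 open grid, no walls
MOVES = {"U": (0, -1), "D": (0, 1), "L": (-1, 0), "R": (1, 0)}
CELLS = list(product(range(SIZE), range(SIZE)))


def search_policy(state, goal):
    """Goal-agnostic breadth-first search.

    Returns the first move of a shortest path from state to goal,
    or None if the agent is already at the goal."""
    if state == goal:
        return None
    frontier = deque([(state, None)])
    seen = {state}
    while frontier:
        (x, y), first = frontier.popleft()
        for name, (dx, dy) in MOVES.items():
            nxt = (x + dx, y + dy)
            in_bounds = 0 <= nxt[0] < SIZE and 0 <= nxt[1] < SIZE
            if nxt in seen or not in_bounds:
                continue
            first_move = first or name  # remember the move taken at the root
            if nxt == goal:
                return first_move
            seen.add(nxt)
            frontier.append((nxt, first_move))
    return None


# The "memorized" encoding: one stored action per (state, goal) pair.
lookup_policy = {(s, g): search_policy(s, g) for s in CELLS for g in CELLS}

# Number of entries the table must store for this one small grid.
print(len(lookup_policy))
```

The table needs one entry per (state, goal) pair, so it grows with the square of the number of cells, while the search routine stays the same size and only needs the goal passed in.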
What links here?
- Gradient descent doesn’t select for inner search by Ivan Vendrov (13 Aug 2022 4:15 UTC; 47 points)
- Ivan Vendrov 12 Aug 2022 3:02 UTC
  10 points
  0
  Agreed with John, with the caveat that I expect search processes + simple objectives to only emerge from massively multi-task training. If you’re literally training an AI just on smiling, TurnTrout is right that “a spread of situationally-activated computations” is more likely since you’re not getting any value from the generality of search.
  The Deep Double Descent paper is a good reference for why gradient descent training in the overparametrized regime favors low complexity models, though I don’t know of explicit evidence for the conjecture that “explicit search + simple objectives” is actually lower complexity (in model space) than “bundle of heuristics”. Seems intuitive if model complexity is something close to Kolmogorov complexity, but would love to see an empirical investigation!
  - Ivan Vendrov 13 Aug 2022 4:23 UTC
    7 points
    1
    Thinking about this more, I think gradient descent (at least in the modern regime) probably doesn’t select for inner search processes, because it’s not actually biased towards low Kolmogorov complexity. More in my standalone post, and here’s a John Maxwell comment making a similar point.
- Thane Ruthenis 12 Aug 2022 1:16 UTC
  9 points
  3
  We can probably rule out “a spread of situationally-activated computations which steer its actions towards historical reward-correlates”, insofar as that spread is a much less compact policy-encoding than an explicit search process + simple objective(s).
  Not sure if I disagree with the object-level assertion, but I think some important caveats are missing here. We also have to take into account the paths through algorithm-space that SGD is likely to take, and that might change the form of the final compressed policy in non-intuitive ways.
  Another compact policy is “a superintelligence with a messy slew of values that figured out the training context and maneuvered the SGD around to learn the reward function without internalizing it + compress itself while keeping its messy values static”, and I think it’s a probable-enough end-point.
  It’s still likely that the “messy slew of values” won’t be that messy and will be near-perfect correlates for the reward, but given some (environment structure, reward) pairs, neither may be true. E. g., if the setup is such that strategic intelligence somehow develops well before the AI achieves optimal performance on the training set, then that intelligence will set in stone proxy objectives that aren’t good correlates of the reward.
- Quintin Pope 12 Aug 2022 3:32 UTC
  5 points
  1
  We can probably rule out “a spread of situationally-activated computations which steer its actions towards historical reward-correlates”, insofar as that spread is a much less compact policy-encoding than an explicit search process + simple objective(s).
  Seems like you can have a yet-simpler policy by factoring the fixed “simple objective(s)” into implicit, modular elements that compress many different objectives that may be useful across many different environments. Then at runtime, you feed the environmental state into your factored representation of possible objectives and produce a mix of objectives tailored to your current environment, which steer towards behaviors that achieved high reward on training runs similar to the current environment.
  That would seem quite close to “a spread of situationally-activated computations which steer its actions towards historical reward-correlates”, and it seems pretty similar to how my own values / goals arise in an environmentally-dependent manner without me having access to any explicitly represented “simple objective(s)” that I retain across environments.
  - Thane Ruthenis 12 Aug 2022 12:47 UTC
    4 points
    0
    That seems like a semantic difference? We may just as well call these modular elements the “objectives”, with them having different environment-specific local implementations.
    E. g., if my goal is “winning”, it would unfold into different short-term objectives depending on whether I’m playing chess or football, but we can still meaningfully call it a “goal”.
    - Quintin Pope 13 Aug 2022 5:10 UTC
      3 points
      1
      I’m confident that this is not a semantic difference. The modular elements I was describing represent a process for determining one’s objectives, depending on the environment and one’s current beliefs. It would be a type error to call them “objectives”, just as it would be a type error to call a search process your “plans”. They each represent compressions of possible objectives / plans, but are not those things themselves.
      Similarly, it would be incorrect to call a GPT model a “collection of sentences”, even though it is essentially a compression over many possible sentences.
      - Thane Ruthenis 13 Aug 2022 6:56 UTC
        1 point
        0
        Okay, suppose we feed many environment-states into some factored representation of possible objectives, and generate a lot of (environment, objectives) mappings for a given agent. In your model, is it possible to summarize these results somehow; is it possible to say something general about what the agent is trying to do in all of these environments? (E. g., like my football & chess example.)
        Quintin Pope 13 Aug 2022 7:47 UTC
        2 points
        0
        Yes, it’s possible to do summary statistics on the outputted goals, just like you can do summary statistics on the outputs of GPT-3, or in the plans produced by a given search algorithm. That doesn’t make generators of these things have the same type signature as the things themselves.
        
        My counterpoint to John is specifically about the sort of computational structures that can represent goals, while being both simple AND environment/belief-dependent. I’m saying simplicity does not push against representing goals in an environment-dependent way, because your generator of goals can be conditioned on the environment.
        Thane Ruthenis 13 Aug 2022 9:34 UTC
        2 points
        0
        Yes, it’s possible to do summary statistics on the outputted goals
        How “meaningful” would that summary be? Does my “winning at chess vs football” analogy fit what you’re describing, with “winning” being the compressed objective-generator and the actual win conditions of chess/football being the environment-specific objectives?
        Quintin Pope 15 Aug 2022 6:42 UTC
        3 points
        1
        My point is that you can have “goals” (things your search process steers the world towards) and “generators of goals”. These are different things, and you should not use the same name for them.
        
        More specifically, there is a difference in the computational type signature between generators and the things they generate. You can call these two things by whatever label you like, but they are not the same thing.
        
          You can look at a person’s plans / behavior in many different games and conclude that it demonstrates a common thread which you might label “winning”. But you should not call the latent cognitive generators responsible for this common thread by the same name you use for the world states the person’s search process steers towards in different environments.
        Thane Ruthenis 15 Aug 2022 7:28 UTC
        3 points
        0
          Alright, then it is a semantics debate from my perspective. I don’t think we’re actually disagreeing, now. Your “objective-generators” cleanly map to my “goals”, and your “objectives” to my “local implementations of goals” (or maybe “values” and “local interpretations of values”). That distinction definitely makes sense at the ground level. In my ontology, it’s a distinction between what you want and what achieving it looks like in a given situation.
        I think it makes more sense to describe it my way, though, since I suspect a continuum of ever-more-specific/local objectives (“winning” as an environment-independent goal, “winning” in this type of game, “winning” against the specific opponent you have, “winning” given this game and opponent and the tactic they’re using), rather than a dichotomy of “objective-generator” vs “objective”, but that’s a finer point.
        Thane Ruthenis 15 Aug 2022 8:32 UTC
        1 point
        0
        Although, digging into the previously-mentioned finer points, I think there is room for some meaningful disagreement.
        I don’t think there are goal-generators as you describe them. I think there are just goals, and then some plan-making/search mechanism which does goal translation/adaptation/interpretation for any given environment the agent is in. I. e., the “goal generators” are separate pieces from the “ur-goals” they take as input.
        And as I’d suggested, there’s a continuum of ever-more specific objectives. In this view, I think the line between “goals” and “plans” blurs, even, so that the most specific “objectives” are just “plans”. In this case, the “goal generator” is just the generic plan-making process working in a particular goal-interpreting regime.
        (Edited-in example: “I want to be a winner” → “I want to win at chess” → “I want to win this game of chess” → “I want to decisively progress towards winning in this turn” → “I want to make this specific move”. The early steps here are clear examples of goal-generation/translation (what does winning mean in chess?), the latter clear examples of problem-solving (how do I do well this turn?), but they’re just extreme ends of a continuum.)
        The initial goal-representations from which that process starts could be many things — mathematically-precise environment-independent utility functions, or goals defined over some default environment (as I suspect is the case with humans), or even step-one objective-generators, as you’re suggesting. But the initial representation being an objective-generator itself seems like a weirdly special case, not how this process works in general.
  - Daniel Kokotajlo 15 Aug 2022 23:18 UTC
    4 points
    0
    Seems like you can have a yet-simpler policy by factoring the fixed “simple objective(s)” into implicit, modular elements that compress many different objectives that may be useful across many different environments. Then at runtime, you feed the environmental state into your factored representation of possible objectives and produce a mix of objectives tailored to your current environment, which steer towards behaviors that achieved high reward on training runs similar to the current environment.
    Can you explain why this policy is yet-simpler? It sounds more complicated to me.
    - Quintin Pope 16 Aug 2022 19:39 UTC
      2 points
      0
      I’m saying that it’s simpler to have a goal generator that can be conditioned on the current environment, rather than memorizing each goal individually.
  - johnswentworth 12 Aug 2022 4:45 UTC
    3 points
    4
    That sure does sound like a description of a search algorithm, right there.
    - Quintin Pope 12 Aug 2022 5:03 UTC
      5 points
      1
      I’m not objecting to your assertion that some sort of search takes place. I’m objecting to your characterization of what sorts of objectives the search ends up pointed towards. Basically, I’m saying that “situationally activated heuristics that steer towards environment-dependent goals” is totally in line with a simplicity prior over cognitive structures leading to a search-like process.
      The whole reason you say that we should expect search processes is because they can compress many different environment- and belief-dependent plans into a simpler generator of such plans (the search), which takes in environment info, beliefs, and the agent’s simple, supposedly environment-independent, objectives, and produces a plan. So, the agent only needs to store the search process and its environment-independent objectives.
      I’m saying you can apply a similar “compress into an environment / beliefs conditioned generator” trick to the objectives as well, and get a generator of objectives that conditions on the environment and current beliefs to produce objectives for the search process.
      Thus, objectives remain environment-dependent, and will probably steer towards world states that resemble those which were rewarded during training. I think this is quite similar to “a spread of situationally-activated computations which steer its actions towards historical reward-correlates”, if involving rather more sophisticated cognition than phrases like “contextually activated heuristics” often imply.
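The distinction drawn in this sub-thread between objectives and generators of objectives can be sketched as type signatures. This is a minimal illustration under assumptions, not anyone's actual proposal: the type aliases, the `make_objective` generator, the `plan` stand-in for a search process, and the toy targets are all hypothetical.

```python
from typing import Callable, Dict, Iterable, Tuple

State = Tuple[int, int]                   # hypothetical world state
Objective = Callable[[State], float]      # scores a world state
Environment = str                         # hypothetical environment label
ObjectiveGenerator = Callable[[Environment], Objective]


def make_objective(env: Environment) -> Objective:
    """A generator: maps an environment to the objective used there."""
    # Toy, made-up targets; real objectives would not be a small table.
    targets: Dict[Environment, State] = {"chess": (7, 7), "football": (0, 3)}
    target = targets[env]

    def objective(state: State) -> float:
        # Negative Manhattan distance to the environment-specific target.
        return -float(abs(state[0] - target[0]) + abs(state[1] - target[1]))

    return objective


def plan(objective: Objective, candidates: Iterable[State]) -> State:
    """Stand-in for the search process: pick the best-scoring candidate."""
    return max(candidates, key=objective)


generator: ObjectiveGenerator = make_objective
candidates = [(0, 0), (0, 3), (7, 7)]
print(plan(generator("chess"), candidates))
print(plan(generator("football"), candidates))
```

Here `plan` accepts an `Objective`, and only the output of the generator fits that slot. Passing `make_objective` itself to `plan` is the kind of mismatch a static type checker would be expected to flag, which is one way to read the "type error" claim above.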
- TurnTrout 15 Aug 2022 3:54 UTC
  LW: 4 AF: 4
  0
  We can probably rule out “a spread of situationally-activated computations which steer its actions towards historical reward-correlates”, insofar as that spread is a much less compact policy-encoding than an explicit search process + simple objective(s).
  Here’s what I think you mean by an explicit search process:
  - In every situation, the neural network runs e.g. MCTS with a fixed leaf evaluation function (the simple objective).
  On this understanding of your argument, I would be surprised if it went through. Here are a few quick counterpoints.
  - Outside tiny maze environments, constantly running search with a fixed objective is downright stupid. You’re going to constantly time out; anytime guarantees won’t necessarily save you, since they’ll probably be weak or nonexistent. Constantly running search will also waste computation time which could have been saved by caching computations and then thinking about other things during the rest of the forward pass (aka shards). Fixed-depth neural networks also have a speed prior.
    (See also the independently written Gradient descent doesn’t select for inner search)
    EDIT: Reading your reply comment on that post:
  And there are many other tricks one can use too—like memoization on subsearches, or A*-style heuristic search, or (one meta-level up from A*) relaxation-based methods to discover heuristics. The key point is that these tricks are all very general purpose: they work on a very wide variety of search problems, and therefore produce general-purpose search algorithms which are more efficient than brute force (at least on realistic problems).
  More advanced general-purpose search methods seem to rely relatively little on enumerating possible actions and evaluating their consequences. By the time we get to human-level search capabilities, we see human problem-solvers spend most of their effort on nontrivial problems thinking about subproblems, abstractions and analogies rather than thinking directly about particular solutions.
  Memoization and heuristics would definitely count as part of a “spread” of contextually activated computations? Are we even disagreeing?
  - Humans are the one example we have of general intelligences; they surely have different e.g. inductive biases than ML, and that’s damn important. But even so, humans do not search in every situation in order to optimize a simple objective. Seems like an important hint.
    More generally: “If your theory of alignment and/or intelligence is correct, why doesn’t it explain the one datapoint we have on general intelligence?”
  - Any “simplicity prior” that ANNs have is not like the simplicity prior of a programming language. A single forward pass is acyclic, so loops / recursion are impossible. If NN layers were expressed as programs, the language in question would also have to be acyclic, which would make “search” quite a dumb thing to do anyways.
    EDIT Although in OP I did presume a recurrent state! Still important to keep in mind as we consider different architectures, though.
  - Initial contextually-activated-heuristics might (low-confidence) starve gradients towards search.
  For instance, a plausible Fermi estimate for humans is that our values are ultimately generated from ~tens of simple proxies. (And I would guess that modern ML training would probably result in even fewer, relative to human evolution.)
  Do you mean “hardcoded reward circuit” by “proxy”?
  What links here?
  - TurnTrout's comment on Clarifying mesa-optimization by Marius Hobbhahn (31 Mar 2023 20:20 UTC; 2 points)
  - johnswentworth 15 Aug 2022 6:05 UTC
    LW: 10 AF: 7
    0
    Do you mean “hardcoded reward circuit”
    I’m not that committed to the RL frame, but roughly speaking yes. Whatever values we have are probably generated by ~tens of hardcoded things. Anyway, on to the meat of the discussion...
    It seems like a whole bunch of people are completely thrown off by use of the word “search”. So let’s taboo that and talk about what’s actually relevant here.
    We should expect compression, and we should expect general-purpose problem solving (i.e. the ability to take a fairly arbitrary problem in the training environment and solve it reasonably well). The general-purpose part comes from a combination of (a) variation in what the system needs to do to achieve good performance in training, and (b) the recursive nature of problem solving, i.e. solving one problem involves solving a wide variety of subproblems. Compactness means that it probably won’t be a whole boatload of case-specific heuristics; lookup tables are not compact. A subroutine for reasonably-general planning or problem-solving (i.e. take a problem statement, figure out a plan or solution) is the key thing we’re talking about here. Possibly a small number of such subroutines for a few different problem-classes, but not a large number of such subroutines, because compactness. My guess would be basically just one.
    That probably will not look like babble and prune. It may look like a general-purpose heuristic-generator (like e.g. relaxation-based heuristic generation). Or it may look like general-purpose efficiency tricks, like caching solutions to common subproblems. Or it may look like hardcoded heuristics which are environment-specific but reasonably goal-agnostic (like e.g. the sort of thing in Mazes and Duality yields a maze-specific heuristic, but one which applies to a wide variety of path finding problems within that maze). Or it may look like hardcoded strategies for achieving instrumentally convergent goals in the training environment (really this is another frame on caching solutions to common subproblems). Or it may look like learning instrumentally convergent concepts and heuristics from the training environment (i.e. natural abstractions; really this is another frame on environment-specific but goal-agnostic heuristics). Probably it’s a combination of all of those, and others too.
    The important point is that it’s a problem-solving subroutine which is goal-agnostic (though possibly environment-specific). Pass in a goal, it figures out how to achieve that goal. And we do see this with humans: you can give humans pretty arbitrary goals, pretty arbitrary jobs to do, pretty arbitrary problems to solve, and they’ll go figure out how to do it.
    - Nora Belrose 15 Aug 2022 13:52 UTC
      6 points
      5
      I agree that AGI will need general purpose problem solving routines (by definition). I also agree that this requires something like recursive decomposition of problems into subproblems. I’m just very skeptical that the kinds of neural nets we’re training right now can learn to do anything remotely like that— I think it’s much more likely that people will hard code this type of reasoning into the compute graph with stuff like MCTS. This has already been pretty useful for e.g. MuZero. Once we’re hard coding search it’s less scary because it’s more interpretable and we can see exactly where the mesaobjective is.
      I also don’t really buy the compactness argument at all. I think neural nets are biased toward flat minima / broad basins but these don’t generally correspond to “simple” functions in the Kolmogorov sense; they’re more like equivalence classes of diverse bundles of heuristics that all get about the same train and val loss. I’m interpreting this paper as providing some evidence in that direction.
      - johnswentworth 15 Aug 2022 17:25 UTC
        2 points
        0
        I’m just very skeptical that the kinds of neural nets we’re training right now can learn to do anything remotely like that— I think it’s much more likely that people will hard code this type of reasoning into the compute graph with stuff like MCTS. This has already been pretty useful for e.g. MuZero. Once we’re hard coding search it’s less scary because it’s more interpretable and we can see exactly where the mesaobjective is.
        I hope that you’re right; that would make Retargeting The Search very easy, and basically eliminate the inner alignment problem. Assuming, of course, that we can somehow confidently rule out the rest of the net doing any search in more subtle ways.
    - TurnTrout 22 Aug 2022 15:54 UTC
      LW: 4 AF: 4
      0
      Probably it’s a combination of all of those, and others too
      This seems like roughly what I had in mind by “contextually activated computations” (probably with a few differences about when/how the subroutines will be goal-agnostic). I was imagining computations like “contextually activated cached death-avoidance policy influences” and “contextually activated steering of plans towards paperclip production, in generalizations of the historical reinforcement contexts for paperclip-reward.”