One quibble: in your comment on my previous post, you distinguished between optimal policies versus the policies that we’re actually likely to train. But this isn’t a component of my distinction—in both cases I’m talking about policies which actually arise from training.
Right—I was pointing at the similarity in that both of our distinctions involve some aspect of training, which breaks from the tradition of not really considering training’s influence on robust instrumentality. “Quite similar” was poor phrasing on my part, because I agree that our two distinctions are materially different.
On terminology, would you prefer the “training goal convergence thesis”?
I think that “training goal convergence thesis” is way better, and I like how it accommodates dual meanings: the “goal” may be an instrumental or a final goal.
I think “robust” is just as misleading a term as “convergence”, in that neither is usually defined in terms of what happens when you train in many different environments.
Can you elaborate? ‘Robust’ seems natural for talking about robustness to perturbation in the initial AI design (different objective functions, to the extent that that matters) and robustness against choice of environment.
And so, given switching costs, I think it’s fine to keep talking about instrumental convergence.
I agree that switching costs are important to consider. However, I’ve recently started caring more about establishing and promoting clear nomenclature, both for the purposes of communication and for clearer personal thinking.
My model of the ‘instrumental convergence’ situation is something like:
The switching costs are primarily sensitive to how firmly established the old name is, how widely used it is, and how many “entities” would have to adopt the new name.
I think that if researchers generally agreed that ‘robust instrumentality’ is a clearer name[1] and used it to talk about the concept, the shift would naturally propagate through AI alignment circles and be complete within a year or two. This is just my gut sense, though.
The switch from “optimization daemons” to “mesa-optimizers” seemed to go pretty well.
But ‘optimization daemons’ didn’t have a Wikipedia page yet (unlike ‘instrumental convergence’).
Of course, all of this is conditional on your agreeing that ‘robust instrumentality’ is in fact a better name; if you disagree, I’m interested in hearing why.[2] But if you agree, I think the switch would probably happen if people are willing to absorb a small communication overhead for a while as the meme propagates. (And I do think it’s small—I talk about robust instrumentality all the time, and it really doesn’t take long to explain the switch.)
On the bright side, I think the situation for ‘instrumental convergence / robust instrumentality’ is better than the one for ‘corrigibility’, where we have a single handle for wildly different concepts!
[1] A clearer name—once explained to the reader, at least; ‘robust instrumentality’ unfortunately isn’t as transparent as ‘factored cognition hypothesis.’
[2] Especially before the 2019 LW review book is published, as it seems probable that Seeking Power is Often Robustly Instrumental in MDPs will be included. I am ready to be convinced that there exists an even better name than ‘robust instrumentality’ and to rework my writing accordingly.
Can you elaborate? ‘Robust’ seems natural for talking about robustness to perturbation in the initial AI design (different objective functions, to the extent that that matters) and robustness against choice of environment.
The first ambiguity I dislike here is that you could either be describing the emergence of instrumentality as robust, or the trait of instrumentality as robust. It seems like you’re trying to do the former, but because “robust” modifies “instrumentality”, the latter is a more natural interpretation.
For example, if I said “life on earth is very robust”, the natural interpretation is: given that life exists on earth, it’ll be hard to wipe it out. Whereas an emergence-focused interpretation (like yours) would be: life would probably have emerged given a wide range of initial conditions on earth. But I imagine that very few people would interpret my original statement in that way.
The second ambiguity I dislike: even if we interpret “robust instrumentality” as the claim that “the emergence of instrumentality is robust”, this still doesn’t get us what we want. Bostrom’s claim is not just that instrumental reasoning usually emerges; it’s that specific instrumental goals usually emerge. But “instrumentality” is more naturally interpreted as the general tendency to do instrumental reasoning.
On switching costs: Bostrom has been very widely read, so changing one of his core terms will be much harder than changing a niche working handle like “optimisation daemon”, and would probably leave a whole bunch of people confused for quite a while. I do agree the original term is flawed though, and will keep an eye out for potential alternatives—I just don’t think robust instrumentality is clear enough to serve that role.
The first ambiguity I dislike here is that you could either be describing the emergence of instrumentality as robust, or the trait of instrumentality as robust. It seems like you’re trying to do the former, but because “robust” modifies “instrumentality”, the latter is a more natural interpretation.
One possibility is that we have to individuate these “instrumental convergence”-adjacent theses using different terminology. I think ‘robust instrumentality’ is basically correct for optimal actions, because there’s no question of ‘emergence’: optimal actions just are.
However, it doesn’t make sense to say the same for conjectures about how training such-and-such a system tends to induce property Y, for the reasons you mention. In particular, if property Y is not about goal-directed behavior, then it no longer makes sense to talk about ‘instrumentality’ from the system’s perspective. For example, I’m not sure it makes sense to say ‘edge detectors are robustly instrumental for this network structure on this dataset after X epochs’.
(These are early thoughts; I wanted to get them out, and may revise them later or add another comment)
EDIT: In the context of MDPs, however, I prefer to talk in terms of (formal) POWER and of optimality probability, instead of in terms of robust instrumentality. I find ‘robust instrumentality’ to be better as an informal handle, but its formal operationalization seems better for precise thinking.
I think ‘robust instrumentality’ is basically correct for optimal actions, because there’s no question of ‘emergence’: optimal actions just are.
If I were to put my objection another way: I usually interpret “robust” to mean something like “stable under perturbations”. But the perturbation of “change the environment, and then see what the new optimal policy is” is a rather unnatural one to think about; most ML people would more naturally think about perturbing an agent’s inputs, or its state, and seeing whether it still behaved instrumentally.
A more accurate description might be something like “ubiquitous instrumentality”? But this isn’t a very aesthetically pleasing name.
But the perturbation of “change the environment, and then see what the new optimal policy is” is a rather unnatural one to think about; most ML people would more naturally think about perturbing an agent’s inputs, or its state, and seeing whether it still behaved instrumentally.
Ah. To clarify, I was referring to holding an environment fixed, and then considering whether, at a given state, an action has a high probability of being optimal across reward functions. I think it makes sense to call those actions ‘robustly instrumental.’
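To make that notion concrete, here is a minimal sketch (my own illustration, not from this thread or the paper): it fixes a tiny deterministic MDP, samples reward functions uniformly over states, and estimates how often each action at a given state is optimal under the sampled rewards. The toy dynamics, the uniform reward distribution, and the function names are all assumptions made for the example.

```python
import numpy as np

# Toy deterministic MDP, fixed throughout: transitions[s][a] = next state.
# State 3 is absorbing; state 0 can still reach every state.
transitions = {
    0: {"left": 1, "right": 2, "stay": 0},
    1: {"left": 1, "right": 3, "stay": 1},
    2: {"left": 3, "right": 2, "stay": 2},
    3: {"left": 3, "right": 3, "stay": 3},
}
states = list(transitions)
actions = ["left", "right", "stay"]
gamma = 0.9

def optimal_values(reward):
    """Value iteration under the fixed dynamics, for state-based rewards."""
    v = np.zeros(len(states))
    for _ in range(200):
        v = np.array([
            reward[s] + gamma * max(v[transitions[s][a]] for a in actions)
            for s in states
        ])
    return v

def optimality_frequency(state, n_samples=1000, seed=0):
    """Estimate, for each action at `state`, the fraction of sampled
    reward functions for which that action is optimal."""
    rng = np.random.default_rng(seed)
    counts = {a: 0 for a in actions}
    for _ in range(n_samples):
        reward = rng.uniform(0.0, 1.0, size=len(states))  # illustrative reward distribution
        v = optimal_values(reward)
        q = {a: reward[state] + gamma * v[transitions[state][a]] for a in actions}
        best = max(q.values())
        for a in actions:
            if np.isclose(q[a], best):
                counts[a] += 1
    return {a: counts[a] / n_samples for a in actions}

print(optimality_frequency(state=0))
```

Running this prints, for each action at state 0, the estimated fraction of sampled reward functions under which it is optimal—the quantity being gestured at above with “high probability of being optimal across reward functions.”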
A more accurate description might be something like “ubiquitous instrumentality”? But this isn’t a very aesthetically pleasing name.
I’d considered ‘attractive instrumentality’ a few days ago, to convey the idea that certain kinds of subgoals are attractor points during plan formulation, but the usual reading of ‘attractive’ isn’t ‘having attractor-like properties.’