TurnTrout comments on Distinguishing claims about training vs deployment

TurnTrout 23 Feb 2021 2:00 UTC
The first ambiguity I dislike here is that you could either be describing the emergence of instrumentality as robust, or the trait of instrumentality as robust. It seems like you’re trying to do the former, but because “robust” modifies “instrumentality”, the latter is a more natural interpretation.
One possibility is that we have to individuate these “instrumental convergence”-adjacent theses using different terminology. I think ‘robust instrumentality’ is basically correct for optimal actions, because there’s no question of ‘emergence’: optimal actions just are.
However, it doesn’t make sense to say the same for conjectures about how training such-and-such a system tends to induce property Y, for the reasons you mention. In particular, if property Y is not about goal-directed behavior, then it no longer makes sense to talk about ‘instrumentality’ from the system’s perspective. For example, I’m not sure it makes sense to say ‘edge detectors are robustly instrumental for this network structure on this dataset after X epochs’.
(These are early thoughts; I wanted to get them out, and may revise them later or add another comment.)
EDIT: In the context of MDPs, however, I prefer to talk in terms of (formal) POWER and of optimality probability, instead of in terms of robust instrumentality. I find ‘robust instrumentality’ to be better as an informal handle, but its formal operationalization seems better for precise thinking.
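A sketch of the two formal quantities mentioned in the EDIT, assuming state-based rewards R : S → ℝ drawn from a distribution 𝒟 and a discount rate γ ∈ (0, 1). The POWER line follows my reading of the normalized definition in the “Optimal Policies Tend to Seek Power” paper; the second line is only a simplified per-action version of optimality probability (the formal version there is more general than single actions).

```latex
\mathrm{POWER}_{\mathcal{D}}(s, \gamma) = \frac{1-\gamma}{\gamma}\,\mathbb{E}_{R \sim \mathcal{D}}\!\left[ V^{*}_{R}(s, \gamma) - R(s) \right]

\Pr_{R \sim \mathcal{D}}\!\left( a \in \arg\max_{a'} Q^{*}_{R}(s, a', \gamma) \right)
```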
- Richard_Ngo 25 Feb 2021 17:09 UTC
  > I think ‘robust instrumentality’ is basically correct for optimal actions, because there’s no question of ‘emergence’: optimal actions just are.
  If I were to put my objection another way: I usually interpret “robust” to mean something like “stable under perturbations”. But the perturbation of “change the environment, and then see what the new optimal policy is” is a rather unnatural one to think about; most ML people would more naturally think about perturbing an agent’s inputs, or its state, and seeing whether it still behaved instrumentally.
  A more accurate description might be something like “ubiquitous instrumentality”? But this isn’t a very aesthetically pleasing name.
  - TurnTrout 25 Feb 2021 17:26 UTC
    > But the perturbation of “change the environment, and then see what the new optimal policy is” is a rather unnatural one to think about; most ML people would more naturally think about perturbing an agent’s inputs, or its state, and seeing whether it still behaved instrumentally.
    Ah. To clarify, I was referring to holding an environment fixed, and then considering whether, at a given state, an action has a high probability of being optimal across reward functions. I think it makes sense to call those actions ‘robustly instrumental.’
  - TurnTrout 25 Feb 2021 17:28 UTC
    > A more accurate description might be something like “ubiquitous instrumentality”? But this isn’t a very aesthetically pleasing name.
    I’d considered ‘attractive instrumentality’ a few days ago, to convey the idea that certain kinds of subgoals are attractor points during plan formulation, but the usual reading of ‘attractive’ isn’t ‘having attractor-like properties.’
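A minimal sketch of the quantity described in the clarification above: hold the environment fixed, then ask how often an action at a given state is optimal across sampled reward functions. Everything concrete here is an assumption for illustration: the five-state deterministic environment, the hypothetical helper names `optimal_values` and `optimality_probability`, rewards drawn i.i.d. uniform on [0, 1] per state, γ = 0.9, and a Monte Carlo estimate rather than an exact computation.

```python
import random

# Hypothetical toy environment (not from the discussion): deterministic
# transitions, state -> {action: next_state}. States 1, 3 and 4 are absorbing.
TRANSITIONS = {
    0: {"left": 1, "right": 2},
    1: {"stay": 1},
    2: {"up": 3, "down": 4},
    3: {"stay": 3},
    4: {"stay": 4},
}


def optimal_values(reward, gamma, tol=1e-6):
    """Value iteration for a state-based reward: V(s) = R(s) + gamma * max_a V(T(s, a))."""
    values = {s: 0.0 for s in TRANSITIONS}
    while True:
        new_values = {
            s: reward[s] + gamma * max(values[t] for t in acts.values())
            for s, acts in TRANSITIONS.items()
        }
        delta = max(abs(new_values[s] - values[s]) for s in TRANSITIONS)
        values = new_values
        if delta < tol:
            return values


def optimality_probability(state, gamma=0.9, samples=2000, seed=0):
    """Estimate, for each action at `state`, the fraction of sampled reward
    functions for which that action is optimal. The environment stays fixed;
    only the reward function varies."""
    rng = random.Random(seed)
    acts = TRANSITIONS[state]
    counts = {a: 0 for a in acts}
    for _ in range(samples):
        # Assumption: rewards drawn i.i.d. uniform on [0, 1] for each state.
        reward = {s: rng.random() for s in TRANSITIONS}
        values = optimal_values(reward, gamma)
        # Q(s, a) = R(s) + gamma * V(T(s, a)); R(s) is shared across actions,
        # so comparing successor values is enough to find the optimal action.
        best = max(acts, key=lambda a: values[acts[a]])
        counts[best] += 1
    return {a: c / samples for a, c in counts.items()}


print(optimality_probability(0))
```

In this toy, `right` leads to a state from which two absorbing states are reachable, versus one for `left`, so it is optimal for more reward functions. A short calculation gives exactly 0.65 for `right` at γ = 0.9, and the estimate should land near that.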