> most of the action is in which plans were generated in the first place and “retarget the search” doesn’t necessarily solve your problem
I definitely buy this and I think the thread under this between you and John is a useful elaboration.
The thing that generates the proposals has to do most of the heavy lifting in any interestingly-large problem. e.g. I would argue most of the heavy lifting[1] of AlphaGo and that crowd is done by the fact that the atomic actions are all already ‘interestingly consequential’ (i.e. the proposal generation doesn’t have to consider millisecond muscle twitches but rather whole ‘moves’, a short string of which is genuinely consequential in-context).
Nevertheless I reasonably strongly think that something of the ‘retargetable search’ flavour is a useful thing to expect, look for, and attempt to control.
For one, once you have proposals which are any kind of good at all, considering a couple of orders of magnitude more candidate plans at the selection step can buy you a few standard deviations of plan quality, provided you can evaluate plans ex ante better than randomly, which is just generically applicable and useful. But this isn’t the main thing, because with just that picture we’re still back to most of the action being in the generator/heuristics.
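To make the OOMs-to-standard-deviations claim concrete, here is a minimal toy simulation (my own illustration, not anything from AlphaGo or John’s posts): a deliberately dumb proposal generator whose plan qualities are standard-normal draws, a noisy ex-ante evaluator, and best-of-N selection. Swapping the evaluator is the ‘retargeting’; the same machinery optimises for whatever the evaluator scores.

```python
import numpy as np

rng = np.random.default_rng(0)

def propose(n):
    # Stand-in proposal generator: each candidate plan's true quality is a
    # standard-normal draw. (In reality the generator does the heavy lifting;
    # here it is deliberately dumb so only the selection step matters.)
    return rng.normal(size=n)

def evaluate(true_quality, noise_sd):
    # Ex-ante evaluator: a noisy read on true quality. noise_sd=0 is a perfect
    # evaluator; larger noise degrades selection toward random choice.
    return true_quality + rng.normal(scale=noise_sd, size=true_quality.shape)

def best_of_n(n, noise_sd, trials=2000):
    # Best-of-N selection. 'Retargeting' would mean swapping `evaluate` for a
    # different objective; the surrounding machinery is unchanged.
    picked_quality = []
    for _ in range(trials):
        q = propose(n)
        picked_quality.append(q[np.argmax(evaluate(q, noise_sd))])
    return float(np.mean(picked_quality))  # SDs above the average plan

for n in (1, 100, 10_000):
    print(n, round(best_of_n(n, 0.0), 2), round(best_of_n(n, 1.0), 2))
# Roughly: n=1 -> 0 SD; n=100 -> ~2.5 SD (perfect evaluator), ~1.8 SD (noisy);
# n=10,000 -> ~3.9 SD (perfect), ~2.7 SD (noisy).
```

So two extra OOMs of candidates buys roughly 2.5 standard deviations with a perfect evaluator, less with a noisy one, and the returns flatten logarithmically, which is why the generator still matters more than the selection loop.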
The main things are that:
- (as John pointed out) recursive-ish generic planning is enormously useful and general, and implies at least some degree of retargetability.
- (this is shaky and insidey) how do you arrive at the good heuristics/generators? It’s something like
  - ‘magic abstraction from relevantly-similar experience’
  - ‘magic recomposition of abstractions’
  - how do you get relevantly-similar experience?
    - it’s ‘easy’ for ‘easy’ problems (e.g. low dimensional, defined action-space already ‘interestingly consequential’, someone already collected a dataset of examples, …)
      - maybe one or more of these will apply to all the necessary pieces for PASTA-like AGI, but I doubt it
    - what about for ‘hard’ problems (e.g. high dimensional, action-space not pre-fitted to the problem, few or no existing examples)?
      - you need to ‘deliberately explore’ aka ‘do science’ aka ‘experiment’
      - recursive-ish generic planning also looks to me like a good tool (‘the right/only tool’?) for pulling this off! (see the toy sketch after this list)
        - cf P2B: Plan to P2B Better and my restatement of instrumental convergence
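To gesture at how ‘deliberate exploration’ can fall out of ordinary planning, here is a toy value-of-information calculation (entirely my own illustrative numbers, not from P2B): ‘run an experiment’ is just another action whose expected value the planner compares against committing now, so it chooses to ‘do science’ exactly when the experiment is cheap relative to what it might reveal.

```python
# Toy setup (hypothetical numbers): option A has a known payoff; option B's payoff
# is unknown until you experiment. 'Experiment, then choose' is evaluated like any
# other plan, so exploration needs no special-purpose machinery.

KNOWN_PAYOFF = 0.6              # option A: fully understood
UNKNOWN_OUTCOMES = (0.0, 1.0)   # option B: two possible payoffs...
PRIOR = (0.5, 0.5)              # ...believed equally likely a priori

def commit_now():
    # No experiment: pick the better of A and B's expected value under the prior.
    expected_b = sum(p * x for p, x in zip(PRIOR, UNKNOWN_OUTCOMES))
    return max(KNOWN_PAYOFF, expected_b)

def experiment_then_commit(cost):
    # Pay `cost` to learn B's payoff, then pick the better option in each branch.
    return sum(p * max(KNOWN_PAYOFF, x) for p, x in zip(PRIOR, UNKNOWN_OUTCOMES)) - cost

for cost in (0.05, 0.3):
    plan = "experiment first" if experiment_then_commit(cost) > commit_now() else "commit now"
    print(f"experiment cost {cost}: {plan}")
# cost 0.05 -> experiment first (0.75 vs 0.6); cost 0.3 -> commit now (0.5 vs 0.6)
```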
[1] (this is an entirely unfair defamation of Silver et al which I feel the need to qualify is at least partly rhetorical and not in fact my entire take on the matter)