abramdemski comments on Formal Inner Alignment, Prospectus

abramdemski 17 May 2021 21:25 UTC
LW: 15 AF: 12
I agree with much of this. I over-sold the “absence of negative story” story; of course there has to be some positive story in order to be worried in the first place. I guess a more nuanced version would be that I am pretty concerned about the broadest positive story, “mesa-optimizers are in the search space and would achieve high scores in the training set, so why wouldn’t we expect to see them?”—and think more specific positive stories are mostly of illustrative value, rather than really pointing to gears that I expect to be important. (With the exception of John’s story, which did point to important gears.)

With respect to formalization, I did say up front that less-formal work, and empirical work, is still valuable. However, that’s not the sense the post conveyed overall, so I get it. I am concretely trying to convey pessimism about a specific sort of less-formal work: work which tries to block plausibility stories. Possibly you disagree about this kind of work.

WRT your argument for informal work, well, I agree in principle (trying to push toward more formal work myself has so far revealed challenges which I think more informal conceptual work could help with), but I’m nonetheless optimistic at the moment that we can define formal problems which won’t be a waste of time to work on. And out of informal work, what seems most interesting is whatever pushes toward formality.
- Richard_Ngo 18 May 2021 11:57 UTC
  LW: 4 AF: 2
  “Mesa-optimizers are in the search space and would achieve high scores in the training set, so why wouldn’t we expect to see them?”
  I like this as a statement of the core concern (modulo some worries about the concept of mesa-optimisation, which I’ll save for another time).
  “With respect to formalization, I did say up front that less-formal work, and empirical work, is still valuable.”
  I missed this disclaimer, sorry. So that assuages some of my concerns about balancing types of work. I’m still not sure what intuitions or arguments underlie your optimism about formal work, though. I assume that this would be fairly time-consuming to spell out in detail—but given that the core point of this post is to encourage such work, it seems worth at least gesturing towards those intuitions, so that it’s easier to tell where any disagreement lies.
  - abramdemski 18 May 2021 14:34 UTC
    LW: 4 AF: 3
    To me, the post as written seems like enough to spell out my optimism… there are multiple directions for formal work which seem under-explored to me. Well, I suppose I didn’t focus on explaining why things seem under-explored. Hopefully the writeup-to-come will make that clear.