paulfchristiano comments on Can corrigibility be learned safely?

paulfchristiano 5 May 2018 19:20 UTC
LW: 4 AF: 2
There is some mechanism the RL agent uses, which doesn’t rest on scientific research. IDA should use the same mechanism.
This may sometimes involve “heuristic X works well empirically, but has no detectable internal structure.” In those cases IDA needs to be able to come up with a safe version of that procedure (i.e. a version that wouldn’t leave us at a disadvantage relative to people who just want to maximize complexity or whatever). I think the main obstacle to safety is if heuristic X itself involves consequentialism. But in that case there seems to necessarily be some internal structure. (This is the kind of thing that I have been mostly thinking about recently.)
- Wei Dai 5 May 2018 23:43 UTC
  LW: 6 AF: 3
  “There is some mechanism the RL agent uses, which doesn’t rest on scientific research. IDA should use the same mechanism.”
  How does IDA find such a mechanism, if not by scientific research? RL does it by searching for weights that do well empirically, and William and I were wondering if that idea could be adapted to IDA, but you said “Searching for trees that do well empirically is scary business, since now you have all the normal problems with ML.” (I had interpreted you to mean that we should avoid doing that. Did you actually mean that we should try to figure out a safe way to do it?)
  - paulfchristiano 6 May 2018 0:27 UTC
    LW: 5 AF: 3
    I think you need to do some trial and error, and I was saying we should be scared of it ( / be careful about it / minimize it, though it’s subtle why minimization might help).
    For example, suppose that I put a random 20-gate circuit in a black box and let you observe its input-output behavior. At some point you don’t have any options other than guess-and-check, and no amount of cleverness about alignment could possibly avoid the need to sometimes use brute force.
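The black-box circuit example can be made concrete with a small sketch. Everything specific here is an assumption for illustration and not part of the original comment: the gates are all NAND, each gate reads two earlier wires, the last gate is the output, and the function names (`make_random_circuit`, `guess_and_check`, and so on) are hypothetical. The demo uses 3 inputs and 3 gates so that exhaustive search finishes quickly; the same search over 20 gates would be far too large to enumerate, which is roughly the point of the example.

```python
import itertools
import random


def make_random_circuit(n_inputs, n_gates, rng):
    """Build a random circuit; gate g is a NAND of two earlier wires (assumed gate type)."""
    circuit = []
    for g in range(n_gates):
        n_wires = n_inputs + g  # inputs plus outputs of earlier gates
        circuit.append((rng.randrange(n_wires), rng.randrange(n_wires)))
    return circuit


def evaluate(circuit, inputs):
    """Evaluate the circuit on a tuple of 0/1 inputs; the last gate is the output."""
    wires = list(inputs)
    for a, b in circuit:
        wires.append(1 - (wires[a] & wires[b]))  # NAND
    return wires[-1]


def search_space_size(n_inputs, n_gates):
    """Number of candidate circuits under this encoding."""
    size = 1
    for g in range(n_gates):
        size *= (n_inputs + g) ** 2
    return size


def all_circuits(n_inputs, n_gates):
    """Enumerate every circuit with the given shape."""
    choices = [
        list(itertools.product(range(n_inputs + g), repeat=2))
        for g in range(n_gates)
    ]
    return itertools.product(*choices)


def guess_and_check(black_box, n_inputs, n_gates):
    """Brute force: query the box on all inputs, then try candidates until one matches."""
    observations = [
        (x, black_box(x)) for x in itertools.product((0, 1), repeat=n_inputs)
    ]
    tried = 0
    for candidate in all_circuits(n_inputs, n_gates):
        tried += 1
        if all(evaluate(candidate, x) == y for x, y in observations):
            return list(candidate), tried
    return None, tried


# Usage: hide a small random circuit, then recover an equivalent one by search.
rng = random.Random(0)
hidden = make_random_circuit(3, 3, rng)
found, tried = guess_and_check(lambda x: evaluate(hidden, x), 3, 3)
print("candidates tried:", tried, "of", search_space_size(3, 3))
print("search space for 20 gates on 3 inputs:", search_space_size(3, 20))
```

Note that the search only uses the observed input-output pairs and never inspects any internal structure of the hidden circuit. The circuit it returns matches the black box on every input but need not be the same circuit that was hidden.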