Wei Dai comments on Can corrigibility be learned safely?

Wei Dai 5 May 2018 23:43 UTC
LW: 6 AF: 3
There is some mechanism the RL agent uses, which doesn’t rest on scientific research. IDA should use the same mechanism.
How does IDA find such a mechanism, if not by scientific research? RL does it by searching for weights that do well empirically. William and I were wondering whether that idea could be adapted to IDA, but you said “Searching for trees that do well empirically is scary business, since now you have all the normal problems with ML.” (I had interpreted you to mean that we should avoid doing that. Did you actually mean that we should try to figure out a safe way to do it?)
- paulfchristiano 6 May 2018 0:27 UTC
  LW: 5 AF: 3
  I think you need to do some trial and error, and I was saying we should be scared of it (/ be careful about it / minimize it, though it’s subtle why minimization might help).
  For example, suppose that I put a random 20-gate circuit in a black box and let you observe its input-output behavior. At some point you don’t have any options other than to guess and check, and no amount of cleverness about alignment could possibly avoid the need to sometimes use brute force.
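
The black-box example can be sketched in code. This is only an illustration under assumptions that are not in the comment: every gate is a NAND, each gate reads any two earlier wires, the last gate is the output, and the circuit is shrunk to 3 gates over 3 inputs so that exhaustive search finishes quickly. All function names (`make_circuit`, `run_circuit`, `truth_table`, `guess_and_check`) are hypothetical.

```python
import itertools
import random


def make_circuit(n_inputs, n_gates, rng):
    """Random circuit: gate i is a NAND of two earlier wires (assumed gate type)."""
    return [
        (rng.randrange(n_inputs + i), rng.randrange(n_inputs + i))
        for i in range(n_gates)
    ]


def run_circuit(circuit, bits):
    """Evaluate the circuit on a tuple of input bits; the last gate is the output."""
    wires = list(bits)
    for a, b in circuit:
        wires.append(1 - (wires[a] & wires[b]))  # NAND
    return wires[-1]


def truth_table(f, n_inputs):
    """Observe the input-output behavior of f on every possible input."""
    return tuple(f(bits) for bits in itertools.product((0, 1), repeat=n_inputs))


def guess_and_check(black_box, n_inputs, n_gates):
    """Enumerate candidate circuits until one reproduces the observed behavior."""
    target = truth_table(black_box, n_inputs)
    # Gate i may read any pair of the n_inputs + i wires that exist before it.
    choices = [
        itertools.product(range(n_inputs + i), repeat=2) for i in range(n_gates)
    ]
    for guesses, candidate in enumerate(itertools.product(*choices), start=1):
        observed = truth_table(lambda bits: run_circuit(candidate, bits), n_inputs)
        if observed == target:
            return list(candidate), guesses
    raise ValueError("no circuit of this size matches the observations")


rng = random.Random(0)
hidden = make_circuit(3, 3, rng)            # the contents of the black box
black_box = lambda bits: run_circuit(hidden, bits)
found, guesses = guess_and_check(black_box, 3, 3)
print(found, guesses)
```

With these toy sizes there are only 3² · 4² · 5² = 3600 candidate wirings, so the search is cheap. The same enumeration over 20 gates would be vastly larger, and the search only recovers some circuit that matches the observed behavior, not necessarily the hidden one.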