I’m sort of confused by the main point of that post. Is the idea that the robot can’t stack blocks because of a physical limitation? If so, it seems like this is addressed by the first initial objection. Is it rather that the model space might not have the capacity to correctly imitate the human? I’d be somewhat surprised if that were a big issue, and at any rate it seems like you could use the Wasserstein metric as a cost function and get a desirable outcome. I guess we’re instead imagining a problem where there’s no great metric (e.g. text answers to questions)?
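A minimal sketch of what that Wasserstein suggestion could look like for one-dimensional actions, assuming SciPy’s 1-D `wasserstein_distance` and made-up action samples (not anything from the post):

```python
# Minimal sketch (hypothetical setup): score an imitator by the 1-D
# Wasserstein distance between its action distribution and the human's,
# instead of requiring an exact match.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

human_actions = rng.normal(loc=1.0, scale=0.1, size=1000)  # demonstration actions
robot_actions = rng.normal(loc=0.8, scale=0.2, size=1000)  # imitator's actions

cost = wasserstein_distance(human_actions, robot_actions)  # lower is better
print(f"Wasserstein cost: {cost:.3f}")
```

For higher-dimensional action spaces you would need a general optimal-transport solver rather than the 1-D closed form.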
Is the idea that the robot can’t stack blocks because of a physical limitation?
There are lots of reasons that a robot might be unable to learn the correct policy despite the action space permitting it:
Not enough model capacity
Not enough training data
Training got stuck in a local optimum
Trained only on robot play data, so the model has never seen anything like the human policy before
Etc., etc.
Not all of these are compatible with “and so the robot does the thing that the human does 5% of the time”. But it seems like there can and probably will be factors that are different between the human and the robot (even if the human uses teleoperation), and in that setting imitating factored cognition provides the wrong incentives, while optimizing factored evaluation provides the right incentives.
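To illustrate that, here is a minimal sketch (hypothetical numbers, zero-capacity constant predictor) of a failure mode that produces behaviour unlike any demonstration, rather than the behaviour the human shows 5% of the time:

```python
# Minimal sketch (hypothetical setup): when a capacity-limited imitator
# can't represent a bimodal demonstration distribution, the MSE-optimal
# fit outputs the mean action -- something no demonstrator ever does.
import numpy as np

rng = np.random.default_rng(0)

# Demonstrations at the same state: swerve left (-1) 95% of the time,
# swerve right (+1) 5% of the time.
human_actions = rng.choice([-1.0, +1.0], size=1000, p=[0.95, 0.05])

# A constant (zero-capacity) imitator trained with squared error
# ends up predicting the sample mean.
imitated_action = human_actions.mean()  # ~ -0.9

print(imitated_action)  # neither -1 nor +1: an action no human took
```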
Is it rather that the model space might not have the capacity to correctly imitate the human?
I think AI will probably be good enough to pose a catastrophic risk before it can exactly imitate a human. (But as Wei Dai says elsewhere, if you do amplification then you will definitely get into the regime where you can’t imitate.)
Paul wrote in a parallel subthread, “Against mimicry is mostly motivated by the case of imitating an amplified agent.” In the case of IDA, you’re bound to run out of model capacity eventually as you keep iterating the amplification and distillation.
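A schematic sketch of that loop, with `decompose`, `recombine`, and `distill` as hypothetical stand-ins, just to show why the distillation target keeps outgrowing a fixed model class:

```python
# Schematic sketch of iterated distillation and amplification (IDA).
# `decompose`, `recombine`, and `distill` are hypothetical stand-ins,
# not a real implementation.

def amplify(model, decompose, recombine):
    """Answer a question by splitting it up and delegating the pieces to `model`."""
    def amplified(question):
        subquestions = decompose(question)
        subanswers = [model(q) for q in subquestions]
        return recombine(question, subanswers)
    return amplified

def ida(initial_model, distill, decompose, recombine, rounds):
    model = initial_model
    for _ in range(rounds):
        target = amplify(model, decompose, recombine)  # stronger than `model` alone
        model = distill(target)  # fixed-capacity imitator of an ever-stronger target
    return model
```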
I guess we’re instead imagining a problem where there’s no great metric (e.g. text answers to questions)?
I’m pretty sure that’s it.