Peter Hroššo comments on Gradient hacking

Peter Hroššo 16 Oct 2019 10:14 UTC
LW: 6 AF: 3
“If your model is deceptive, though, then it might know all of that”
Could you please describe your intuition for how the model could know that the meta-optimizer is going to perform checks for deceptive behavior?
- evhub 16 Oct 2019 18:16 UTC
  LW: 11 AF: 6
  Sure. My best guess for how we will train AGI is by giving it access to huge amounts of data, e.g. every outbound Reddit link, as in GPT-2. Given that sort of massive, rich dataset, I think it won’t be very difficult for a model to determine the particular training procedure that it’s undergoing. If it just reads the Wikipedia pages on ML, gradient descent, etc., and some arXiv papers on transparency techniques, for example, then that should give it everything it needs to know.
- Gurkenglas 16 Oct 2019 16:04 UTC
  6 points
  Because that’s what it would do if it were us. Kerckhoffs’s principle (https://en.m.wikipedia.org/wiki/Kerckhoffs’s_principle) says we should not assume our security mechanisms are secret. One exception would be if you could define three indistinguishable contexts that the AI might find itself in and then select randomly between them; in that case you may assume that the AI also doesn’t know which context it is in.