Gurkenglas comments on Gradient hacking

Gurkenglas 16 Oct 2019 16:04 UTC
6 points
Because that’s what it would do if it were us. https://en.m.wikipedia.org/wiki/Kerckhoffs’s_principle says we should not assume our security mechanisms are secret. One exemption would be if you could define three indistinguishable contexts that the AI might find itself in, then select randomly between them—then you may assume that the AI also doesn’t know.