Could you please describe your intuition behind how the model could know the meta-optimizer is going to perform checks on deceptive behavior?
Sure. My best guess for how we will train AGI is by giving it access to huge amounts of data, e.g. every outbound Reddit link, as with GPT-2. Given that sort of massive, rich dataset, I don’t think it will be very difficult for a model to determine the particular training procedure it’s undergoing. If it just reads the Wikipedia pages on ML, gradient descent, etc. and some arXiv papers on transparency techniques, for example, then that should give it everything it needs to know.
Because that’s what we would do in its position. Kerckhoffs’s principle (https://en.m.wikipedia.org/wiki/Kerckhoffs’s_principle) says we should not assume our security mechanisms are secret. One exception would be if you could define three indistinguishable contexts that the AI might find itself in and then select randomly between them; in that case you may assume that the AI also doesn’t know which one it is in.