Seems interesting, but the adversary seems to need a very specific definition of what's outside the domain. Absent that, this just becomes a patch, or a nearest unblocked strategy: the solution will be the one that's best in the domain and doesn't trigger the specific outside-domain adversary.
I agree… if there are specific things you don't want the system to be able to do / predict, then you can do something very similar to the cited "Censoring Representations" paper.
But if you want to censor all “out-of-domain” knowledge, I don’t see a good way of doing it.
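For concreteness, here's a minimal sketch of that blacklist-style setup. This is my sketch in the spirit of the paper, not its actual code: PyTorch is an assumption, and `encoder`, `task_head`, `adversary`, and the censored label `s` are hypothetical placeholders.

```python
# Adversarial censoring of a specific attribute, in the spirit of
# Edwards & Storkey, "Censoring Representations with an Adversary".
# x: inputs, y: in-domain task labels, s: the specific attribute
# you want the representation to carry no information about.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))
task_head = nn.Linear(16, 2)   # the in-domain task you do want
adversary = nn.Linear(16, 2)   # tries to recover the censored attribute s

opt_main = torch.optim.Adam(
    [*encoder.parameters(), *task_head.parameters()], lr=1e-3)
opt_adv = torch.optim.Adam(adversary.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

def train_step(x, y, s, lam=1.0):
    # 1) Train the adversary to predict s from a frozen representation.
    z = encoder(x).detach()
    adv_loss = ce(adversary(z), s)
    opt_adv.zero_grad(); adv_loss.backward(); opt_adv.step()

    # 2) Train encoder + task head: do well on the task while making
    #    the adversary's job hard (subtract its loss). Gradients that
    #    land on the adversary here are never stepped by opt_main and
    #    get cleared by opt_adv.zero_grad() on the next call.
    z = encoder(x)
    loss = ce(task_head(z), y) - lam * ce(adversary(z), s)
    opt_main.zero_grad(); loss.backward(); opt_main.step()
```

Note that this only bites on the specific `s` you named, which is exactly the worry above: there's no `s` you can hand the adversary for "everything out of domain".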
Yup, this isn’t robust to extremely capable systems; it’s a quantitative shift in how promising it looks to the agent to learn about external affairs, not a qualitative one.
(In the example with the agent doing engineering in a sandbox that doesn't include humans or general computing devices, there could be a strong internal gradient to learn obvious details about the things immediately outside its sandbox, and a weaker gradient toward learning more distant or subtle things before it has learned the nearby obvious ones.)
A whitelisting variant would be way more reliable than a blacklisting one, clearly.
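One hedged way to make "whitelisting" concrete (my interpretation, not the commenter's proposal): rather than subtracting an adversary's loss on a named blacklisted attribute, charge the representation for every bit it retains beyond what the whitelisted task needs, e.g. with a variational information bottleneck (Alemi et al., "Deep Variational Information Bottleneck"). All names below are hypothetical.

```python
# Whitelist-flavored alternative: an information-bottleneck penalty
# compresses away everything not needed for the whitelisted task,
# instead of only penalizing one named attribute.
import torch
import torch.nn as nn

enc = nn.Linear(32, 2 * 16)   # outputs mean and log-variance of z
task_head = nn.Linear(16, 2)  # the single whitelisted task
opt = torch.optim.Adam([*enc.parameters(), *task_head.parameters()], lr=1e-3)
ce = nn.CrossEntropyLoss()

def vib_step(x, y, beta=1e-2):
    mu, logvar = enc(x).chunk(2, dim=-1)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
    # KL(q(z|x) || N(0, I)): a price on every bit kept in z, so only
    # information useful for the whitelisted task survives training.
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1 - logvar).sum(-1).mean()
    loss = ce(task_head(z), y) + beta * kl
    opt.zero_grad(); loss.backward(); opt.step()
```

The qualitative difference is the direction of the default: the blacklist version only penalizes what its specific adversary can detect, while this charges for everything retained by default, which is roughly why whitelisting looks more reliable.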