Seems interesting, but the adversary seems to need a very specific definition of what's outside the domain. Absent that, this just becomes a patch, or a nearest unblocked strategy: the solution will be the one that's best in the domain and doesn't trigger the specific outside-domain adversary.
I agree… if there are specific things you don't want the system to be able to do / predict, then you can do something very similar to the cited "Censoring Representations" paper.
But if you want to censor all “out-of-domain” knowledge, I don’t see a good way of doing it.
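For concreteness, here's a minimal sketch of that blacklist-style setup. This is my sketch in the spirit of the paper, not its actual code: PyTorch is an assumption, and `encoder`, `task_head`, `adversary`, and the censored label `s` are hypothetical placeholders.

```python
# Adversarial censoring of a specific attribute, in the spirit of
# Edwards & Storkey, "Censoring Representations with an Adversary".
# x: inputs, y: in-domain task labels, s: the specific attribute
# you want the representation to carry no information about.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))
task_head = nn.Linear(16, 2)   # the in-domain task you do want
adversary = nn.Linear(16, 2)   # tries to recover the censored attribute s

opt_main = torch.optim.Adam(
    [*encoder.parameters(), *task_head.parameters()], lr=1e-3)
opt_adv = torch.optim.Adam(adversary.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

def train_step(x, y, s, lam=1.0):
    # 1) Train the adversary to predict s from a frozen representation.
    z = encoder(x).detach()
    adv_loss = ce(adversary(z), s)
    opt_adv.zero_grad(); adv_loss.backward(); opt_adv.step()

    # 2) Train encoder + task head: do well on the task while making
    #    the adversary's job hard (subtract its loss). Gradients that
    #    land on the adversary here are never stepped by opt_main and
    #    get cleared by opt_adv.zero_grad() on the next call.
    z = encoder(x)
    loss = ce(task_head(z), y) - lam * ce(adversary(z), s)
    opt_main.zero_grad(); loss.backward(); opt_main.step()
```

Note that this only bites on the specific `s` you named, which is exactly the worry above: there's no `s` you can hand the adversary for "everything out of domain".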
Yup, this isn’t robust to extremely capable systems; it’s a quantitative shift in how promising it looks to the agent to learn about external affairs, not a qualitative one.
(In the example with the agent doing engineering in a sandbox that doesn't include humans or general computing devices, there could be a strong internal gradient to learn obvious details about the things immediately outside its sandbox, and a weaker gradient toward learning more distant or subtle things before it has learned the nearby obvious ones.)
A whitelisting variant would be way more reliable than a blacklisting one, clearly.
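One hedged way to make "whitelisting" concrete (my interpretation, not the commenter's proposal): rather than subtracting an adversary's loss on a named blacklisted attribute, charge the representation for every bit it retains beyond what the whitelisted task needs, e.g. with a variational information bottleneck (Alemi et al., "Deep Variational Information Bottleneck"). All names below are hypothetical.

```python
# Whitelist-flavored alternative: an information-bottleneck penalty
# compresses away everything not needed for the whitelisted task,
# instead of only penalizing one named attribute.
import torch
import torch.nn as nn

enc = nn.Linear(32, 2 * 16)   # outputs mean and log-variance of z
task_head = nn.Linear(16, 2)  # the single whitelisted task
opt = torch.optim.Adam([*enc.parameters(), *task_head.parameters()], lr=1e-3)
ce = nn.CrossEntropyLoss()

def vib_step(x, y, beta=1e-2):
    mu, logvar = enc(x).chunk(2, dim=-1)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
    # KL(q(z|x) || N(0, I)): a price on every bit kept in z, so only
    # information useful for the whitelisted task survives training.
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1 - logvar).sum(-1).mean()
    loss = ce(task_head(z), y) + beta * kl
    opt.zero_grad(); loss.backward(); opt.step()
```

The qualitative difference is the direction of the default: the blacklist version only penalizes what its specific adversary can detect, while this charges for everything retained by default, which is roughly why whitelisting looks more reliable.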