Yup, this isn’t robust to extremely capable systems; it’s a quantitative shift in how promising it looks to the agent to learn about external affairs, not a qualitative one.
(In the example of the agent doing engineering in a sandbox that contains no humans or general-purpose computing devices, there could be a strong internal gradient toward learning obvious details about the things immediately outside its sandbox, and a weaker gradient toward learning more distant or subtle things before it knows the nearby obvious ones.)
A whitelisting variant would be way more reliable than a blacklisting one, clearly.
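To make the asymmetry concrete, here's a minimal sketch of the two policies as filters on an agent's queries. Everything in it is hypothetical illustration (the `topic` feature and the specific topic names aren't from anything above); the point is just that a whitelist fails closed on anything unanticipated, while a blacklist fails open.

```python
# Hypothetical illustration: `topic` stands in for whatever feature of an
# agent's query the restriction mechanism can actually inspect.

ALLOWED_TOPICS = {"sandbox_physics", "assigned_task"}      # whitelist
FORBIDDEN_TOPICS = {"human_psychology", "network_access"}  # blacklist


def whitelist_permits(topic: str) -> bool:
    """Deny by default: only pre-approved topics pass."""
    return topic in ALLOWED_TOPICS


def blacklist_permits(topic: str) -> bool:
    """Allow by default: only explicitly forbidden topics are blocked."""
    return topic not in FORBIDDEN_TOPICS


if __name__ == "__main__":
    # A topic nobody anticipated slips past the blacklist but not the whitelist.
    novel = "hardware_side_channels"
    print(whitelist_permits(novel))  # False: never pre-approved
    print(blacklist_permits(novel))  # True: never anticipated, so never banned
```

The reliability gap is exactly the behavior on `novel`: the blacklist's designers would have had to foresee every dangerous topic in advance, whereas the whitelist only requires them to enumerate the safe ones.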