This seems like a plausible argument that we’re unlikely to be stuck with a large gap between AI systems’ capabilities and their supervisors’ capabilities; I’m not currently clear on what the counter-argument is.
I guess your assumption is that a Level-10 AI is totally capable of supervising a Level-10 AI, and mostly capable of supervising a Level-10.1 AI, so we give it more time / clones / human help / whatever and then it can supervise a Level-10.1 AI, and we can work our way up.
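(Purely to make sure I'm reading you right, here's a toy Python sketch of the ladder I think you have in mind. Everything in it — the `AI` class, the `amplify` boost, the `can_supervise` check, the 0.1 step — is a made-up stand-in for illustration, not anything from the post.)

```python
# Toy sketch of the "work our way up" reading: each slightly-more-capable AI
# is checked by an amplified version of its predecessor. All numbers and
# interfaces here are illustrative stand-ins.

from dataclasses import dataclass


@dataclass
class AI:
    level: float


def amplify(ai: AI, boost: float = 0.2) -> AI:
    """More time / clones / human help: treat the supervisor as a bit stronger."""
    return AI(level=ai.level + boost)


def can_supervise(supervisor: AI, candidate: AI) -> bool:
    """Assume supervision works only when the supervisor isn't outclassed."""
    return supervisor.level >= candidate.level


def work_our_way_up(start: AI, target_level: float, step: float = 0.1) -> AI:
    """Train slightly more capable AIs, each checked by an amplified predecessor."""
    current = start
    while current.level < target_level:
        candidate = AI(level=current.level + step)
        if not can_supervise(amplify(current), candidate):
            raise RuntimeError(f"supervision gap opens at level {candidate.level:.1f}")
        current = candidate  # the ladder advances one rung
    return current


print(work_our_way_up(AI(level=10.0), target_level=11.0).level)  # climbs from 10.0 to roughly 11.0
```

On that reading, the whole scheme only works if the amplification at each rung really does cover the capability gap to the next rung — which is exactly the premise I'm doubting below.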
My concern would be that a Level-10 AI cannot adequately supervise even a much dumber Level-5 AI, let alone another Level-10 AI, because supervising is hard.
I don’t think I could supervise a clone of myself, if supervision requires looking at a billion-node unlabeled graph of the latent concepts in my clone’s head and figuring out whether it’s trying to do things for the right reasons versus the wrong reasons. I’m not sure that even a team of arbitrarily many copies of myself could make sense of that billion-node unlabeled graph.
Maybe you have a different idea of what supervision entails?
(Sorry for the long delay here!) The post articulates a number of specific ways in which some AIs can help to supervise others (e.g., patching security holes, generating inputs for adversarial training, finding scary inputs/training processes for threat assessment), and these don't seem to rely on the idea that an AI can automatically and fully understand the internals/arguments/motivations/situation of another AI that's sufficiently close to it in capabilities. The claim is not that any single supervisory arrangement of that type wipes out all risks, but that enough investment in AI checks and balances can significantly reduce them.
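To make one of those patterns a bit more concrete, here's a toy sketch of the "one AI generates adversarial inputs for another" arrangement. All of the interfaces (`adversarial_sweep`, `generate_probe`, `looks_unsafe`) are hypothetical stand-ins I made up for illustration — not an API from the post or from any actual library — and the point is just that the supervisor never needs to read the other model's internals, only probe its behavior.

```python
# Toy sketch: a "red team" model proposes candidate inputs, the supervised
# model responds, and a cheap check flags responses for further review.
# All interfaces here are hypothetical stand-ins.

import random
from typing import Callable, List


def adversarial_sweep(
    policy: Callable[[str], str],
    generate_probe: Callable[[random.Random], str],
    looks_unsafe: Callable[[str, str], bool],
    n_probes: int = 1000,
    seed: int = 0,
) -> List[str]:
    """Return the probes whose responses tripped the unsafe-output check."""
    rng = random.Random(seed)
    flagged = []
    for _ in range(n_probes):
        probe = generate_probe(rng)        # supervising AI proposes a scary input
        response = policy(probe)           # supervised AI responds
        if looks_unsafe(probe, response):  # cheap behavioral check; hits go to review
            flagged.append(probe)
    return flagged


# Trivial stand-ins, just to show the shape of the loop:
hits = adversarial_sweep(
    policy=lambda p: p.upper(),
    generate_probe=lambda rng: f"probe-{rng.randint(0, 9)}",
    looks_unsafe=lambda p, r: r.endswith("7"),
)
print(len(hits))
```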
I also think it’s possible that interpretability research is going to make it easier over time to interpret an AI’s internals—the growth in “difficulty of interpretation” with model size/capability won’t necessarily outpace the growth in tools for interpretability. (I think this could go either way.)