I’m not sure what your intuitive model is or how it differs from mine, but one possibility is that you’re picturing a sort of bureaucracy in which many agents simultaneously supervise each other (A supervises B, who supervises C, who supervises D, …). I’m picturing something more like this: we train B while making extensive use of A for accurate supervision, adversarial training, threat assessment, etc. (perhaps allocating resources so that there is a lot more of A than B, and generally a lot of redundancy and robustness in our alignment efforts and threat assessment), and we try to get to the point where we trust B before doing a similar thing with C. I still don’t think this is a great idea to do too many times; I’d hope that at some point we get alignment techniques that scale more cleanly.
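To make the contrast concrete, here is a very rough sketch of the sequential loop I have in mind, as opposed to the simultaneous bureaucracy. Everything in it (train_with_overseer, assess_threats, TRUST_THRESHOLD, the trust numbers) is a made-up placeholder standing in for a much messier real process, not real tooling:

```python
# Illustrative sketch of sequential bootstrapping: each already-trusted model
# generation is used heavily (supervision, adversarial training, threat
# assessment) to train the next one, and we only move on once we trust it.
# All names and numbers below are hypothetical placeholders, not a real API.

from dataclasses import dataclass

TRUST_THRESHOLD = 0.99  # hypothetical bar for "we trust this model enough"
MAX_GENERATIONS = 4     # we would not want to repeat this too many times


@dataclass
class Model:
    name: str
    trust: float  # stand-in for our (much messier) real-world confidence


def train_with_overseer(overseer: Model, name: str) -> Model:
    """Train a new model using lots of the overseer's labor for accurate
    supervision, adversarial training, and threat assessment."""
    # Placeholder: pretend heavy oversight yields a slightly-less-trusted model.
    return Model(name=name, trust=overseer.trust * 0.995)


def assess_threats(overseer: Model, candidate: Model) -> float:
    """Redundant evaluation of the candidate, again using the overseer."""
    # Placeholder: in reality this is the hard part, not a single number.
    return candidate.trust


def bootstrap(initial: Model) -> list[Model]:
    trusted = [initial]
    for gen in range(1, MAX_GENERATIONS + 1):
        overseer = trusted[-1]  # most capable model we already trust
        candidate = train_with_overseer(overseer, name=f"model_{gen}")
        if assess_threats(overseer, candidate) < TRUST_THRESHOLD:
            break  # stop rather than hand oversight to an untrusted model
        trusted.append(candidate)  # only now does it become the next overseer
    return trusted


if __name__ == "__main__":
    lineage = bootstrap(Model(name="model_0", trust=1.0))
    print(" -> ".join(m.name for m in lineage))
```

The point of the loop structure is that oversight flows from one trusted generation to the next candidate, rather than every generation supervising every other at once, and that the process halts once trust can no longer be established.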
This was very helpful, thank you! You were correct about how my intuitions differed from your plan. This does seem more likely to work than the scheme I was imagining.