I’m not sure what your intuitive model is or how it differs from mine, but one possibility is that you’re picturing a sort of bureaucracy in which many agents simultaneously supervise each other (A supervises B, who supervises C, who supervises D, …). I’m picturing something more like this: we train B while making extensive use of A for accurate supervision, adversarial training, threat assessment, etc. (perhaps allocating resources so that there is a lot more of A than B, and generally a lot of redundancy and robustness in our alignment efforts and threat assessment), and we try to get to the point where we trust B before doing a similar thing with C. I still don’t think this is a great idea to do too many times; I’d hope that at some point we get alignment techniques that scale more cleanly.
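To make the contrast concrete, here is a very rough sketch of the sequential loop I have in mind, as opposed to the simultaneous bureaucracy. Everything in it (train_with_overseer, assess_threats, TRUST_THRESHOLD, the trust numbers) is a made-up placeholder standing in for a much messier real process, not real tooling:

```python
# Illustrative sketch of sequential bootstrapping: each already-trusted model
# generation is used heavily (supervision, adversarial training, threat
# assessment) to train the next one, and we only move on once we trust it.
# All names and numbers below are hypothetical placeholders, not a real API.

from dataclasses import dataclass

TRUST_THRESHOLD = 0.99  # hypothetical bar for "we trust this model enough"
MAX_GENERATIONS = 4     # we would not want to repeat this too many times


@dataclass
class Model:
    name: str
    trust: float  # stand-in for our (much messier) real-world confidence


def train_with_overseer(overseer: Model, name: str) -> Model:
    """Train a new model using lots of the overseer's labor for accurate
    supervision, adversarial training, and threat assessment."""
    # Placeholder: pretend heavy oversight yields a slightly-less-trusted model.
    return Model(name=name, trust=overseer.trust * 0.995)


def assess_threats(overseer: Model, candidate: Model) -> float:
    """Redundant evaluation of the candidate, again using the overseer."""
    # Placeholder: in reality this is the hard part, not a single number.
    return candidate.trust


def bootstrap(initial: Model) -> list[Model]:
    trusted = [initial]
    for gen in range(1, MAX_GENERATIONS + 1):
        overseer = trusted[-1]  # most capable model we already trust
        candidate = train_with_overseer(overseer, name=f"model_{gen}")
        if assess_threats(overseer, candidate) < TRUST_THRESHOLD:
            break  # stop rather than hand oversight to an untrusted model
        trusted.append(candidate)  # only now does it become the next overseer
    return trusted


if __name__ == "__main__":
    lineage = bootstrap(Model(name="model_0", trust=1.0))
    print(" -> ".join(m.name for m in lineage))
```

The point of the loop structure is that oversight flows from one trusted generation to the next candidate, rather than every generation supervising every other at once, and that the process halts once trust can no longer be established.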
This was very helpful, thank you! You were correct about how my intuitions differed from your plan. This does seem more likely to work than the scheme I was imagining.