This response is rather late, but basically my hope is that it’s possible to optimise for understandability by regularising for some relatively simple quantity that induces understandability.
Perhaps if an AGI is built out of modules that are separately trained, instead of being trained end-to-end, you could use this idea on some of the smaller modules that are especially important to safety. I’m curious if that’s the kind of plan you have in mind, or if you’re more ambitious about this approach.
I’m more ambitious, and fear that that approach might not work: either you train a bunch of ‘small’ modules that do very concrete tasks and aren’t quite sure how to combine them into an AGI (or you have to combine a huge number of them and hope that errors don’t cascade), or you train a few large modules that do big, complicated tasks and are themselves hard to interpret. That being said, the first branch would satisfy my desiderata for the approach, and I’d hope some people are working on it.