I like your section 2. As you are asking for feedback on your plans in section 3:
By default I plan to continue looking into the directions in section 3.1, namely transparency of current models and its (potential) intersection with developments in deep learning theory. [...]
Since this is what I plan to do, it’d be useful for me to know if it seems totally misguided
I see two ways to improve AI transparency in the face of opaque learned models:
1. Try to make the learned models less opaque; this is your direction.
2. Try to find ways to build more transparent systems that use potentially opaque learned models as building blocks. This is a research direction that your picture of a “human-like ML model” points to. Creating this type of transparency is also one of the main thoughts behind Drexler’s CAIS. You can also find this approach of ‘more aligned architectures built out of opaque learned models’ in my work, e.g. here. (A toy sketch of what I mean is below.)
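To make the second item a bit more concrete, here is a minimal, purely illustrative Python sketch of a transparent system that uses an opaque learned model as one building block. Everything in it (the TransparentController class, the scorer, the threshold) is a made-up toy, not something taken from your post, from CAIS, or from my own work; the point is only that the learned model is confined to a narrow scoring role, while the logic that turns its score into an action stays explicit and auditable.

```python
# Toy sketch only: a hand-written, auditable decision layer that uses an
# opaque learned model as one building block. All names are hypothetical.

from dataclasses import dataclass
from typing import Callable


@dataclass
class Decision:
    action: str
    reason: str  # human-readable justification, kept for the audit trail


class TransparentController:
    """Explicit decision logic wrapped around an opaque scorer.

    The learned model is only consulted for a narrow sub-task (scoring an
    observation); the rules that turn scores into actions are plain code
    that a human can read, audit, and change.
    """

    def __init__(self, opaque_scorer: Callable[[dict], float], threshold: float = 0.9):
        self.opaque_scorer = opaque_scorer  # e.g. a neural net, treated as a black box
        self.threshold = threshold

    def decide(self, observation: dict) -> Decision:
        score = self.opaque_scorer(observation)
        if score >= self.threshold:
            return Decision("approve", f"model confidence {score:.2f} >= {self.threshold}")
        # Anything the opaque part is not confident about gets kicked upstairs.
        return Decision("defer_to_human", f"model confidence {score:.2f} below {self.threshold}")


# Usage: any learned model can be plugged in as the scorer.
controller = TransparentController(opaque_scorer=lambda obs: 0.42)
print(controller.decide({"sensor_reading": 17}))
```

The particular rule used here does not matter; the point is that whoever audits the system only needs to trust the controller code and the way the scorer's output is used, not the internals of the learned model itself.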
Now, I am doing alignment research partly out of plain intellectual curiosity.
But an argument could be made that, if you want to be maximally effective in AI alignment and in minimising x-risk, you need to do either technical work to improve systems of type 2, or policy work aimed at banning systems which are completely opaque inside from being used in any type of high-impact application. Part of that argument would also be that mainstream ML research is already plenty interested in improving the transparency of current-generation neural nets, but has not really got there yet.