I know the contest is over, but this idea for a low-bandwidth oracle might be useful anyhow: given a purported FAI design, ask the oracle what the most serious flaw is. It answers by highlighting lines from the FAI design description and, given a huge corpus of computer science papers, LW/AF posts, etc., relevant paragraphs from those as well (perhaps under a constraint like “3 or fewer paragraphs highlighted in their entirety”) that, taken together, come closest to pinpointing the issue. We could even give it a categorization scheme for safety problems that we came up with, and it could tell us which category this particular problem comes closest to falling under. Or we could offer it a set of categories a particular hint could fall under, such as “this is just an analogy”, “keep thinking along these lines”, etc. Then do the same again, this time asking it to highlight text which leads to a promising solution. The rationale is that unforeseen difficulties are the hardest part of alignment, but if there’s a flaw, it will probably be somehow analogous to a problem we’ve seen in the past, or will be addressable using methods which have worked in the past, or something. But it’s hard to fit “everything we’ve seen in the past” into one human head.
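To make “low-bandwidth” a bit more concrete, here is a rough Python sketch of the answer format and of how many bits one answer can carry. Everything named here (`OracleAnswer`, `is_valid`, `answer_bits`, the fixed category tuple) is made up for illustration, and it assumes the answer is just a set of at most 3 whole paragraphs plus one category label; a real setup would need more care.

```python
from dataclasses import dataclass
from math import comb, log2

# Assumption: only the two example hint categories quoted above; a real scheme would be longer.
HINT_CATEGORIES = ("this is just an analogy", "keep thinking along these lines")
# Assumption: the "3 or fewer paragraphs highlighted in their entirety" constraint.
MAX_HIGHLIGHTS = 3


@dataclass(frozen=True)
class OracleAnswer:
    """One answer: which paragraphs are highlighted, plus a single category label."""
    highlighted: frozenset  # indices into the design description + corpus paragraphs
    category: str


def is_valid(answer, num_paragraphs):
    """Reject anything outside the allowed answer space before a human sees it."""
    return (
        len(answer.highlighted) <= MAX_HIGHLIGHTS
        and all(0 <= i < num_paragraphs for i in answer.highlighted)
        and answer.category in HINT_CATEGORIES
    )


def answer_bits(num_paragraphs, num_categories=len(HINT_CATEGORIES)):
    """Upper bound on the information in one answer, in bits."""
    # Count every subset of paragraphs of size 0..MAX_HIGHLIGHTS.
    subsets = sum(comb(num_paragraphs, k) for k in range(MAX_HIGHLIGHTS + 1))
    return log2(subsets * num_categories)
```

Under these assumptions, even a corpus of a million paragraphs gives `answer_bits(10**6)` of a little under 60 bits per answer, which is the sense in which the channel stays narrow even though the corpus is huge.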