That means the problem is inherently unsolvable by iteration. “See what goes wrong and fix it” auto-fails if The Client cannot tell that anything is wrong.
Not at all meant to be a general solution to this problem, but I think one specific case where we could turn this into something iterable is by using historical examples of scientific breakthroughs: take a past breakthrough on a problem whose solution is (in hindsight) overdetermined, train the AI on data filtered by date, and have The Client evaluate the AI solely on how closely it approaches that overdetermined answer.
As a specific example: imagine feeding the AI the historical context that led up to the development of information theory, and checking whether it converges on something isomorphic to what Shannon found (training with an information cutoff, of course). Information theory surely seems like The Overdetermined Solution for the sorts of problems that motivated it, so the job of the client/evaluator becomes much easier.
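To make the shape of this concrete, here is a minimal sketch of what such an evaluation loop might look like. Everything here is hypothetical (`filter_by_date`, `similarity_to_reference`, the Shannon corpus names), and the token-overlap score is only a stand-in for the genuinely hard step of judging whether the AI's proposal is isomorphic to the known solution:

```python
from datetime import date

def filter_by_date(corpus, cutoff):
    """Keep only documents written strictly before the information cutoff."""
    return [doc for doc in corpus if doc["date"] < cutoff]

def similarity_to_reference(candidate, reference):
    """Placeholder score: token overlap with the known (overdetermined) solution.
    In practice this step would need real expert judgment, not string matching."""
    cand_tokens = set(candidate.lower().split())
    ref_tokens = set(reference.lower().split())
    union = cand_tokens | ref_tokens
    return len(cand_tokens & ref_tokens) / len(union) if union else 0.0

def evaluate_breakthrough_task(ai_solve, corpus, cutoff, reference_solution):
    """Give the AI only pre-cutoff data, then score its proposal against the
    solution we know (in hindsight) was overdetermined."""
    training_data = filter_by_date(corpus, cutoff)
    proposal = ai_solve(training_data)  # the AI's attempt at e.g. "information theory"
    return similarity_to_reference(proposal, reference_solution)

# Hypothetical usage for the Shannon example:
# score = evaluate_breakthrough_task(my_model, pre_1948_corpus,
#                                    date(1948, 1, 1), shannon_1948_summary)
```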
Of course this is probably still too difficult in practice (e.g. not enough high-quality historical data on breakthroughs, evaluation and data curation still demanding great expertise, the hope of "… and now our AI should generalize to genuinely novel problems!" not cashing out, the scope of this specific example being too limited, etc.).
But the situation in this specific case sounds somewhat better than the one laid out in this post, namely The Client themselves needing the expertise to evaluate supposed Alignment breakthroughs without the benefit of hindsight, while operating on completely novel intellectual territory.