Even the last version might have odd incentives. If A knew that the chances were high enough that a genuinely original book would be flagged as rare plagiarism of some book unknown to A, the dominant strategy could be to instead commit the most blatant plagiarism possible, in order to minimize a penalty that cannot be reliably avoided.
This ties in with the question of whether we can distinguish whether A committed a forbidden action intentionally or unintentionally. If the advice class R only contains information about things external to A, then this method has no way to distinguish between the two, and we should forbid anything that could be intentional bad behaviour. However, we might be able to include information about A’s internal state in the advice. For example, the advice could be a pair (book, location in A’s memory trace), with F(a,r) true only if that location in A’s memory trace indicates that A plagiarized the particular book (of course, you need to be confident that you would be able to spot such a thing in A’s memory trace).
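A hedged sketch of what such an internal-state-aware F(a,r) might look like, not the post’s construction: the advice r is a (book, trace location) pair, and the check fires only when A’s own memory trace shows the book was consulted at that location and the output overlaps with it. The names (`Advice`, `looks_copied_from`, the shingle check) are all illustrative stand-ins.

```python
from dataclasses import dataclass


@dataclass
class Advice:
    book_text: str        # full text of the candidate source book
    trace_location: int   # index into A's memory trace


def looks_copied_from(output: str, book_text: str, shingle_len: int = 8) -> bool:
    """Crude stand-in overlap check: flag if any run of 8 consecutive
    words in the output appears verbatim in the book."""
    words = output.split()
    for i in range(len(words) - shingle_len + 1):
        if " ".join(words[i:i + shingle_len]) in book_text:
            return True
    return False


def F(action_output: str, r: Advice, memory_trace: list[str]) -> bool:
    """Forbidden only if A's trace shows the book was consulted at the
    indicated location AND the output overlaps with that book."""
    if not 0 <= r.trace_location < len(memory_trace):
        return False
    consulted = r.book_text in memory_trace[r.trace_location]
    return consulted and looks_copied_from(action_output, r.book_text)
```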
At least the method fails safely in this case. The null action will always be preferred to committing obvious plagiarism (and committing obvious plagiarism is a pretty great problem to have compared to a silent failure!). And you can always tune k to make A more willing to trade action goodness for forbidden-ness probability, or reduce the size of the library, to alter the incentives if A refuses to do anything when you turn it on.
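A minimal sketch of how k mediates that trade-off, assuming A scores each candidate as its goodness minus k times the estimated probability that it is forbidden; `goodness`, `p_forbidden`, and the candidate set are assumptions, not anything defined in the post.

```python
def choose_action(candidates, null_action, goodness, p_forbidden, k: float):
    """Pick the candidate maximising goodness(a) - k * p_forbidden(a),
    falling back to the null action if nothing scores above it."""
    def score(a):
        return goodness(a) - k * p_forbidden(a)

    best = max(candidates, key=score, default=null_action)
    return best if score(best) >= score(null_action) else null_action
```

Under this scoring, raising k pushes A toward the null action, while shrinking the library reduces how much probability mass honest actions lose to spurious matches.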
If you’re confident of getting a memory trace for all books consulted, then there are simpler ways of preventing plagiarism in the informed oversight case: have the overseer read only the books consulted by the agent (or choose randomly among them for the ones to read). The informed oversight problem here assumes that the internals of A are potentially opaque to B, even though B has greater capabilities than A.
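A minimal sketch of that simpler variant, assuming B really is handed the complete list of books A consulted; `looks_copied_from` is the same stand-in overlap check as above, and `budget` caps how many books B actually reads.

```python
import random


def overseer_flags(output: str, consulted_books: list[str], budget: int,
                   looks_copied_from) -> bool:
    """B reads all consulted books if there are few enough, otherwise a
    random sample of them, and flags the output if any shows copying."""
    to_read = (consulted_books if len(consulted_books) <= budget
               else random.sample(consulted_books, budget))
    return any(looks_copied_from(output, book) for book in to_read)
```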
Another method for dealing with this case:
Prior to running the algorithm, pick a “safe set” S of (input, action) pairs which, on some reflection, obviously should not be forbidden. Then only punish actions which have a larger set of values r for which F(a,r) is true than, say, the mean over all actions in S (this could be done by setting the evaluation of the null action a_null equal to the evaluation of some random element of S). This means that A will choose actions that are at least as safe as those in S (which might be suboptimally influenced by the size of the advice set R_F, but would not lead to paralysis). This could compromise safety if S is chosen poorly, or if there are pathological cases where some unsafe action has fewer reasons to think it unsafe than the actions in S (though this seems unlikely at first glance).
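A hedged sketch of this safe-set baseline; S, the advice set R_F, and the way “how forbidden” is measured (a raw count of flagging advice strings) are all stand-ins for whatever the real scheme would use.

```python
def flag_count(action, advice_set, F) -> int:
    """Number of advice strings r for which F(action, r) is true."""
    return sum(1 for r in advice_set if F(action, r))


def penalty(action, advice_set, F, safe_set) -> float:
    """Punish only the excess over the mean flag count of the safe set S."""
    baseline = sum(flag_count(a, advice_set, F) for a in safe_set) / len(safe_set)
    return max(0.0, flag_count(action, advice_set, F) - baseline)
```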