Re 1: This is a good point. Some thoughts:
[EDIT: See this]
We can add the assumption that the probability mass of malign hypotheses is small and that following any non-malign hypothesis at any given time is safe. Then we can probably get a query complexity that scales with p(malign)/p(true)? (A toy sketch follows these thoughts.)
However, this is cheating, because a sequence of actions can be dangerous even if each individual action comes from a (different) harmless hypothesis. So instead we want to assume something like: any dangerous strategy has high Kolmogorov complexity relative to the demonstrator. (More realistically, some kind of bounded Kolmogorov complexity.)
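One way to write this assumption down (my own formalization, so the exact form is a guess rather than the intended statement): writing K(π | π^D) for the Kolmogorov complexity of a policy π given the demonstrator's policy π^D, and Π_danger for the set of policies implementing a dangerous strategy, the condition could read:

```latex
% Hedged formalization: K(\cdot \mid \cdot) is conditional Kolmogorov complexity
% and d is a large margin; the "bounded" variant would replace K by a
% resource-bounded version K^t.
\forall \pi \in \Pi_{\mathrm{danger}} :\quad K\!\left(\pi \mid \pi^{D}\right) \;\ge\; d
```

i.e. no dangerous strategy is a short program given (a description of) the demonstrator.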
A related problem is: in reality “user takes action a” and “AI takes action a” are not physically identical events, and there might be an attack vector that exploits this.
If p(malign)/p(true) is really high, that's a bigger problem, but then doesn't that mean we actually live in a malign simulation and everything is hopeless anyway?
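To make the first thought above a bit more concrete, here is a toy sketch (my own construction, with made-up hypothesis classes, priors, and thresholds; not the actual scheme or bound): a Bayesian imitator over deterministic hypotheses about the demonstrator that acts on its own whenever all hypotheses with non-negligible posterior mass agree, and queries the demonstrator only on disagreement.

```python
import random

# Toy sketch (hypothetical setup): deterministic hypotheses about the
# demonstrator's policy; query the demonstrator only when "live" hypotheses
# disagree about the next action, and refute the ones that mispredicted.
rng = random.Random(0)
N_STEPS, ACTIONS, ALPHA = 500, (0, 1), 0.1

def make_policy():
    table = [rng.choice(ACTIONS) for _ in range(N_STEPS)]
    return lambda t: table[t]

true_policy = make_policy()
prior = {"true": 0.05}
policies = {"true": true_policy}
for i in range(5):                      # some prior mass on malign hypotheses
    prior[f"malign_{i}"], policies[f"malign_{i}"] = 0.01, make_policy()
for i in range(100):                    # many benign-but-wrong hypotheses
    prior[f"benign_{i}"], policies[f"benign_{i}"] = 0.002, make_policy()

posterior, queries = dict(prior), 0
for t in range(N_STEPS):
    # Hypotheses whose mass is within a factor ALPHA of the leader are "live".
    top = max(posterior.values())
    live = [n for n, w in posterior.items() if w >= ALPHA * top]
    proposals = {policies[n](t) for n in live}
    if len(proposals) == 1:
        action = proposals.pop()        # all live hypotheses agree: just act
    else:
        queries += 1                    # disagreement: query the demonstrator
        action = true_policy(t)
        for n in live:                  # refute live hypotheses that mispredicted
            if policies[n](t) != action:
                posterior[n] = 0.0

refuted = sum(1 for n, w in posterior.items() if n.startswith("malign") and w == 0.0)
print(f"queries: {queries}, malign hypotheses refuted: {refuted}")
```

In this toy, each query permanently removes at least ALPHA * p(true) of prior mass (the live hypotheses that mispredicted), so the number of queries is bounded by roughly the refutable prior mass above the threshold divided by ALPHA * p(true); under the assumption that following non-malign hypotheses is safe, only the malign mass needs to be driven down this way, which is where a p(malign)/p(true)-type dependence would come from.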
Re 2: Why do you need such a strong bound?