Re 1: This is a good point. Some thoughts:
[EDIT: See this]
We can add the assumption that the probability mass of malign hypotheses is small and that following any non-malign hypothesis at any given time is safe. Then we can probably get a query complexity that scales with p(malign)/p(true)? (A toy sketch follows these thoughts.)
However, this is cheating, because a sequence of actions can be dangerous even if each individual action comes from a (different) harmless hypothesis. So instead we want to assume something like: any dangerous strategy has high Kolmogorov complexity relative to the demonstrator. (More realistically, some kind of bounded Kolmogorov complexity.)
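One way to write this assumption down (my own formalization, so the exact form is a guess rather than the intended statement): writing K(π | π^D) for the Kolmogorov complexity of a policy π given the demonstrator's policy π^D, and Π_danger for the set of policies implementing a dangerous strategy, the condition could read:

```latex
% Hedged formalization: K(\cdot \mid \cdot) is conditional Kolmogorov complexity
% and d is a large margin; the "bounded" variant would replace K by a
% resource-bounded version K^t.
\forall \pi \in \Pi_{\mathrm{danger}} :\quad K\!\left(\pi \mid \pi^{D}\right) \;\ge\; d
```

i.e. no dangerous strategy is a short program given (a description of) the demonstrator.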
A related problem is: in reality “user takes action a” and “AI takes action a” are not physically identical events, and there might be an attack vector that exploits this.
If p(malign)/p(true) is really high, that's a bigger problem, but then doesn't that mean we actually live in a malign simulation and everything is hopeless anyway?
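To make the first thought above a bit more concrete, here is a toy sketch (my own construction, with made-up hypothesis classes, priors, and thresholds; not the actual scheme or bound): a Bayesian imitator over deterministic hypotheses about the demonstrator that acts on its own whenever all hypotheses with non-negligible posterior mass agree, and queries the demonstrator only on disagreement.

```python
import random

# Toy sketch (hypothetical setup): deterministic hypotheses about the
# demonstrator's policy; query the demonstrator only when "live" hypotheses
# disagree about the next action, and refute the ones that mispredicted.
rng = random.Random(0)
N_STEPS, ACTIONS, ALPHA = 500, (0, 1), 0.1

def make_policy():
    table = [rng.choice(ACTIONS) for _ in range(N_STEPS)]
    return lambda t: table[t]

true_policy = make_policy()
prior = {"true": 0.05}
policies = {"true": true_policy}
for i in range(5):                      # some prior mass on malign hypotheses
    prior[f"malign_{i}"], policies[f"malign_{i}"] = 0.01, make_policy()
for i in range(100):                    # many benign-but-wrong hypotheses
    prior[f"benign_{i}"], policies[f"benign_{i}"] = 0.002, make_policy()

posterior, queries = dict(prior), 0
for t in range(N_STEPS):
    # Hypotheses whose mass is within a factor ALPHA of the leader are "live".
    top = max(posterior.values())
    live = [n for n, w in posterior.items() if w >= ALPHA * top]
    proposals = {policies[n](t) for n in live}
    if len(proposals) == 1:
        action = proposals.pop()        # all live hypotheses agree: just act
    else:
        queries += 1                    # disagreement: query the demonstrator
        action = true_policy(t)
        for n in live:                  # refute live hypotheses that mispredicted
            if policies[n](t) != action:
                posterior[n] = 0.0

refuted = sum(1 for n, w in posterior.items() if n.startswith("malign") and w == 0.0)
print(f"queries: {queries}, malign hypotheses refuted: {refuted}")
```

In this toy, each query permanently removes at least ALPHA * p(true) of prior mass (the live hypotheses that mispredicted), so the number of queries is bounded by roughly the refutable prior mass above the threshold divided by ALPHA * p(true); under the assumption that following non-malign hypotheses is safe, only the malign mass needs to be driven down this way, which is where a p(malign)/p(true)-type dependence would come from.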
Re 2: Why do you need such a strong bound?