davidad comments on A list of core AI safety problems and how I hope to solve them

davidad 26 Aug 2023 23:06 UTC
6 points
0
A system with a shutdown timer, in my sense, has no terms in its reward function which depend on what happens after the timer expires. (This is discussed in more detail in my previous post.) So there is no reason to persuade humans or do anything else to circumvent the timer, unless there is an inner alignment failure (maybe that’s what you mean by “deception instance”). Indeed, it is the formal verification that prevents inner alignment failures.