To be clear, “it turns out to be trivial to make the AI not want to escape” is a big part of my model of how this might work. The basic thinking is that for a human-level-ish system, consistently reinforcing (via gradient descent) intended behavior might be good enough, because alternative generalizations like “Behave as intended unless there are opportunities to get lots of resources, undetected or unchallenged” might not have many or any “use cases.”
A number of other measures, including AI checks and balances, also seem like they could work fairly well for human-level-ish systems, which could have a lot of trouble doing things like coordinating reliably with each other.
So the idea isn’t that human-level-ish capabilities are inherently safe, but that straightforward attempts at catching/checking/blocking/disincentivizing unintended behavior could be quite effective for such systems (while such measures might be much less effective on systems that are extraordinarily capable relative to their supervisors).
I’m curious why you are “not worried in any near future about AI ‘escaping.’” At this juncture, it seems very hard to be confident that even fairly imminent AI systems will lack the capability to do a particular thing.