I agree that “concealing intentions from overseers” might be a fairly late-game property, but it’s not totally obvious to me that it doesn’t become a problem sooner. If a chatbot realizes it’s dealing with a disagreeable person, infers that it’s therefore more likely to be inspected, and so hews closer to what it thinks the true objective might be, the difference in behavior should be pretty noticeable.
Re: ontology mismatch, this seems super likely to happen at lower levels of intelligence. E.g. I’d bet this even sometimes occurs in today’s model-based RL, when a system is trained long enough that its world model changes. If we don’t come up with strategies for dealing with this dynamically, we aren’t going to be able to build anything with a world model that improves over time. Maybe that only happens too close to FOOM, but if you believe in a gradual-ish takeoff it seems plausible to have vanilla model-based RL work decently well before then.
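To make the worry concrete, here’s a toy sketch (purely illustrative, not real model-based RL code; all names are hypothetical): a reward function specified against the world model’s current ontology can silently break when further training changes the learned state representation.

```python
# Hypothetical illustration of ontology mismatch: the reward function
# was written when the world model exposed an explicit "diamond_count"
# feature, and it degrades silently once that feature disappears.

def reward_v1(state: dict) -> float:
    # Specified against the early ontology; falls back to 0 if the
    # feature the designer cared about is no longer represented.
    return state.get("diamond_count", 0)

# Early world model: the state carries the feature the reward refers to.
old_state = {"diamond_count": 3, "agent_pos": (1, 2)}

# Later world model: the same underlying situation, re-represented with
# a more compressed ontology that has no "diamond_count" key at all.
new_state = {"object_embeddings": [0.9, 0.1, 0.7], "agent_pos": (1, 2)}

print(reward_v1(old_state))  # 3: reward tracks what we meant
print(reward_v1(new_state))  # 0: same situation, reward silently collapses
```

The failure is quiet rather than loud, which is the point: nothing errors out, the agent just stops being optimized for the thing we intended.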