That seems rather loaded in the other direction. How about “The evidence suggests that if current ML systems were going to deceive us in scenarios that do not appear in our training sets, we wouldn’t be able to detect this or change them not to unless we found the conditions where it would happen.”?
“Deceive” kinda seems like the wrong term. Like, when the AI is saying “I hate you”, it isn’t exactly deceiving us. We could replace “deceive” with “behave badly”, yielding: “The evidence suggests that if current ML systems were going to behave badly in scenarios that do not appear in our training sets, we wouldn’t be able to detect this or change them not to unless we found the conditions where it would happen.”
I agree that terms like “lying in wait”, “treacherous plans”, or “treachery” are loaded (though they technically mean almost the same thing). So I probably should have phrased this a bit differently.
The version of your statement with “deceive” replaced seems most accurate to me.