Or maybe when they’ve been demonstrated to have assimilated the values of the rest of the population.
No way THAT could go wrong...
There are several ways that could fail. For example, the beings may simply have mastered behavior that no polynomial-time adaptive interactive test (similar to the so-called “Turing Test”) can distinguish from that of a typical population member, while actually running a program, whose source code cannot be inspected, that is hostile to that value system.