David Udell comments on David Udell’s Shortform

David Udell 12 Jul 2022 19:35 UTC
3 points
Minor spoilers for planecrash (Book 3).
So! On a few moments’ ‘first-reflection’, it seems to Keltham that estimating the probability of Civilization being run by a Dark Conspiracy boils down to (1) the question of whether Civilization’s apparently huge efforts to build anti-Dark-Conspiracy citizens constitute sincere work that makes the Dark Conspiracy’s life harder, or fake work designed to only look like that; and (2) the prior probability that the Keepers and Governance would have arrived on the scene already corrupted, during the last major reorganization of Civilization a few decades ago. Keltham basically doesn’t think it’s possible for criminal-sociopaths to take over Keepers and Governance that start out actually functioning the way they currently claim to function, nor for criminal Conspirators to successfully conceal a major Conspiracy from a functional society not run in toto by that Conspiracy.
…
Suppose that Keltham is wrong about his point 1. Suppose that the optimal strategy for a tyranny in full control, is indeed for some reason to hide behind a veneer of Civilization full of costly signals of non-Conspiracy and disobedient people like Keltham. Under this assumption, the optimal strategy for a Dark Conspiracy looks like what you think Civilization is supposed to look like, and therefore the two cases are not distinguishable by observation.
Then we have to consider the prior before evidence, which means, considering the question of how you’d end up with a Dark Conspiracy in charge in the first place, and how likely those scenarios look compared to Governance Uncorrupted.
--Eliezer, planecrash (Book 3)
My Eliezer-model says similar things about AGI behavioral profiles and AGI alignment! An AGI that is aware enough of the bigger picture of its training environment and smart enough to take advantage of that will have the option to deceive its trainers. That is, a smart, informed AGI can always show us what we want to see and therefore never be selected against while in training.
Past this threshold of situational awareness plus intelligence, we can no longer behaviorally distinguish corrigible AGIs from deceptive AGIs. So, past this point, we can only rely on our priors about the relatively likelihood of various AGI utility functions coming about earlier in training. My Eliezer-model now says that most utility functions SGD finds are misaligned with humanity’s utility function, and concludes that by this point we’re definitely fucked.