evhub comments on evhub’s Shortform

evhub 11 Jan 2023 20:43 UTC
LW: 2 AF: 2
If you want to produce warning shots for deceptive alignment, you’re faced with a basic sequencing question. If the model becomes capable of reasoning about its training process before it becomes capable of checking a predicate like RSA-2048 (e.g. whether its input contains a valid factorization of RSA-2048), then you have a chance to catch it. But if it becomes capable of checking such a predicate first, then any deceptive models you build won’t be detectable.
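To make "checking a predicate like RSA-2048" concrete, here is a minimal sketch of the kind of check I have in mind. It is only an illustration: `TOY_MODULUS` is a small stand-in I picked (3233 = 61 × 53), not the real RSA-2048 number, and the function name is hypothetical. The point is the asymmetry: verifying a claimed factorization is one multiplication, while producing one for the real modulus is believed to be infeasible for whoever is constructing the training and test inputs.

```python
# Toy stand-in for the RSA-2048 modulus (assumption: chosen only for illustration).
TOY_MODULUS = 3233  # 61 * 53


def is_valid_factorization(n: int, p: int, q: int) -> bool:
    """Return True if (p, q) is a nontrivial factorization of n.

    Checking is cheap (a single multiplication); finding p and q for a
    2048-bit semiprime is the part that is believed to be hard.
    """
    return 1 < p < n and 1 < q < n and p * q == n


if __name__ == "__main__":
    print(is_valid_factorization(TOY_MODULUS, 61, 53))   # a correct factorization
    print(is_valid_factorization(TOY_MODULUS, 1, 3233))  # the trivial factorization is rejected
```

A model that conditions its behavior on a predicate like this would only ever see it come out true once someone in the world has actually done the factoring, which is why the ordering of the two capabilities matters.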