I give a 5% probability that within the next year we will become aware of a case of deliberate harm from a model to a human enabled by hidden CoT.
By “deliberate harm enabled by hidden CoT” I mean that the hidden CoT contains reasoning like “if I give the human this advice, it will harm them, but I should do it because <some deranged RLHF directive>”, and that the harm would have been prevented if the user had seen it.
I give this low probability to the observable event: my probability that something like this happens at all is 30%, but I expect that the victim won't be aware, that the hidden CoT will be lost in archives, that AI companies won't investigate too hard in search of such cases, and that if they do find something it won't become public, etc.
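Spelling out the implied decomposition (the ~17% here is not an independent estimate, just the ratio of my 5% and 30% figures):

P(we become aware) = P(it happens) × P(it becomes known | it happens) ≈ 0.30 × 0.17 ≈ 0.05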
Also, I decreased my probability from 8% to 5% because a model can cause harm via steganographic CoT, which doesn't fall under my definition.