P(doom) = 50%. It either happens, or it doesn’t.
Lao Mein | Statistics is Hard. | Patreon
I give full permission for anyone to post part or all of any of my comments/posts to other platforms, with attribution.
Currently doing solo work on glitch tokens and tokenizer analysis. Feel free to send me job/collaboration offers.
DM me interesting papers you would like to see analyzed. I also specialize in bioinformatics.
I really wonder what effects text like this will have on future chains of thought.
If fine-tuning on text calling out LLM COT deception reduces COT deception, that’s one of those ambiguous events I wish were a fire alarm instead, and I hate those. The model could be becoming more honest and correcting a bad behavior, or it could just be learning to hide its deception better.
I think you would be able to tell which one with the right interpretability tools. We really should freak out if it’s the latter, but I suspect we won’t.
I guess the actual fire alarm would be the COT directly referencing writings like this post and how to get around them. It might actually spook a lot of people if the COT suddenly switched to ROT13 or Chinese or whatever at that point.