‘theoretical paper of early 2023 that I can’t find right now’ → perhaps you’re thinking of Fundamental Limitations of Alignment in Large Language Models? I’d also recommend LLM Censorship: A Machine Learning Challenge or a Computer Security Problem?.