LEILAN 2024! Seriously, though, I think many people would find the Leilan character to be a wiser friend than their typical human neighbor. I’m glad you’re researching this fascinating topic. If a frontier AI is struggling to pass certain friendliness or safety evals, I’d be curious whether it might perform better with a simple policy equivalent to “what would Leilan do?”
Prompting ChatGPT-4 today with nothing more than “ davidjl” has often returned “DALL-E” as its interpretation of the term. With “DALL-E” included alongside “ davidjl” in the prompt, I’ve gotten “AI” as the interpretation. Asking how an LLM might represent itself using the concept of “ davidjl” produced a response that seamlessly substituted the term “I”...
Perhaps glitch tokens can shed light on how a model represents itself.
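In case anyone wants to poke at this themselves, here is a minimal sketch of the probe via the OpenAI Python API. I ran the originals through the ChatGPT interface, so the model id and exact prompt phrasing below are my assumptions, and API responses may well differ from the chat UI:

```python
# Minimal sketch: send the " davidjl" glitch token as a bare prompt and see
# how the model interprets it. Assumes the `openai` package (v1+) and an
# OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def probe(prompt: str) -> str:
    """Send a single-user-message prompt and return the reply text."""
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model id; substitute the GPT-4 variant you have access to
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# The glitch token, with its leading space preserved.
print(probe(" davidjl"))

# Pairing it with "DALL-E", roughly as described above (phrasing is my guess).
print(probe('DALL-E " davidjl"'))
```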