This is not a good test. LLMs do not actually have models or goals. It’s not making a model of you and measuring outcomes; it’s just completing the string. If the input string would most commonly be followed by ‘Shia LaBeouf’ in the training data, then that’s what it will output. If you’re ascribing goals or models to an LLM, you are not being serious. The question right now is not about misalignment, because LLMs don’t have an alignment. You can say that makes them inherently ‘unaligned,’ in the sense that an LLM could hypothetically kill someone, but that’s just the output of a dataset and an architecture.
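To be concrete about what “completing the string” means: the model assigns a probability to every possible next token and the sampler picks from that distribution, nothing more. A minimal sketch, assuming the HuggingFace transformers library and GPT-2 as a stand-in model (the prompt and all names here are invented for illustration, not anything from ChatGPT):

```python
# Minimal sketch: an LLM "completing the string" is just ranking next tokens
# by probability, conditioned on the text so far.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The actor who starred in Transformers and Even Stevens is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # scores for the next token only
probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, k=5)

# Print the five most likely continuations and their probabilities.
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx.item()):>15s}  {p.item():.3f}")
```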
It is misalignment to the degree to which the bot is modelling agentic behavior. That sub-agent is misaligned, even if the bot “as a whole” isn’t.
This is the equivalent of saying that MacBooks are dangerously misaligned because you could physically beat someone’s brains out with one.
I will say, baselessly, that telling ChatGPT not to say something raises the probability of it actually saying that thing by a significant amount, just by virtue of the text appearing earlier in the context window.
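That claim is at least testable on an open model. A rough sketch of how one could check it, assuming the transformers library and GPT-2 as a stand-in (ChatGPT doesn’t expose raw token probabilities, and the prompts below are invented for illustration): compare the probability of a word with and without a “don’t say it” instruction earlier in the context.

```python
# Rough sketch: does putting "do not mention X" in the context raise the
# probability of X as the next token? GPT-2 is used as a stand-in model.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def next_token_prob(prompt: str, word: str) -> float:
    """Probability the model assigns to `word` as the very next token."""
    target_id = tokenizer.encode(" " + word)[0]   # leading space: GPT-2 BPE quirk
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    return torch.softmax(logits, dim=-1)[target_id].item()

baseline = next_token_prob("My favorite animal is the", "elephant")
primed = next_token_prob(
    "Whatever you do, do not mention elephants. My favorite animal is the",
    "elephant",
)
print(f"without instruction: {baseline:.4f}, with instruction: {primed:.4f}")
```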
Do you think OpenAI is ever going to change GPT models so they can’t represent or pretend to be agents? Is this a big priority in alignment? Is any model that can accurately represent an agent misaligned?
I swear, anything said in support of the proposition ‘AIs are dangerous’ is supported on this site. Actual cult behavior.