However, this would not address the underlying pattern of alignment failing to generalize.
Is there proof that this is an overall pattern? It would make sense that models are willing to do things they're not willing to talk about, but that doesn't mean there's a general pattern that e.g. they wouldn't be willing to talk about things, and wouldn't be willing to do them, but WOULD be willing to do some secret third option.
I would say the three papers show a clear pattern: alignment didn't generalize well from the chat setting to the agent setting, which is solid evidence for that thesis. That, in turn, is evidence for the stronger claim of an underlying pattern, i.e. that alignment will in general not generalize as well as capabilities. For conceptual evidence of that claim, see the linked post. My attempt to summarize the argument: capabilities are a kind of attractor state; being smarter and more capable is, in a way, an objective fact about the universe. Being more aligned with humans, however, is not a special feature of the universe but a free parameter. In fact, alignment stands in some conflict with capabilities, since instrumental incentives undermine alignment.
As for what a third option would be, i.e. the next step where alignment might not generalize:
From the article:
While it’s likely that future models will be trained to refuse agentic requests that cause harm, there are likely going to be scenarios in the future that developers at OpenAI / Anthropic / Google failed to anticipate. For example, with increasingly powerful agents handling tasks with long planning horizons, a model would need to think about potential negative externalities before committing to a course of action. This goes beyond simply refusing an obviously harmful request.
From a different comment of mine:
It seems easy to just train a model to refuse bribing, harassing, etc. But as agents take on more substantial tasks, how do we make sure they don't do unethical things while, say, running a company? Or if an agent realizes midway through a task that it is aiding in cybercrime, how should it behave?