We could eliminate all of this deception by eliminating agency. That should be possible with methods based on Gradient Routing or Self-Other Overlap Finetuning, or variants thereof. For example, gradient routing could be used to route agency and identity into one part of the network that is later ablated.
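The route-then-ablate idea can be sketched in miniature. The following is an illustrative toy, not the actual Gradient Routing implementation: it assumes a tiny linear model, a made-up "agency-labeled" flag on training examples, and hand-coded SGD. Updates from agency-labeled data are routed exclusively into a designated weight block (`w_routed`), which is then zeroed out after training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny linear model whose weights are split into a "general" half and a
# "routed" half; the routed half is the part we ablate after training.
# (The split and the labels are illustrative assumptions, not part of
# the real method, which masks gradients inside a large network.)
w_general = rng.normal(size=4)
w_routed = rng.normal(size=4)

def target(x):
    # Toy task: predict the sum of the inputs.
    return x.sum()

def predict(x):
    return x[:4] @ w_general + x[4:] @ w_routed

def sgd_step(x, y, route_to_ablatable, lr=0.05):
    """One SGD step; 'agency-labeled' examples update only w_routed."""
    global w_general, w_routed
    err = predict(x) - y
    if route_to_ablatable:
        # Gradient routed to the ablatable block only.
        w_routed = w_routed - lr * err * x[4:]
    else:
        # All other data trains only the general block.
        w_general = w_general - lr * err * x[:4]

for step in range(2000):
    x = rng.normal(size=8)
    sgd_step(x, target(x), route_to_ablatable=(step % 2 == 0))

# Ablation: zero the routed block, removing whatever was localized there
# while leaving the general block's learned weights intact.
w_routed = np.zeros_like(w_routed)
```

In a real transformer this masking would be applied to gradients of chosen modules (e.g. via backward hooks) rather than by hand, but the core mechanism is the same: decide per-example where gradients may flow, so the unwanted capability ends up concentrated in a removable subnetwork.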
The problem is that we want the model to have agency. We want actors that can solve complex tasks, do things on our behalf, and pursue our goals for us.
I see two potential ways out:
1. Extensions of current self-other overlap finetuning that ensure the agent genuinely adopts our goals as its own. It would then be deceptive only if we ourselves would be.
2. Finding ways to make good use of LLMs that are not agentic. Oracle or tool AIs could still be useful.