(Note: I jotted down these thoughts from my phone while on a plane, as I finished reading LeCun’s paper. They are rough and underdeveloped, but they are still points I find interesting and that I think might spark some good discussion.)
A modular architecture like this would have interpretability benefits and alignment implications. The separate “hard-wired” cost module is very significant. If this were successfully built, it could effectively leapfrog us into best-case interpretability scenario 2.
How will these costs be specified? That looks like a big open question from this paper. It seems like the world model would need to be trained first, and be interpretable, so that the cost module can make use of its abstractions.
Is it possible to mix hardwired components with trained models? “Risks from Learned Optimization” mentioned hardcoded optimizers as a possibility for preventing the emergence of mesa optimizers. “Tool-using AI” research may be relevant too, though interestingly, here the cost module would be the tool.
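To make this concrete, here is a minimal, entirely hypothetical sketch of what mixing a hard-wired cost module with trained modules could look like, in PyTorch. The dimensions, module shapes, and placeholder cost function are my own assumptions, not anything from the paper; the only point is that the intrinsic cost can be a fixed function with no trainable parameters, sitting alongside modules that are trained by gradient descent.

```python
import torch
import torch.nn as nn

# Hypothetical, simplified stand-ins for the paper's modules.
STATE_DIM, ACTION_DIM = 16, 4

def intrinsic_cost(state: torch.Tensor) -> torch.Tensor:
    """Hard-wired cost: a fixed function of the predicted state, not a learned one.
    (Arbitrary placeholder here: penalize large state magnitudes.)"""
    return state.pow(2).sum(dim=-1)

world_model = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.Tanh(),
                            nn.Linear(64, STATE_DIM))   # trainable
actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(),
                      nn.Linear(64, ACTION_DIM))        # trainable
critic = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(),
                       nn.Linear(64, 1))                # trainable

# Only the learned modules get an optimizer; the hard-wired cost has nothing to train.
params = (list(world_model.parameters()) + list(actor.parameters())
          + list(critic.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-3)
```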
The ultimate goal of the agent is to minimize the intrinsic cost over the long run.
The agent is an optimizer by design!
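As I understand the paper's planning mode, the agent proposes an action sequence, rolls it out in the world model, scores the imagined trajectory with the intrinsic cost plus the critic's estimate of cost beyond the horizon, and optimizes the actions against that score. Below is a rough, self-contained sketch under the same toy assumptions as above; the gradient-descent planner, horizon, and all names are my own simplifications, not the paper's procedure.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, HORIZON = 16, 4, 5

def intrinsic_cost(state):
    # Hard-wired placeholder cost (see the previous sketch).
    return state.pow(2).sum(dim=-1)

world_model = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.Tanh(),
                            nn.Linear(64, STATE_DIM))
critic = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, 1))

def plan(initial_state, steps=50, lr=0.1):
    """Optimize an action sequence so the imagined rollout minimizes intrinsic
    cost over the horizon plus the critic's estimate of cost beyond it."""
    actions = torch.zeros(HORIZON, ACTION_DIM, requires_grad=True)
    opt = torch.optim.SGD([actions], lr=lr)
    for _ in range(steps):
        state, total_cost = initial_state, 0.0
        for t in range(HORIZON):
            state = world_model(torch.cat([state, actions[t]]))
            total_cost = total_cost + intrinsic_cost(state)
        total_cost = total_cost + critic(state).squeeze()  # estimated long-run cost
        opt.zero_grad()
        total_cost.backward()
        opt.step()
    return actions.detach()

best_actions = plan(torch.randn(STATE_DIM))
```

The optimization over actions is exactly the sense in which the agent is an optimizer by design: the objective it pursues is whatever the cost module (and the critic trained against it) says it is.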
If a solution is developed for specifying the cost module, then the agent may be inner aligned.
But many naive ways of specifying the cost module (e.g., making human pain a cost) seem to lead straight to catastrophe for a sufficiently advanced system, via outer alignment failures and instrumental convergence.
Could this architecture be leveraged to implement a cost module that’s more likely to be outer aligned, such as imitative HCH or some other myopic objective?
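I won't try to sketch an imitative-HCH cost, but the "myopic objective" half of the question is easy to picture mechanically: restrict the planner to the immediately predicted state and drop the critic / long-run term. The toy below is purely illustrative, built on the same assumed stand-ins as the earlier sketches, and says nothing about whether such an objective would actually be safe or useful.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 16, 4

def myopic_cost(state):
    # Hypothetical placeholder: only the immediate predicted outcome is scored;
    # no critic or long-horizon term is added.
    return state.pow(2).sum(dim=-1)

world_model = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.Tanh(),
                            nn.Linear(64, STATE_DIM))

def act_myopically(state, steps=50, lr=0.1):
    """Pick a single action by minimizing the cost of the next predicted state only."""
    action = torch.zeros(ACTION_DIM, requires_grad=True)
    opt = torch.optim.SGD([action], lr=lr)
    for _ in range(steps):
        cost = myopic_cost(world_model(torch.cat([state, action])))
        opt.zero_grad()
        cost.backward()
        opt.step()
    return action.detach()
```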
Are the trainable modules (critic, world model, actor) subject to mesa-optimization risk?