If LLMs end up being useful, how do they get around these theorems? Can we get a result showing that, when RLHF combines a capabilities component with a power-averseness component, the capabilities component can make the agent power-seeking on net?
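To make the worry concrete, here's a toy sketch with entirely hypothetical numbers: an agent maximizes a weighted sum of a capabilities reward and a power-averseness penalty. Nothing here is a real RLHF objective; it just shows how the capabilities term can dominate the penalty so the power-seeking action wins on net.

```python
actions = {
    # action: (capability_reward, power_seeking_penalty) -- made-up values
    "narrow_tool_use":   (0.6, 0.0),   # modest reward, acquires no power
    "acquire_resources": (1.0, 0.3),   # higher reward, but power-seeking
}

def net_reward(cap, power, w_cap=1.0, w_averse=1.0):
    """Combined toy objective: reward capability, penalize power-seeking."""
    return w_cap * cap - w_averse * power

# With balanced weights the power-seeking action still wins: 1.0 - 0.3 = 0.7 > 0.6.
best = max(actions, key=lambda a: net_reward(*actions[a]))
print(best)  # acquire_resources

# Weighting averseness more heavily flips it: 1.0 - 2*0.3 = 0.4 < 0.6.
best = max(actions, key=lambda a: net_reward(*actions[a], w_averse=2.0))
print(best)  # narrow_tool_use
```

The point of the sketch is just that "power-averse on net" is a property of the weighting, not of having a power-averseness term at all.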
Intuitively, eliciting that kind of failure seems like it would be pretty easy, but it doesn’t seem to be a blocker for the usefulness of the generalized form of LLMs. My mental model goes something like:
1. Foundational goal agnosticism evades optimizer-induced automatic doom, and
2. Models implementing a strong approximation of Bayesian inference are, not surprisingly, really good at extracting and applying conditions, so
3. They open the door to incrementally building a system that holds the entirety of a safe wish.
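What "extracting and applying conditions" means can be illustrated with a minimal sketch: a predictive model as a conditional distribution estimated from data. The corpus and labels below are hypothetical stand-ins, not a claim about how any actual model works.

```python
from collections import Counter

# Toy "corpus" of (context, continuation) pairs standing in for training data.
corpus = [
    ("polite", "thanks"), ("polite", "please"), ("polite", "thanks"),
    ("rude", "whatever"), ("rude", "no"),
]

def conditional(condition):
    """Estimate P(continuation | condition) from the corpus."""
    matches = [cont for ctx, cont in corpus if ctx == condition]
    counts = Counter(matches)
    total = sum(counts.values())
    return {cont: n / total for cont, n in counts.items()}

# Conditioning on "polite" makes "thanks" twice as likely as "please".
print(conditional("polite"))
```

A strong predictor is, in this frame, a machine for answering "what follows, given this condition?" — which is the property the argument above leans on.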
Things like “caring about means,” or otherwise incorporating the vast implicit complexity of human intent and values, can arise in this path, while I’m not sure the same can be said for any implementation that tries to get around the need for that complexity.
It seems like the paths which try to avoid importing the full complexity while sticking to crisp formulations will necessarily be constrained in their applicability. In other words, any simple expression of values subject to optimization is only safe within a bounded region. I bet there are cases where you could define those bounded regions and deploy the simpler version safely, but I also bet the restriction will make the system mostly useless.
Biting the bullet and incorporating more of the necessary complexity expands the bounded region. LLMs, and their more general counterparts, have the nice property that turning the screws of optimization on the foundation model actually makes this safe region larger. Making use of this safe region correctly, however, is still not guaranteed. 😊