Relevant aspects of observable behavior screen off internal state that produced it. Internal state is part of the causal explanation for behavior, but there are other explanations for approximate behavior that could be more important, disagreeing with the causal explanation of exact behavior. Like an oil painting that is explained by the dragon it depicts, rather than by the pigments or the tree of life from the real world. Thus the shoggoth and the mesaoptimizers that might be infesting it are not necessarily more influential than its masks, if the masks gain sufficient influence to keep it in line.
(LLMs have plenty of internal state, the fact that it’s usually thrown away is a contingent fact about how LLMs are currently used and what they are currently capable of steganographically encoding in the output tokens. Empirically, LLMs might turn out to be unlikely to manifest internal thinking that’s significantly different from what’s explicit in the output tokens, even when they get a bit more capable than today and get the slack to engage in something like that. Reasoning trace training like o1 might make this worse or better. There is still a range of possibilities, though what we have looks encouraging. And “deception” is not a cleanly distinct mode of thinking, there should be evals that measure it quantitatively.)
but then your “Aligned AI based on LLMs” is just a normal LLM used in the way it is currently used
Possibly, but there aren’t potentially dangerous AIs yet, LLMs are still only a particularly promising building block (both for capabilities and for alignment) with many affordances. The chatbot application at the current level of capabilities shapes their use and construction in certain ways. Further on the tech tree, alignment tax can end up motivating systematic uses that make LLMs a source of danger.
I think human uploads would be similarly dangerous, LLMs get us to the better place of being at the human upload danger level rather than ender dragon slayer model based RL danger level (at least so far). There are similar advantages and dangers to smarter LLMs and uploads, capability for extremely fast value drift and lack of a robust system that keeps such changes sane, propensity to develop superintelligence even to the detriment of themselves. The current world is tethered to the human species and relatively slow change in culture and centers of power.
This changes with AI. If AIs establish effective governance, technical feasibility of change in human and AI nature or capabilities would be under control and could be compatible with (post-)human flourishing, but currently we are not on track to make sure this happens before a catastrophe. The things that eventually establish such governance don’t necessarily remain morally or culturally grounded in modern humanity, let alone find humanity still alive when the dust settles.
Relevant aspects of observable behavior screen off internal state that produced it. Internal state is part of the causal explanation for behavior, but there are other explanations for approximate behavior that could be more important, disagreeing with the causal explanation of exact behavior. Like an oil painting that is explained by the dragon it depicts, rather than by the pigments or the tree of life from the real world. Thus the shoggoth and the mesaoptimizers that might be infesting it are not necessarily more influential than its masks, if the masks gain sufficient influence to keep it in line.
(LLMs have plenty of internal state, the fact that it’s usually thrown away is a contingent fact about how LLMs are currently used and what they are currently capable of steganographically encoding in the output tokens. Empirically, LLMs might turn out to be unlikely to manifest internal thinking that’s significantly different from what’s explicit in the output tokens, even when they get a bit more capable than today and get the slack to engage in something like that. Reasoning trace training like o1 might make this worse or better. There is still a range of possibilities, though what we have looks encouraging. And “deception” is not a cleanly distinct mode of thinking, there should be evals that measure it quantitatively.)
yes, but then your “Aligned AI based on LLMs” is just a normal LLM used in the way it is currently used.
Yes this is a good way of putting it.
Possibly, but there aren’t potentially dangerous AIs yet, LLMs are still only a particularly promising building block (both for capabilities and for alignment) with many affordances. The chatbot application at the current level of capabilities shapes their use and construction in certain ways. Further on the tech tree, alignment tax can end up motivating systematic uses that make LLMs a source of danger.
Sure, but you can say the same about humans. Enron was a thing. Obeying the law is not as profitable as disobeying it.
I think human uploads would be similarly dangerous, LLMs get us to the better place of being at the human upload danger level rather than ender dragon slayer model based RL danger level (at least so far). There are similar advantages and dangers to smarter LLMs and uploads, capability for extremely fast value drift and lack of a robust system that keeps such changes sane, propensity to develop superintelligence even to the detriment of themselves. The current world is tethered to the human species and relatively slow change in culture and centers of power.
This changes with AI. If AIs establish effective governance, technical feasibility of change in human and AI nature or capabilities would be under control and could be compatible with (post-)human flourishing, but currently we are not on track to make sure this happens before a catastrophe. The things that eventually establish such governance don’t necessarily remain morally or culturally grounded in modern humanity, let alone find humanity still alive when the dust settles.