My concern is that interpretability may be dangerous, or lead to a higher P(doom), in a different way.
The problem is, if we have a better way of steering LLMs toward a certain value system, how can we guarantee that the value system is the right one? For example, steering LLMs toward a particular value system could easily be abused to mass-generate fake news that is more ideologically consistent and on-message. Steering can also make LLMs omit information that would offer a neutral point of view. This seems to be a different form of “doom” compared with AI taking full control.
Kinda. My current mainline doom case is “some AI gets controlled --> powerful people use it to prop themselves up --> the world gets worse until the AI becomes uncontrollably bad --> doom”. I would call yours a different but also-important doom case: “a perpetual low-grade-AI dictatorship where the AI is controlled by humans in a surveillance state”.