There’s no such thing as “faithful CoT”, because even if what you are doing is locally interpretable in terms of CoT, the way it adds up over the world still depends on the exogenous disequilibria you are acting on, as well as on the ways your diversions of these disequilibria support themselves. At best you can get interpretability for infinitesimal perturbations, but why care about infinitesimal perturbations?
What do you mean by “over the world”? Are you including human coordination problems in this?
CoT is generally done for one level of data processing, e.g. responding to a single user in a chat, or performing one step of a task for an autonomous LLM agent. However, when AIs are deployed, they are usually asked to do many levels of data processing, e.g. respond to many users or perform many steps sequentially. It doesn’t matter whether the chains of thought are highly faithful for pretty much all of these levels individually; what matters is whether they are faithful in aggregate, i.e. for the few most important levels of data processing, as well as in how the overall wave of smaller interactions adds up.
I’d like to see how this would compare to a human organization. Suppose individual workers, or individual worker-interactions, are all highly faithful in a tech company. Naturally, though, the tech company as a whole will begin exhibiting misalignment, tend towards blind profit-seeking, etc., despite the faithfulness of its individual parts.
Is that the kind of situation you’re thinking of here? Is that why having mind-reading equipment that forced all the workers to dump their inner monologues wouldn’t actually be of much use for aligning the overall system, because the real problem is something like the aggregate or “emergent” behavior of the system rather than the faithfulness of the individual parts?
My threat model is entirely different: even if human organizations are misaligned today, they rely primarily on human work, and so they pass tons of resources and discretion on to humans, and try to ensure that those humans are vigorous.
Meanwhile, under @mattmacdermott’s alignment proposal, one would get rid of the humans and pass tons of resources and discretion on to LLMs. Whether one values the LLMs is up to you, but if human judgement and resource ownership are removed, that obviously means humans lose control, unless one can change organizations to give control to non-employees instead of employees (and it’s questionable how meaningful that is).