CoT is generally done at one level of data processing, e.g. responding to a user in a chat, or performing one step of a task for an autonomous LLM agent. However, when AIs are deployed, they are usually asked to do many levels of data processing, e.g. respond to many users or perform many steps sequentially. It doesn’t matter much whether the chains of thought are highly faithful at nearly all of these levels individually; what matters is whether they are faithful in aggregate, i.e. at the few most important levels of data processing, and in how the overall wave of smaller interactions adds up.
I’d like to see how this compares to a human organization. Suppose individual workers, or individual worker interactions, are all highly faithful in a tech company. Naturally, though, the entire tech company will begin exhibiting misalignment, tending towards blind profit-seeking, etc., despite the faithfulness of its individual parts.
Is that the kind of situation you’re thinking of here? Is that why having mind-reading equipment that forced all the workers to dump their inner monologue wouldn’t actually be of much use towards aligning the overall system, because the real problem is something like the aggregate or “emergent” behavior of the system, rather than the faithfulness of the individual parts?
My threat model is entirely different: even if human organizations are misaligned today, they rely primarily on human work, and so they pass tons of resources and discretion on to humans and try to ensure that those humans are vigorous.
Meanwhile, under @mattmacdermott’s alignment proposal, one would get rid of the humans and pass tons of resources and discretion on to LLMs. Whether one values the LLMs is up to you, but if human judgement and resource ownership are removed, that obviously means humans lose control, unless one can change organizations to give control to non-employees instead of employees (and it’s questionable how meaningful that is).