I’d like to see how this compares to a human organization. Suppose the individual workers, or the individual worker-interactions, in a tech company are all highly faithful. Naturally, though, the company as a whole will begin exhibiting misalignment, tending towards blind profit-seeking and so on, despite the faithfulness of its individual parts.
Is that the kind of situation you’re thinking of here? Is that why mind-reading equipment that forced all the workers to dump their inner monologues wouldn’t actually be of much use for aligning the overall system: because the real problem is something like the aggregate or “emergent” behavior of the system, rather than the faithfulness of its individual parts?
My threat model is entirely different. Even if human organizations are misaligned today, they rely primarily on human work, and so they pass tons of resources and discretion on to humans, and try to ensure that those humans are vigorous.
Meanwhile, under @mattmacdermott’s alignment proposal, one would get rid of the humans and pass tons of resources and discretion on to LLMs instead. Whether one values the LLMs is up to you, but if human judgement and resource ownership are removed, that obviously means humans lose control, unless one can change organizations to give control to non-employees instead of employees (and it’s questionable how meaningful that would be).