If a coordination point is sticking, reducing it to a financial trade helps speed it up by turning the hidden information into a single willingness-to-pay / willingness-to-be-paid number.
I don’t disagree with this. I would add that if agents aren’t aligned, an additional inefficiency enters this pricing process: each agent now has an incentive to distort the price in its own favor, and this (together with information asymmetry) means some mutually profitable trades will not occur.
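To make this concrete, here is a minimal sketch (my own illustration, not from the thread) of a sealed-bid trade in which each side shades its report to move the price in its own favor; the shading fractions and the midpoint pricing rule are assumptions chosen for illustration. Even small strategic misreporting kills trades whose surplus is smaller than the combined shading, which is the flavor of inefficiency described above (and the kind of result formalized by Myerson–Satterthwaite).

```python
# Illustrative sketch: misaligned agents distort reported prices, and some
# mutually profitable trades are lost. Shading fractions are assumptions.

def trade_outcome(value, cost, buyer_shade=0.1, seller_shade=0.1):
    """Buyer privately values the task at `value`; seller can do it at `cost`.
    Each misreports to move the price in their favor."""
    bid = value * (1 - buyer_shade)   # buyer understates willingness-to-pay
    ask = cost * (1 + seller_shade)   # seller overstates willingness-to-be-paid
    if bid >= ask:
        price = (bid + ask) / 2       # trade clears at the midpoint of reports
        return True, price
    return False, None                # a potentially profitable trade is lost

# Aligned (honest) agents: any value > cost trade happens.
print(trade_outcome(100, 95, buyer_shade=0, seller_shade=0))  # (True, 97.5)

# Self-interested shading: the same trade fails even though value > cost,
# so a surplus of 5 goes unrealized.
print(trade_outcome(100, 95))  # (False, None)
```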
Figuring out the costs of an action in someone else’s world is detailed and costly work; price mechanisms + incentives can communicate this information far more efficiently, and in these two situations having trust-in-honesty (and very aligned goals) does not change that fact.
Some work being “detailed and costly” isn’t necessarily a big problem for HCH, since we theoretically have an infinite tree of free labor, whereas the inefficiencies introduced by agents having different values/interests seem potentially of a different character. I’m not super confident about this (and I’m overall pretty skeptical about HCH for this and other reasons), but I do think John was too confident in his position in the OP, or at least hasn’t explained it enough. To restate the question I see being unanswered: why is alignment + infinite free labor still not enough to overcome the problems we see with actual human orgs?
Some work being “detailed and costly” isn’t necessarily a big problem for HCH, since we theoretically have an infinite tree of free labor
Huh, my first thought was that the depth of the tree is measured in training epochs, while width is cheap, since HCH is just one model and going much deeper amounts to running more training epochs. But how deep we effectively go depends on how robust the model is to the particular prompts that occur on that path in the tree, and there could be a way to decide whether to run a request explicitly, unwinding another level of the subtree as multiple instances of the model (deliberation/reflection), or to answer it immediately with a single instance, relying on what’s already in the model (intuition/babble). This way, the effective depth of the tree at a given level of training could extend further, so the effect of learning effort on performance would increase.
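As a rough sketch of this decision rule (my own toy formulation: `model`, `robustness`, and `decompose` are hypothetical stand-ins, and the threshold and depth cap are assumed parameters), the recursion might look like:

```python
# Toy sketch of HCH-style answering with a deliberate-vs-intuit decision.

def hch_answer(prompt, model, robustness, decompose, threshold=0.8, max_depth=10):
    """Answer immediately (intuition) when the model is robust on this prompt;
    otherwise unwind one more level of the subtree (deliberation)."""
    if max_depth == 0 or robustness(model, prompt) >= threshold:
        return model(prompt)                      # single instance: intuition/babble
    subquestions = decompose(model, prompt)       # ask copies of the same model
    subanswers = [hch_answer(q, model, robustness, decompose, threshold, max_depth - 1)
                  for q in subquestions]
    return model((prompt, tuple(subanswers)))     # combine subanswers into a reply
```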
This decision mirrors what happens at the goodhart boundary pretty well (there, you don’t allow incomprehensible/misleading prompts that fall outside the boundary), but the decision here is made further from the boundary: very familiar prompts can be answered immediately, while less familiar but still comprehensible prompts motivate unwinding the subtree by another level, implicitly creating more training data to improve robustness on those prompts.
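Stated as a three-way rule (again just a sketch; the two threshold values and the scalar `robustness_score` are assumptions for illustration):

```python
# Three regions relative to the goodhart boundary: refuse / deliberate / intuit.

def classify_prompt(robustness_score, boundary=0.3, intuition=0.8):
    """Below the goodhart boundary: refuse (incomprehensible, crash space).
    Between the thresholds: comprehensible but unfamiliar, so deliberate.
    Above the intuition threshold: familiar enough to answer immediately."""
    if robustness_score < boundary:
        return "refuse"       # outside the goodhart boundary
    if robustness_score < intuition:
        return "deliberate"   # unwind another level of the subtree
    return "intuit"           # answer with a single model instance
```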
The intuitive answers that don’t require deliberation are close to the center of the concept of aligned behavior, while incomprehensible situations in the crash space are where the concept (in current understanding) fails to apply. So this is another reason to associate robustness with the goodhart boundary and treat the boundary as a robustness threshold: centrally aligned behavior then occurs in situations where the model’s robustness is above another, higher threshold.
(I have added the point I wanted to add to this conversation, and will tap out now.)