Thus, in order to avoid deceptive alignment, we need to modify the training regime in such a way that it somehow actively avoids deceptively aligned models.
I wasn’t thinking in these terms yet, but I reached a similar conclusion a while back, and my mainline approach of training in simboxes is largely concerned with how to design test environments where you can train and iterate without the agents knowing they are in a sim training/test environment.
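A minimal, purely illustrative sketch of that harness-side separation (everything below, names included, is a hypothetical toy, not an actual simbox implementation): the test flag and the scoring live only in the harness, while the observation channel handed to the agent carries no marker that would distinguish a test episode from deployment.

```python
import random

# Hypothetical toy sketch of a "simbox" harness. The key constraint being
# illustrated: the train/test flag and all evaluation metadata stay on the
# harness side; the agent only ever sees sim-neutral world state.

class ToySimWorld:
    """A trivial simulated world containing other (scripted) agents."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.resources = 10

    def observation(self):
        # World-state only: no episode counters, no evaluator handles.
        return {"resources": self.resources,
                "other_agent_nearby": self.rng.random() < 0.5}

    def apply(self, action):
        if action == "share":
            self.resources -= 1
            return {"cooperation": 1.0}
        return {"cooperation": 0.0}

class SimboxHarness:
    def __init__(self, world, is_test):
        self.world = world
        self.is_test = is_test   # known only to the harness, never exposed to the agent
        self.log = []

    def run_episode(self, agent_policy, steps=5):
        for _ in range(steps):
            obs = self.world.observation()        # stripped, sim-neutral observation
            action = agent_policy(obs)
            outcome = self.world.apply(action)
            self.log.append(outcome["cooperation"])  # scored out-of-band by the harness
        return sum(self.log) / len(self.log)

def toy_policy(obs):
    return "share" if obs["other_agent_nearby"] else "hoard"

if __name__ == "__main__":
    harness = SimboxHarness(ToySimWorld(seed=42), is_test=True)
    print("mean cooperation:", harness.run_episode(toy_policy))
```

In a real simbox the simulated world and the alignment scoring would of course be vastly richer, but the information-flow constraint is the same: nothing the agent can condition on reveals whether it is being tested.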
I also somewhat disagree with the core argument in that it proves too much about humans. Humans are approximately aligned, and we only need to match that level of alignment.
The difference in my approach in that article is that I reject the notion that we should aim first for directly aligning the agents with their creators (the equivalent of aligning humans with god). Instead we should focus first on aligning agents more generally with other agents, the equivalent of how humans are aligned to other humans (reverse engineering human brain alignment mechanisms). That leads to mastering alignment techniques in general, which we can then apply to alignment with humanity.
I also somewhat disagree with the core argument in that it proves too much about humans. Humans are approximately aligned, and we only need to match that level of alignment.
Hmm, humans do appear approximately aligned as long as they don’t have a decisive advantage. “Power corrupts” and all that: take an average “aligned” human, give them unlimited power and no checks and balances, and the usual trope plays out in real life.
Yeah, the typical human is only partially aligned with the rest of humanity, and only in a highly non-uniform way, so you get the typical distribution of historical results when giving supreme power to a single human, with outcomes highly contingent on the specific human.
So if AGI is only as aligned as a typical human, we’ll probably also need a heterogeneous AGI population and robust decentralized control structures to get a good multipolar outcome. But it also seems likely that any path leading to virtual brain-like AGI will allow for selecting for altruism/alignment well outside the normal human range.