What is the latest thinking/discussion about this? I tried searching LW/AF but haven't found much discussion, especially positive arguments for HCH being good. Do you have any links or docs you can share?
How do you think about the general unreliability of human reasoning (for example, the majority of professional decision theorists apparently being two-boxers and favoring CDT, and general overconfidence of almost everyone on all kinds of topics, including morality and meta-ethics and other topics relevant for AI alignment) in relation to HCH? What are your guesses for how future historians would complete the following sentence? Despite human reasoning being apparently very unreliable, HCH was a good approximation target for AI because …
instead relies on some claims about offense-defense between teams of weak agents and strong agents
I'm curious whether you have an opinion on where the burden of proof lies for claims like these. In practice, I feel it falls to people like me to offer sufficiently convincing skeptical arguments if we want to stop AI labs from pursuing their plans (since we have little power to do anything else). But morally, shouldn't the AI labs have much stronger theoretical foundations for their alignment approaches before, e.g., trying to build a human-level alignment researcher in 4 years? (Because if the alignment approach doesn't work, we would either end up with an unaligned AGI, or be very close to being able to build AGI with no way to align it.)