I might be missing some context here, but I didn’t understand the section “No Indescribable Hellworlds Hypothesis” and what hellworlds have to do with debate.
Not Abram, and I have only skimmed the post so far, and maybe you’re pointing to something more subtle, but my understanding is this:
In Stuart’s original use, ‘No Indescribable Hellworlds’ is the hypothesis that in any possible world in which a human’s values are violated, the violation is describable: one can point out to the human how her values are violated by the state of affairs.
Analogously, debate as an approach to alignment could be seen as predicated on a similar hypothesis: that in any possible flawed argument, the flaw is describable: one can point out to a human how the argument is flawed.
Edited to add: The additional claim in the Hellworlds section is that acting according to the recommendations of debate won’t lead to very bad outcomes, or at least not to ones which could be pointed out. For example, we can imagine a debate around the question “Should we enact policy X?”. A very strong argument against, if it can be credibly made, is “Enacting policy X leads to an unacceptable violation Y of your values down the line”. So, debate will only recommend policy X if no such argument is available.
I’m not sure to what extent I buy this additional claim. For example, suppose a system trained via debate, once actually deployed, doesn’t get asked questions like ‘Should we enact policy X?’ but instead more specific ones like ‘How much does policy X improve metric Y?’. Then unless debaters are incentivised to challenge the question’s premises (“Metric Y would improve, but you should also consider the unacceptable effect on Z”), we could use debate and still get hellworlds.