Another thing that maybe I didn’t make clear previously:
I believe the point about Turing machines was that, given a Low Bandwidth Overseer, it’s not clear how to get HCH/IA to do complex tasks without making it instantiate arbitrary Turing machines.
I agree, but if you’re instructing your humans not to instantiate arbitrary Turing machines, then that’s a competitiveness claim, not an alignment claim. I think there are lots of very valid reasons for thinking that HCH is not competitive—I only said I was skeptical of the reasons for thinking it wouldn’t be aligned.
But if you put aside competitiveness, can’t HCH be trivially aligned? E.g., you give the humans making up HCH instructions to cause it to not be able to answer anything except simple arithmetic questions. So it seems that a claim of HCH being aligned is meaningless unless the claim is about being aligned at some level of competitiveness.
That’s a good point. What I really mean is that I think the sort of HCH that you get out of taking actual humans and giving them careful instructions is more likely to be uncompetitive than it is to be unaligned. Also, I think that “HCH for a specific H” is more meaningful than “HCH for a specific level of competitiveness,” since we don’t really know what weird things you might need to do to produce an HCH with a given level of competitiveness.