In a recent post, John mentioned that Corrigibility being a subset of Human Values means we should consider using Corrigibility as an alignment target. This is a useful perspective, but I want to register that X⊆Y doesn’t always imply that doing X is “easier” than doing Y. This is similar to the problems with problem factorization for alignment, but even stronger: even if we only want to solve X and not Y, X can still be harder!
For a few examples of this:
Acquiring half a strawberry by itself is harder than acquiring a full strawberry (you have to get a full strawberry, then cut it in half). (This holds for X = half a MacBook, person, or penny too.)
Let L be a lemma used in a proof of T (meaning L⊆T in some sense). It may be that T can be immediately proved via a known, more general theorem T′. In this case L is harder to prove directly than T.
When writing an essay, writing section 3 alone can be harder than writing the whole essay, because it interacts with the other parts, you learn from writing the previous parts, etc. (Sidenote: there’s a trivial sense[1] in which writing section 3 can be no harder than writing the whole essay, but in practice we don’t care, as the whole point of considering a decomposition is to do better.)
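To make the lemma bullet concrete, here is a small Lean 4 sketch (an illustrative stand-in I’m adding, not from the original post): `T` is immediate because the more general theorem `Nat.add_comm` (playing the role of T′) is already known, while the sub-lemma `L`, which appears inside a direct inductive proof of commutativity, needs an induction of its own:

```lean
-- T: one line, via the already-known general theorem T′ (Nat.add_comm).
theorem T (n : Nat) : 2 + n = n + 2 := Nat.add_comm 2 n

-- L: a lemma used inside the standard inductive proof of add_comm.
-- Proved from scratch, it takes more work than invoking T′ did above.
theorem L (m n : Nat) : Nat.succ m + n = Nat.succ (m + n) := by
  induction n with
  | zero => rfl
  | succ n ih => rw [Nat.add_succ, ih, Nat.add_succ]
```

The point is only about relative cost: `L` is “contained in” a proof of commutativity, yet proving `L` directly is more effort than deriving `T` from the general result.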
In general, depending on how “natural” the subproblem in the factorization is, subproblems can be harder than solving the original problem. I believe this may (30%) be the case with corrigibility, mainly because (1) corrigibility is anti-natural in some ways, and (2) humans are pretty good at human values while being not-that-corrigible.
I believe this may (30%) be the case with corrigibility
Surprising agreement with my credence! On a first skim I thought “Uli isn’t thinking correctly about how humans may have an explicit value for corrigible things: if humans have 10B values and we have an adequate value theory, solving corrigibility only requires searching for 1 value in the brain, while solving value-alignment requires searching for 10B values”, and I decided this class of arguments brings something roughly corresponding to ‘corrigibility is easier’ to 70%. But then I looked at your credence, and it turned out we agreed.
Mmm, I think it matters a lot which of the 10B[1] values are harder to instill, and I think most of the difficulty is in corrigibility. Strong corrigibility seems like it basically solves alignment. If this is the case, then corrigibility is a great thing to aim for, since it’s the real “hard part”, as opposed to random human values. I’m ranting now though… :L
[1] I think it’s way less than 10B, probably <1000, though I haven’t thought about this much and don’t know what you’re counting as one “value”. (If you mean value shards, maybe closer to 10B; if you mean human-interpretable values, I think <1000.)
[1] Just write the whole thing, then throw everything else away!