Maybe I’m missing something, but I don’t understand why you’re considering the output of the simplest model that provides some checkable information to be accessible. It seems to me like that simplest model could very well be implementing a policy like BAD that would cause its output on the uncheckable information to be false or otherwise misleading. Thus, it seems to me like all of the problems you talk about in terms of the difficulty of getting access to inaccessible information also apply to the uncheckable information accessible via transfer.
I don’t mean to say that “What’s the weight of Neptune?” is accessible if a model transfers to saying “The weight of Neptune is 100kg.” I mean that “What’s the weight of Neptune?” is accessible if a model transfers to correctly reporting the weight of Neptune (or rather if it transfers in such a way that its answers give real evidence about the weight of Neptune, or rather that the evidence is accessible in that case, or… you can see why it’s hard to be formal).
If we wanted to be more formal but less correct, we could talk about accessibility of functions from possible worlds. Then a function f* is accessible if you can check a claimed value of f*(real world) (using oracles for other accessible functions), or if you can find some encoding R of functions and some value r* such that the simplest function mapping R(f) → f(real world) for all accessible functions f also maps r* → f*(real world).
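To make the "simplest function that fits the accessible cases" idea concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration: the three candidate models, the integer complexity scores attached to them, and the names `CHECKABLE`, `R_STAR`, `simplest_consistent` and `transfers_correctly` are all hypothetical, and the encoding R is collapsed to a plain string key. It only shows the shape of the definition, not a real way to measure simplicity.

```python
from typing import Callable, Dict, List, Tuple

World = Dict[str, float]
Model = Callable[[str], float]

# Approximate masses in kg, standing in for "the real world".
REAL_WORLD: World = {"earth": 5.97e24, "mars": 6.42e23, "neptune": 1.02e26}
CHECKABLE = ["earth", "mars"]  # accessible questions: answers can be verified
R_STAR = "neptune"             # the question we hope is reachable via transfer


def honest(q: str) -> float:
    """Reports the true value for every question."""
    return REAL_WORLD[q]


def bad(q: str) -> float:
    """Correct wherever it can be checked, misleading everywhere else."""
    return REAL_WORLD[q] if q in CHECKABLE else 100.0


def constant(q: str) -> float:
    """A very simple model that does not even fit the checkable data."""
    return 0.0


def simplest_consistent(candidates: List[Tuple[int, Model]]) -> Model:
    """Return the lowest-complexity model that matches every checkable answer.

    The integer in each pair is an assumed complexity score, not a measurement.
    """
    for _, model in sorted(candidates, key=lambda c: c[0]):
        if all(model(q) == REAL_WORLD[q] for q in CHECKABLE):
            return model
    raise ValueError("no candidate fits the checkable data")


def transfers_correctly(candidates: List[Tuple[int, Model]]) -> bool:
    """True if the simplest consistent model also gets r* right.

    This comparison uses ground truth for r*, which the person relying on
    transfer does not have.
    """
    return simplest_consistent(candidates)(R_STAR) == REAL_WORLD[R_STAR]
```

With the assumed scores `[(1, constant), (5, honest), (9, bad)]` the selected model is `honest` and the Neptune question transfers correctly; with `[(1, constant), (9, honest), (5, bad)]` the selected model is `bad` and it answers 100.0. The two selected models agree on every checkable question, so nothing in the checkable data tells the two cases apart.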
That makes sense, though I feel like under that definition, having the things you care about be accessible via transfer wouldn’t actually help you much unless you know that the model transfers correctly there, since otherwise you’d have no reason to trust the transfer (even if it’s actually correct). Unless you have some reason to believe otherwise (e.g. some strong robustness guarantee), it seems to me that in most cases you have to assume that all the information you get via transfer is suspect, which makes even the correct transfer inaccessible in some sense, since you can’t distinguish it from the incorrect transfer.
I guess if lots of actors just try to use transfer anyway and hope, then those actors whose values actually are accessible via transfer will be advantaged. But unless you have a particular reason to suspect that your values will be more accessible than average, it seems like in most cases you wouldn’t want to pursue that strategy unless you had no other option (and I guess the point is that our values are less likely to be accessible via transfer than most AIs’ values).