This post defines and discusses an informal notion of “inaccessible information” in AI.
AIs are expected to acquire all sorts of knowledge about the world in the course of their training, including knowledge only tangentially related to their training objective. The author proposes to classify this knowledge into “accessible” and “inaccessible” information. In my own words, information inside an AI is “accessible” when there is a straightforward way to set up a training protocol that will incentivize the AI to reliably and accurately communicate this information to the user. Otherwise, it is “inaccessible”. This distinction is meaningful because, by default, the inner representation of all information is opaque (e.g. weights in an ANN) and notoriously hard to make sense of by human operators.
The primary importance of this concept is in the analysis of competitiveness between aligned and unaligned AIs. This is because it might be that aligned plans are inaccessible (since it’s hard to reliably specify whether a plan is aligned), whereas certain unaligned plans are accessible (e.g. because it’s comparatively easy to specify whether a plan produces many paperclips). The author doesn’t mention this, but I think there is also another reason, namely that unaligned subagents effectively have access to information that is inaccessible to us.
More concretely, approaches such as IDA and debate rely on leveraging certain accessible information: for debate it is “what would convince a human judge”, and for IDA-of-imitation it is “what would a human come up with if they thought about this problem for such and such time”. But this accessible information is only a proxy for what we actually care about (“how to achieve our goals”). Even assuming this proxy doesn’t produce Goodharting, we are still left with a performance penalty for the indirection. That is, a paperclip maximizer reasons directly about “how to maximize paperclips”, leveraging all the information it has, whereas an IDA-of-imitation only reasons about “how to achieve human goals” via the information it has about “what would a human come up with”.
The author seems to believe that finding a method to “unlock” this inaccessible information would solve the competitiveness problem. I, on the other hand, am more pessimistic: I consider it likely that there is an inherent tradeoff between safety and performance, so any such method would either expose another attack vector or introduce another performance penalty.
The author himself says that “MIRI’s approach to this problem could be described as despair + hope you can find some other way to produce powerful AI”. I think my approach is despair(ish) + a different hope. Namely, we need to ensure a sufficient period during which (i) aligned superhuman AIs are deployed and (ii) no unaligned transformative AIs are deployed, and leverage that period to set up a defense system. That said, I think the concept of “inaccessible information” is interesting, and thinking about it might well produce important progress in alignment.