ALBA requires incremental design of good long-term memory systems

Summary: ALBA is an approach to training aligned AGI. One problem with implementing ALBA is that some actions that must be overseen involve storing information that will be used in future episodes. To oversee these actions, it is necessary for the operators to know which information is worth storing.


(thanks to Ryan for helping me work through some of these ideas)

Recommended before reading this post: Not just learning

ALBA requires the operator to oversee a learning system’s performance in each episode (say, by assigning a score). Usually, the operator would like information to be stored between episodes; for example, they might want to store a photo from the robot’s sensors, or they might want to run a large computation that takes more than one episode to perform. Under ALBA, the operator must be able to provide good feedback about how useful a certain piece of information is to store.

In the case of physical observations (e.g. photos), it usually seems fine to just store everything. But in the case of “logical” information such as the results of computations, it isn’t possible to store everything, since that would require running all computations. So the operators will need some idea of which logical information is most useful to store (i.e. which computations are most useful to run now and cache the result of).
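To make that judgment concrete, here is a minimal sketch (with entirely made-up names like `Overseer` and `LogicalMemory`; nothing here is part of ALBA itself) of the interface being asked of the operator: in each episode the learner proposes computation results to cache, and the operator’s estimate of their future value decides which ones persist into later episodes.

```python
# Sketch only: which "logical" results get cached between episodes.
# All names and scoring rules here are invented for illustration.

from dataclasses import dataclass, field


@dataclass
class Overseer:
    """Stand-in for the human operator: scores episode performance and
    judges how useful each proposed cached computation would be later."""
    storage_threshold: float = 0.5

    def score_episode(self, transcript: str) -> float:
        # A real overseer would inspect the transcript; stubbed out here.
        return 1.0

    def storage_value(self, key: str, result: str) -> float:
        # The operator's guess at how useful caching `result` will be in
        # future episodes -- the judgment this post is about.
        return 0.8 if "reusable" in key else 0.2


@dataclass
class LogicalMemory:
    cache: dict = field(default_factory=dict)

    def maybe_store(self, overseer: Overseer, key: str, result: str) -> None:
        # Only keep results the overseer judges worth their storage cost.
        if overseer.storage_value(key, result) >= overseer.storage_threshold:
            self.cache[key] = result


def run_episode(memory: LogicalMemory, overseer: Overseer, episode_id: int) -> float:
    # The learner does some work (possibly reusing memory.cache) and proposes
    # new computation results to store for later episodes.
    proposed = {
        f"reusable-lemma-{episode_id}": "result",
        f"scratch-{episode_id}": "result",
    }
    for key, result in proposed.items():
        memory.maybe_store(overseer, key, result)
    return overseer.score_episode(transcript=f"episode {episode_id}")


memory, overseer = LogicalMemory(), Overseer()
scores = [run_episode(memory, overseer, i) for i in range(3)]
print(sorted(memory.cache))  # only entries judged worth storing survive
```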

How much is long-term memory tied to cognitive architecture?

One would hope that the operators can do this without essentially already understanding how to program an aligned AGI. Perhaps humans can make pretty good guesses about which information is useful to store without knowing much about the underlying learning system.

In the worst case, though, the optimal logical information for an AGI system to store in the long term strongly depends on its cognitive architecture. For example, if two different humans are studying for the same test, they will probably read different material and do different cognitive work in the process of studying; if one human somehow had access to the other’s cognitive work, it probably wouldn’t be that useful, since it would consist of memories in a “different mental language”. At the same time, humans do seem able to come up with pretty good collective memories over the long term (e.g. in the form of books), although books are substantially less efficient than personal notes because they have to be understood by more than one human.

Under uncertainty about the right cognitive architecture for the AGI system to use over the long term, we could just store the information that all architectures think is useful. If we are uncertain between $n$ different architectures, then this multiplies the cost of long-term thinking and memory by at most a factor of $n$. Hopefully, if our uncertainty about the right architecture has structure, we can do much less work while still mostly satisfying each cognitive architecture.
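A small sketch of the “store the union” idea, with made-up architectures and storage requests; the only point is that the cost is at most $n$ times what one architecture would need, and overlap between requests brings it below that bound.

```python
# Sketch of storing the union of what each candidate cognitive architecture
# asks for. The architectures and their requests are invented; the point is
# only the cost bound.

from typing import Dict, Set


def union_storage(requests_by_architecture: Dict[str, Set[str]]) -> Set[str]:
    """Store everything that at least one candidate architecture wants kept."""
    stored: Set[str] = set()
    for requests in requests_by_architecture.values():
        stored |= requests
    return stored


requests = {
    "arch_a": {"theorem_1", "dataset_summary"},
    "arch_b": {"theorem_1", "planning_trace"},
    "arch_c": {"dataset_summary", "world_model_stats"},
}

stored = union_storage(requests)
naive_bound = len(requests) * max(len(r) for r in requests.values())
print(f"{len(stored)} items stored; the n-times-one-architecture bound is {naive_bound}")
```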

This problem can be solved incrementally

At the start, the operators must decide which information fairly weak learning systems should store. Later in the bootstrapping process, the operators (assisted by the weak learning system they just trained) must decide which information stronger learning systems should store. Thus, the operators don’t need to initially know which information is good to store for strong learning systems. This is definitely a good reason for optimism: optimism about bootstrapping in general should translate to optimism about having the bootstrapping process decide which information is good to store.

Still, this seems to me like one of the “most magic” parts of ALBA. If ALBA is hiding most of the alignment problem somewhere, “figuring out which information to store between episodes” is a pretty good guess.

Solving this problem in full generality is hard

To solve this problem in full generality, we would need a good guess ahead of time (before we can get much feedback) about what kinds of computations we will want to run in the future. Intuitively, this seems at least as hard as “having a good physical/logical prior”. If we’re using an indirect method like ALBA, then we’re probably already pessimistic about specifying good physical/logical priors in any way other than deferring to assisted humans. In this case we should not expect to solve the full problem before actually having access to AGI (although of course we could still make progress in understanding aspects of it ahead of time).

What about “learning to learn”?

In some cases, it is possible to get feedback about which information should be kept around. For example, in one paper, researchers trained a deep learning system to make updates to a neural network’s parameters that improve performance. This is a special case of “figuring out which information should be stored between episodes”.
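For intuition only, here is a drastically simplified stand-in for that setup: instead of training a neural-network updater by gradient descent as in the paper, it selects a single step size by brute-force evaluation across a handful of toy episodes, and then reuses that updater on a fresh episode. Every name and design choice below is mine, chosen to keep the example tiny.

```python
# Toy "learning to learn": choose the parameter of an update rule (here just a
# step size) by how well it performs across several episodes. A deliberately
# simplified stand-in for learned neural-network optimizers, not a reproduction.

import random


def run_inner_episode(step_size: float, target: float, steps: int = 20) -> float:
    """One episode: minimize (w - target)**2 with a fixed-step update rule."""
    w = 0.0
    for _ in range(steps):
        grad = 2.0 * (w - target)
        w -= step_size * grad        # the update rule whose parameter we meta-learn
    return (w - target) ** 2         # final loss for this episode


def meta_learn_step_size(candidates, num_episodes: int = 10, seed: int = 0) -> float:
    """Pick the update-rule parameter that works best across many episodes."""
    rng = random.Random(seed)
    targets = [rng.uniform(-5.0, 5.0) for _ in range(num_episodes)]
    return min(candidates, key=lambda s: sum(run_inner_episode(s, t) for t in targets))


# The selected updater is the "information stored between episodes": it was
# produced from past episodes and is reused, unchanged, in a fresh one.
best = meta_learn_step_size([0.01, 0.05, 0.1, 0.3, 0.9])
print("selected step size:", best, "| loss on a new episode:", run_inner_episode(best, target=3.0))
```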

I don’t think “learning to learn” changes the nature of the problem. When training the parameter-updater, it is only possible to use information from a limited horizon (say, $k$ episodes) while preserving good statistical guarantees. So we might as well group episodes into blocks of size $k$, and then consider a modified learning problem in which each episode is a block of $k$ episodes of the original problem. At this point, the humans must oversee $1/k$ times as many episodes as they did originally, but $k$ times as much work is being done in each episode.
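A sketch of the blocking reduction, assuming a hypothetical `run_episode` interface: information flows freely between the $k$ original episodes inside a block, while anything carried across block boundaries still needs the operator’s storage judgment.

```python
# Illustrative blocking reduction; the episode representation and interfaces
# are made up, not taken from any ALBA implementation.

from typing import Callable, Dict, List


def block_episodes(episodes: List[str], k: int) -> List[List[str]]:
    """Group the original episodes into blocks of size k; each block becomes
    one (larger) episode of the modified learning problem."""
    return [episodes[i:i + k] for i in range(0, len(episodes), k)]


def run_block(block: List[str], run_episode: Callable[[str, Dict], float]) -> float:
    # Inside a block, information flows freely between the k original episodes
    # via `scratch` (this is where "learning to learn" operates); anything
    # carried across block boundaries still needs the operator's judgment.
    scratch: Dict = {}
    return sum(run_episode(episode, scratch) for episode in block)


episodes = [f"episode-{i}" for i in range(12)]
blocks = block_episodes(episodes, k=4)
print(len(blocks), "blocked episodes instead of", len(episodes))
print(run_block(blocks[0], run_episode=lambda episode, scratch: 1.0))
```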

Reducing the number of episodes seems to make the problem somewhat easier (since there are fewer episodes to store information between), and “learning to learn” provides an intuition about why it might be possible to get good performance in these larger episodes.

I expect it will be slightly harder to oversee a few large episodes than to oversee lots of small episodes. As a general rule, to get good performance on $n$ episodes, it is necessary for the operator to oversee roughly $n^{2/3}$ of them (see regret bounds for partial monitoring; I’m setting the amount of oversight so that oversight work and other forms of regret are of similar magnitudes). Let’s assume that oversight time per episode is roughly proportional to the length of the episode. So if we block episodes into groups of size $k$, the oversight time is now on the order of $(n/k)^{2/3} \cdot k = n^{2/3} k^{1/3}$. Therefore, this blocking multiplies the amount of oversight work required by roughly $k^{1/3}$.
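Spelling out that arithmetic under the $n^{2/3}$ oversight scaling assumed above:

$$\underbrace{\left(\frac{n}{k}\right)^{2/3}}_{\text{episodes overseen}} \cdot \underbrace{k}_{\text{time per overseen episode}} \;=\; n^{2/3} k^{1/3}, \qquad \frac{n^{2/3} k^{1/3}}{n^{2/3}} \;=\; k^{1/3}.$$

For example, blocking with $k = 1000$ costs only a factor of $10$ in oversight work.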

(note that when introducing a blocking structure, it’s important not to unintentionally introduce incentives to manipulate future episodes in the same block. For example, if each original episode consists of answering a single question, and the answer to one episode can affect which question is asked in the next one, then the learning system might give manipulative answers that cause easier questions to be asked in the future. To avert this incentive, it is necessary to fix the questions ahead of time, or to use a different training procedure to explicitly minimize regret compared with alternative policies running on the same list of questions).
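Here is a minimal sketch of the “fix the questions ahead of time” option (the function and question names are made up for illustration): all $k$ questions in a block are sampled before any answers are produced, so nothing the learner says inside the block can influence which questions it is later asked.

```python
# Sketch: pre-commit to a block's questions so that answers inside the block
# cannot steer which questions come later, removing the manipulation incentive
# described above. Names are illustrative, not from any real codebase.

import random
from typing import Callable, List, Tuple


def run_block_with_fixed_questions(
    question_pool: List[str],
    k: int,
    answer: Callable[[str], str],
    rng: random.Random,
) -> List[Tuple[str, str]]:
    # Sample all k questions BEFORE producing any answers. Contrast with an
    # adaptive scheme where question i+1 is chosen after seeing answer i,
    # which is the channel a manipulative answerer could exploit.
    questions = rng.sample(question_pool, k)
    return [(q, answer(q)) for q in questions]


pool = [f"question-{i}" for i in range(20)]
transcript = run_block_with_fixed_questions(
    pool, k=4, answer=lambda q: f"answer to {q}", rng=random.Random(0)
)
print([q for q, _ in transcript])
```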