One Minute Every Moment
About how much information are we keeping in working memory at a given moment?
“Miller’s Law” states that the number of things humans can hold in working memory is “the magical number 7±2”. This idea is derived from Miller’s experiments, which tested both random-access memory (where participants must remember call-response pairs, and give the correct response when prompted with a call) and sequential memory (where participants must memorize and recall a list in order). In both cases, 7 is a good rule of thumb for the number of items people can recall reliably.[1]
Miller noticed that the number of “things” people could recall didn’t seem to depend much on the sorts of things people were being asked to recall. A random numeral contains about 3.3 bits of information, while a random letter contains about 4.7; yet people were able to recall about the same number of numerals or letters.
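The per-symbol figures can be checked directly. This is a minimal sketch, assuming each symbol is drawn uniformly at random from its alphabet (10 numerals, 26 letters, ignoring case); the function name is only illustrative.

```python
import math

def bits_per_symbol(alphabet_size: int) -> float:
    """Information in one uniformly random symbol, in bits."""
    return math.log2(alphabet_size)

digit_bits = bits_per_symbol(10)   # decimal numerals 0-9
letter_bits = bits_per_symbol(26)  # letters a-z, ignoring case

print(f"numeral: {digit_bits:.2f} bits, letter: {letter_bits:.2f} bits")
```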
Miller concluded that working memory should not be measured in bits, but rather in “chunks”; this is a word for whatever psychologically counts as a “thing”.
This idea was further reinforced by memory athletes, who gain the ability to memorize much longer strings of numbers through practice. A commonly-repeated explanation is as follows: memory athletes are not increasing the size of their working memory; rather, they are increasing the size of their “chunks” when it comes to recalling strings of numbers specifically.[2] For someone who rarely needs to recall numbers, individual numerals might be “chunks”. For someone who recalls numbers often due to work or hobby, two or three-digit numbers might be “chunks”. For a memory athlete who can keep hundreds of digits in mind, perhaps sequences of one hundred digits count as a “chunk”.[3]
However, if you’re like me, you probably aren’t quite comfortable with Miller’s rejection of bits as the information currency of the brain. The brain isn’t magic. At some level, information is being processed.
I’ll run with the idea that chunking is like Huffman codes. Data is compressed by learning a dictionary mapping from a set of “codewords” (which efficiently represent the data) to the decompressed representation. For example, if the word “the” occurs very frequently in our data, we might assign it a very short codeword like “01”, while rare words like “lit” might get much longer codewords such as “1011010”.
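To make the dictionary idea concrete, here is a small Huffman coder in Python. It is a sketch only: the word counts are made up for illustration, and nothing here is a claim about how the brain actually builds its codes.

```python
import heapq

def huffman_code(freqs):
    """Build a Huffman code: map each symbol to a string of '0'/'1'."""
    if len(freqs) == 1:
        # A lone symbol still needs a one-bit codeword.
        return {sym: "0" for sym in freqs}
    # Heap entries: (total weight, tie-breaker, {symbol: codeword so far}).
    # The integer tie-breaker keeps Python from ever comparing the dicts.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codewords.
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

# Hypothetical word counts, chosen only for illustration.
word_counts = {"the": 50, "of": 25, "memory": 12, "chunk": 8, "lit": 1}
codes = huffman_code(word_counts)
for word, code in sorted(codes.items(), key=lambda kv: len(kv[1])):
    print(f"{word!r}: {code}")
```

With these counts, the frequent word “the” ends up with a one-bit codeword while the rare word “lit” gets a four-bit one, which is the pattern described above.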
A codeword is sort of like a chunk; it’s a “thing” in terms of which we compress. However, different code-words can contain different amounts of information, suggesting that they take up different amounts of space in working memory.[4]
According to this hypothesis, when psychologists such as Miller ask people to remember letters or numbers, the codeword size is about the same, because we’re asked to recall individual letters about as often as individual numbers. We don’t suddenly adapt our codeword dictionary when we’re asked to memorize a sequence of 0s and 1s, so that our memory can store the sequence efficiently at one-bit-per-bit; instead, we use our native representation, which represents “0” and “1” via codewords which are about as long as the codewords for “5” and “j” and so on.
In effect, Miller was vastly underestimating working memory size via naive calculations of size in terms of bits. A string of seven numbers would contain 3.3 * 7 = 23.1 bits of information if stored at maximal efficiency for the number-remembering task. A string of seven letters would instead contain 4.7 * 7 = 32.9 bits, under a similar optimality assumption. But people don’t process information in a way that’s optimized for psychology experiments; they process information in a way that’s optimized for normal life. So, these two estimates of the number of bits in working memory are allowed to be very different from each other, because all we can say is that they are both lower bounds for the number of bits stored by working memory.
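The lower-bound reasoning can be sketched in a few lines of Python. The assumptions are mine: a span of exactly 7 items and uniformly random symbols. Using unrounded logarithms, the numeral bound comes out slightly above the rounded figure in the text.

```python
import math

SPAN = 7  # Miller's rule-of-thumb number of items

# Bits recalled if each item were stored at maximal efficiency for its task.
bounds = {
    "binary digits": SPAN * math.log2(2),
    "numerals": SPAN * math.log2(10),
    "letters": SPAN * math.log2(26),
}

for task, bits in bounds.items():
    print(f"{task}: at least {bits:.1f} bits")

# Each task only gives a lower bound, so the largest one is the tightest.
best_lower_bound = max(bounds.values())
```

Harder-to-compress material gives a higher bound, which is why the estimates are free to disagree: none of them measures the capacity itself.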
We can get a more accurate estimate by returning to the memory athletes. Although some memory athletes might start with a biological advantage, let’s assume that it’s mostly a matter of learning to chunk numbers into larger sequences—that is, learning an encoding which prioritizes number sequences more highly.
There will always be more to a memory athlete’s life than memorizing numbers, so their chunk encodings will never become perfectly efficient for the number-memorization task. However, maybe a memory athlete who devotes about half of their time to memorizing numbers will approach a code with only 1 bit of inefficiency for the number-memorizing task (representing a 50-50 binary decision between representing sequences of numbers vs anything else). So we might expect top memory athletes to give us a fairly tight lower bound on the number of bits in working memory.[5]
Andrea Muzii memorized a 630-digit number within five minutes. The most efficient encoding we can have for random digits takes about 3.3 bits per digit; rounding down to 3.2 keeps the estimate conservative, and suggests that Andrea Muzii has at least 630 * 3.2 = 2016 bits of working memory.[6] (If Muzii was storing 7 chunks in working memory, this would suggest chunks of about 288 bits each.)
How can we visualize 2016 bits of information?
Human languages have an information rate of about 39 bits per second, so if we pretend that working memory is written entirely in natural language, this corresponds to about 52 seconds of natural language. Dividing by 7±2, we get five to ten seconds per chunk.
So, according to this estimate, if we could freeze-frame a single moment of our working memory and then explain all of the contents in natural language, it would take about a minute to accomplish.[7]
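The arithmetic behind the one-minute figure can be reproduced with a short Python sketch. The constants are the ones used above (630 digits, 3.2 bits per digit, 39 bits per second of speech); the variable names are only illustrative.

```python
DIGITS = 630               # length of the memorized number
BITS_PER_DIGIT = 3.2       # rounded figure used above; log2(10) is about 3.32
SPEECH_BITS_PER_SEC = 39   # information rate of natural language cited above

total_bits = DIGITS * BITS_PER_DIGIT
speech_seconds = total_bits / SPEECH_BITS_PER_SEC

print(f"total: {total_bits:.0f} bits, about {speech_seconds:.0f} s of speech")

# Spread the total over 7±2 chunks.
for chunks in (5, 7, 9):
    print(f"{chunks} chunks: {total_bits / chunks:.0f} bits each, "
          f"{speech_seconds / chunks:.1f} s of speech each")
```

At 5, 7, and 9 chunks this gives roughly 10, 7, and 6 seconds of speech per chunk, matching the “five to ten seconds” range.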
[1]
If the sequential vs random access memories were significantly different, we would want to treat them differently. But, who knows what’s going on under the hood. For the purposes of this post, I’m more-or-less pretending that working memory is a flat array. This might not be anything like the truth.
Perhaps memories are stored as link structures, so random-access can be implemented as binary trees, and sequential can be implemented as linked lists. In that case, the two would be pretty similar in nature. Or perhaps sequences are stored in the auditory modality, which can “play back” in sequence, while random-access gets stored in visual memory, which can be accessed “at a glance”. In this case, they’d be using entirely different hardware. Perhaps it depends on the person!
[2]
Epistemic status: hazy memory from psychology classes over a decade ago.
[3]
I believe the main evidence for this hypothesis is the way memory experts improve their ability to memorize stuff they practice, like sequences of numbers, without improving their ability to memorize other sorts of things.[2] If total working memory was being increased, they should be able to memorize anything.[8]
[4]
The variable length of the codewords isn’t really the important thing, here. I’m just mentioning Huffman codes because it’s a particularly common version of the dictionary-coding idea. My main point is to imagine “chunks” as codewords in a dictionary code.
However, the idea that some chunks have a higher probability than other chunks feels intuitive. Variable-length codewords imply variable-probability chunks, while fixed-length codes imply that all chunks have the same probability.
It’s easier to imagine variable-length codewords shifting up and down in probability as we learn; it’s harder to imagine fixed-length codewords growing and shrinking in length to shift the probabilities of sequences as we learn.
[5]
This is, obviously, wild speculation.
[6]
Several commenters have questioned my use of “working memory” here. For example, it’s plausible that Muzii was able to store significant portions of the information in long-term memory within the five minutes. So, some comments on what I mean by “working memory” in this essay.
I see three approaches to reasonable definitions of “working memory”.
The first is intuitive/informal: “working memory” refers to memories which can be “kept in mind”. If I “get distracted” by a phone call, I might “forget what I was doing”, even though I may be able to recall what I was working on by consulting “long-term memory”—this suggests there’s some kind of “short-term” or “working memory” which has to do with what I was working on.
(Short-term memory and working memory are often treated as synonyms in the literature, according to Wikipedia.)
The second approach is operational: define some tests of working memory size, such as those Miller employed. “Working memory” becomes whatever those tests measure.
The third approach is more neurological: based on our understanding of the brain, define “working memory” to point at specific elements of the information processing of the brain. Which elements? Well, we would base the selection on a combination of the first two approaches: which elements of the brain’s information processing seem to fit the informal idea of working memory? Which brain-regions are active during the experiments which form the operational definition?
I take an operational approach.
I see several reasonable approaches to operational definitions of short vs long-term memory: storage time (how long it takes to form a memory), recall time (how long it takes to retrieve a memory), and volatility (how long the memory sticks around—particularly if it’s not being actively maintained, that is, if we’re distracted by something else). All of these concepts are important, and each could individually be taken as a definition of “short-term vs long-term”.
The term “working memory” was coined by Miller. I think a reasonable operational definition, therefore, is “whatever Miller was measuring”.
However, I would like to further clarify this operationalization by explicitly indexing by the timespans involved in the experiment. A specific experimental setup can give people x seconds to memorize, y seconds between memorization and recital during which the memory must be maintained, and z seconds for recital. (z seems like the less important of these three variables, but, is still an important variable.)
So when I use Muzii for my final calculation, I’m really talking about a specific operationalized version of working memory where x = 5 minutes. This is going to be different from x = 4 minutes or x = 6 minutes, whereas by simply calling it “working memory”, I risk conflating.
Therefore, my punchline “one minute every moment” could more accurately be revised to “we recall about one minute of information from five minutes of trying to absorb information, if asked to recall very shortly after”.
(I’m not sure what the y-value was for Muzii’s world record that I cite.)
[7]
A core premise of this post is that it makes sense to talk about “working memory” as a relatively fungible information-processing resource within the brain. However, in reality it’s more like we have several scratchpads for different types of data: auditory working memory is called the “phonological loop” and stores a specific amount of sound; visual working memory is called the “visuospatial sketchpad”; and so on.
It’s unclear how, exactly, to connect this to the 7±2 number.
Different people vary in their conscious access to the different sensory working memories. Perhaps most people focus on one particular sensory modality for their conscious working memory. It could be that 7±2 represents what people can store in this primary working memory, with part of the variation explained by which type of working memory they use and how suitable it is to the particular memory task.
For someone who thinks in sounds, the highest bit-density would be achieved when remembering actual sound. Other types of information would have to be encoded in sound (probably as spoken language), which would be relatively inefficient.
For someone who thinks in images, the highest bit-density would be achieved when remembering images (video?), and other things would have to be encoded (as icons? written text? geometry? some combination?).
Memorization tasks might measure the lower-bit-density mode, where we’re encoding stuff more indirectly. On the other hand, maybe that’s the main thing we care to measure!
Another big caveat to this whole post is the idea that we might have multiple different levels of memory, ranging from very short-term to very long-term. When memorizing a sequence, we could store different pieces at different levels; perhaps we fill up our shortest-term memory, and then store some pieces in a medium-term memory while we keep juggling the shortest-term memory by rehearsal to prevent forgetting.
In other words, we should avoid making overly naive assumptions about the information processing architecture of the brain without justification.
[8]
It seems like this should be confounded, though, if the memory athlete finds a way to encode other information into the format they’ve practiced; EG, memorizing random letters by encoding them as random numbers.
This is similar to the memory-palace technique, in which a memory expert memorizes arbitrary information by forming associations with some standard sequence which has already been memorized.
I expect this technique to confound the notion that memory experts only have increased memory for things they’ve practiced; memory experts who have practiced such a technique should be able to display similar feats of memory for things they haven’t practiced.
However, although it creates the appearance of a more global increase in working memory, rather than narrow gains only due to “chunking” in oft-practiced domains, this technique should not actually overcome the normal information-processing constraints of the brain. Encoding arbitrary information as strings of numbers will be more or less efficient based on the encoding we come up with. For events in everyday life, my expectation is that our existing encoding would be about as efficient as possible, so there would be no gains to memory by the act of encoding these events to numbers, even for a memory athlete.
Although, the act of performing such an encoding might signal to the brain that the information is important to recall. So, some gains might be observed anyway.
I’m not sure what the takeaway is here, but these calculations are highly suspect. What a memory athlete can memorize (in their domain of expertise) in 5 minutes is an intricate mix of working memory, long-term semantic memory, and episodic (hippocampal) memory.
This is a very deep topic. Reading comprehension researchers have estimated the size of working memory as “unlimited”, but that’s obviously specific to their methods of measurement.
Modern debates put working memory capacity at 1-4 items. The figure of 7 was specific to what is now known as the phonological loop, which involves subvocally reciting numbers. The strong learned connections between auditory cortex and verbal motor areas give this a slight advantage over working memory for material that hasn’t been specifically practiced a lot.
See the concept of exformation, incidentally from one of the best books I’ve found on consciousness. The bits of information encoded by a signal to a sophisticated system are intricately intermixed with that system’s prior learning. It’s a type of compression. Not making a call at a specific time can encode a specific signal of unlimited length, if sender and receiver agree to that meaning.
Sorry for the lack of citations. I’ve had my head pretty deeply into this stuff in the past, but I never saw the importance of getting a precise working memory capacity estimate. The brain mechanisms are somewhat more interesting to me, but for different reasons than estimating capacity (they’re linked to goals and reward system operation, since working memory for goals and strategy is probably how we direct behavior in the short term).
I’m kind of fine with an operationalized version of “working memory” as opposed to a neuroanatomical concept. For practical purposes, it seems more useful to define “working memory” in terms of performance.
(That being said, the model which comes from using such a simplified concept is bad, which I agree is concerning.)
As for the takeaway, for me the one-minute number is interesting both because it’s kind of a lot, but not so much. When I’m puttering around my house balancing tasks such as making coffee, writing on LessWrong, etc., I have roughly one objective or idea in conscious view at a time, but the number of tasks and ideas swirling around “in circulation” (being recalled every so often) seems like it can be pretty large. The original idea for this post came from thinking about how psychology tasks like Miller’s seem more liable to underestimate this quantity than overestimate it.
On the other hand, it doesn’t seem so large. Multi-tasking significantly degrades task performance, suggesting that there’s a significant bottleneck.
The “about one minute” estimate fits my intuition: if that’s the amount of practically-applicable information actively swirling around in the brain at a given time (in some operationalized sense), well. It’s interesting that it’s small enough to be easily explained to another person (under the assumption that they share the requisite background knowledge, so there’s no inferential gap when you explain what you’re thinking in your own terms). Yet, it’s also ‘quite a bit’.
Why not just make up a new word for the concept you’re actually talking about?
I’ve found that “working memory” was coined by Miller, so actually it seems pretty reasonable to apply that term to whatever he was measuring with his experiments, although other definitions seem quite reasonable as well.
Vastly more work has been done since then, including refined definitions of working memory. It measures what he thought he was measuring, so it is following his intent. But it’s still a bit of a chaotic shitshow, and modern techniques are unclear on what they’re measuring and don’t quite match their stated definitions, too.
When I took classes in cog sci, this idea of “working memory” seemed common, despite coexistence with more nuanced models. (IE, speaking about WM as 7±2 chunks was common and done without qualification iirc, although the idea of different memories for different modalities was also discussed. Since this number is determined by experiment, not neuroanatomy, it’s inherently an operationalized concept.) Perhaps this is no longer the case!
Per your footnote 6, I wouldn’t expect that the whole 630-digit number was ever simultaneously in working memory.
IIUC, at least some “memory athletes” use navigational memory (via “memory palace”)…
My neuroscience knowledge says that navigational memory heavily involves the hippocampus, and (IIRC) the hippocampus can do one-shot long-term (or at least long-ish-term) storage of information, with a massive capacity.
My common sense / experience says that: if I’m playing a video game and there are 20 subsequent screens, each with memorable features, and someone shows me a route through all those screens (“at the screen with the turtle, go left by the statue. And then you get to the screen with the robot, and you go up by the snake, …”), even if I’ve only seen it once and certainly if I’ve seen it 2 or 3 times, I can trace the route afterwards. But I never had the whole route simultaneously in working memory. Rather, I see one screen, and it jogs my memory for what to do next, and then I see the next screen, and it jogs my memory, etc.
Maybe a better example: if I’m memorizing a tune, I’m relying on each part of the song to jog my memory for the next part of the song. The “jog my memory” action presumably involves pulling information out of a kind of storage, i.e. information that is not already in working memory.
How would you like to define “simultaneously in working memory”?
The benefit of an operationalization like the sequential recall task is concreteness and easily tested predictions. I think if we try to talk about the actual information content of the actual memory, we can start to get lost in alternative assumptions. What, exactly, counts as actual working memory?
One way to think about the five-minute memorization task which I used for my calculation is that it measures how much can be written to memory within five minutes, but it does little to test memory volatility (it doesn’t tell us how much of the 630-digit number would have been forgotten after an hour with no rehearsal). If by “short-term memory” we mean memory which only lasts a short while without rehearsal, the task doesn’t differentiate that.
So, “for all we know” from this test, the information gets spread across many different types of memory, some longer-lasting and some shorter-lasting. This is one way of interpreting your point about the 630 digits not all being in working memory.
According to this way of thinking, we can think of the 5 minute memorization period as an extended “write” task. The “about one minute” factoid gets re-stated as: what you can write to memory in five minutes, you could explain in natural language in about one minute, if performing about optimally, and assuming you don’t need to fill in any background context for the explanation.
“5 minutes of study lets you capture, at best, 1 minute of spoken material” sounds much less impressive than my one-minute-per-moment headline.
However, this way of thinking about it makes it tempting to think that the memory athlete is able to store a set number of bits into memory per second studying; a linear relationship between study time and the length of sequences which can be recalled. I doubt the relationship is that simple.
The spaced repetition literature suggests a model based on forgetting curves, where the number and pattern of times we’ve reviewed a specific piece of information determines how long we’ll recall it. In this model, we don’t so much think of “short term memory” and “long term memory” capacity, instead focusing on the expected durability of specific memories. This expected durability increases in an understood way with practice.
In contrast to the simple “write to memory” model, this provides a more detailed (and, I think, plausible) account of what goes on during the 5 minutes one has to rehearse: a memory athlete would, presumably, rehearse the sequence, making memories more robust to the passage of time via repetition.
In this paradigm, keeping a set of information “in working memory” means rehearsing it on a spaced-repetition schedule such that you recall each fact before you forget it. The details of the forgetting curve would enable a prediction for how many such facts can be memorized given an amount of study time.
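To show what such a prediction could look like, here is a toy Python sketch. Everything in it is an assumption of mine rather than something from the spaced-repetition literature: recall probability decays as exp(-t / stability), each review multiplies stability by a fixed factor, every review costs the same number of seconds, and the parameter values are invented for illustration.

```python
import math

def reviews_needed(horizon_s, first_stability_s, growth, threshold):
    """Exposures one fact needs so that recall probability never drops
    below `threshold` before `horizon_s` seconds have passed."""
    if not 0.0 < threshold < 1.0 or growth < 1.0 or first_stability_s <= 0.0:
        raise ValueError("need 0 < threshold < 1, growth >= 1, stability > 0")
    t = 0.0
    stability = first_stability_s
    reviews = 1  # the initial exposure
    while True:
        # Time until exp(-interval / stability) falls to the threshold.
        interval = -stability * math.log(threshold)
        if t + interval >= horizon_s:
            return reviews
        t += interval
        stability *= growth  # assumed: each review makes the memory more durable
        reviews += 1

def facts_maintainable(study_budget_s, seconds_per_review, **model):
    """Facts that fit in the budget if every review costs the same time."""
    per_fact_s = reviews_needed(**model) * seconds_per_review
    return int(study_budget_s // per_fact_s)

# Hypothetical parameters for a five-minute memorization period.
n_facts = facts_maintainable(
    300, 2.0,
    horizon_s=300, first_stability_s=20.0, growth=2.0, threshold=0.9,
)
print(n_facts)
```

The point is only the shape of the calculation: the forgetting curve fixes how many rehearsals each fact needs, and the study time divided by that cost bounds the number of facts.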
The natural place to bring up “chunks” here is the amount of information that can fit in an individual memory (a single “fact”). It no longer makes sense to talk about the “total information capacity of short-term memory”, since memory is being modeled on a continuum from short to long, and a restricted capacity like 7±2 is not really part of this type of model. Without running any detailed math on this better sort of model, I suppose the information capacity of a memory would come out close to the “five to ten seconds of spoken language per chunk” which we get when we apply information theory to the Miller model.
This model also has many problems, of course.
Yeah this website implies that it’s sublinear—something like 50% more content when they get twice as long to study? Just from quickly eyeballing it.
I still feel like you’re using the term “working memory” in a different way from how I would use it. Suppose you have 30 minutes to study a list of numbers. You first see Item X and try to memorize it in minute 3. Then you revisit it in minute 9, and it turns out that you’ve already “forgotten it” (in the sense that you would have failed a quiz) but it “rings a bell” when you see it, and you try again to memorize it. I think you’re still benefitting from the longer forgetting curve associated with the second revisit of Item X. But Item X wasn’t “in working memory” in minute 8, by my definitions.
(Note that I don’t know the details of how memory athletes spend their 30 minutes and didn’t check. For all I know they do a single pass.)
One way to parameterize recall tasks is x,y,z = time you get to study the sequence, time between in which you must maintain the memory, time you get to try and recall the sequence.
During “x”, you get the case you described. I presume it makes sense to do the standard spaced-rep study schedule, where you re-study information at a time when you have some probability of having already forgotten it. (I also have not looked into what memory champions do.)
During “y”, you have to maintain. You still want to rehearse things, but you don’t want to wait until you have some probability of having forgotten, at this point, because the study material is no longer in front of you; if you forget something, it is lost. This is what I was referring to when I described “keeping something in working memory”.
During “z”, you need to try and recall all of the stored information and report it in the correct sequence. I suppose having longer z helps, but the amount it helps probably drops off pretty sharply as z increases. So x and y are in some sense the more important variables.
So how do you want to use it?
I think my usage is mainly weird because I’m going hard on the operationalization angle, using performance on memory experiments as a definition. I think this way of defining things is particularly practical, but does warp things a lot if we try to derive causal models from it.
I think it’s cool what you’re trying to do, I just wish you had made up your own original term instead of using the existing term “working memory”. To be honest I’m not an expert on exactly how “working memory” is defined, but I’m pretty sure it has some definition, and that this definition is widely accepted (at least in broad outline; probably people argue around the edges), and that this accepted definition is pretty distant from the thing you’re talking about. I’m open to being corrected; like I said, I’m not an expert on memory terminology. :)
The term “working memory” was coined by Miller, and I’m here using his definition. In this sense, I think what I’m doing is about as terminologically legit as one can get. But Miller’s work is old; possibly I should be using newer concepts instead.
That task measures what can be written to memory within 5 minutes, given unlimited time to write relevant compression codes into long-term semantic memory. It’s complex. See my top-level comment.
I’m sure you know this, but the “jog my memory” effect is plausibly explained by memory being like a Hopfield network;
This is wrong, a random letter contains log(26)/log(2) = 4.7 bits of information.
Whoops! Thanks.
What if “system 2” is like a register machine with 4 pointer-storing registers / slots / tabs, 2 per hemisphere? (Peter Carruthers “The Centered Mind”.) I don’t reject information processing, but rather consider “working memory” to be a misnomer. The brain does not implement memcpy.
Agreed. I’ve been thinking along similar lines for a while now, but with a focus on problem-solving.
That is: Suppose you’re trying to solve some cognitive problem, like proving a math theorem. Your mental ontology has a “dictionary” representing natural abstractions and their oft-used combinations: “chunks”, with codes whose lengths reflect how often you use them.
For a given pair of (mental dictionary, cognitive problem), the solution (theorem proof) has a minimal message length L. If L is too large to fit into the working memory, the problem needs to be solved in steps: by composing new words out of the existing chunks (proving lemmas, deriving helper functions), then assigning those words a shorter code. And since you can’t conceptualize the solution yet, it’s done stochastically: you ponder new chunks that you merely expect/hope are part of the solution. Eventually, if that process is successful, you modify your dictionary such that the new solution length L′ is less than your working memory, and so the problem is solved (the theorem is proven).
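A toy Python sketch of this picture, under assumptions that are mine and purely illustrative: chunks get ideal code lengths of -log2(p) bits, the dictionaries and probabilities are invented, and working memory is given a hypothetical 10-bit budget.

```python
import math

def message_length(message, probs):
    """Ideal code length in bits: sum of -log2 p(chunk) over the message."""
    return sum(-math.log2(probs[chunk]) for chunk in message)

WORKING_MEMORY_BITS = 10  # hypothetical capacity, for illustration only

# Before: the solution must be spelled out in small chunks.
old_dictionary = {"a": 0.25, "b": 0.25, "c": 0.5}
old_solution = ["a", "b", "a", "b", "a", "b", "c"]

# After: the oft-used combination "ab" (a "lemma") gets its own short code,
# and the remaining probability mass is redistributed.
new_dictionary = {"a": 0.125, "b": 0.125, "c": 0.5, "ab": 0.25}
new_solution = ["ab", "ab", "ab", "c"]

L_old = message_length(old_solution, old_dictionary)
L_new = message_length(new_solution, new_dictionary)

print(f"L  = {L_old:.0f} bits, fits: {L_old <= WORKING_MEMORY_BITS}")
print(f"L' = {L_new:.0f} bits, fits: {L_new <= WORKING_MEMORY_BITS}")
```

Here the same solution shrinks from 13 bits to 7 bits once the combination has its own codeword, so it crosses under the assumed capacity without the underlying content changing.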
Empirically, this explains a bunch of things:
Why resting/taking breaks is important even if you’re doing pure theory work. The brain needs some time to adjust its dictionary (likely by doing a Bayesian update — by noticing that some specific abstraction-combinations have been used more often lately, and so assigning them shorter codes).
Why leaving a problem alone for a while and then coming back to it may lead to sudden insights:
This idea of extensive unconscious computation neatly accords with Poincaré’s account of mathematical creativity in which after long fruitless effort (preparation), he abandoned the problem for a time and engaged in ordinary activities (incubation), is suddenly struck by an answer or insight, and then verifies its correctness consciously.
And maybe even part of why we have limited willpower/ability to focus on a given problem.
That said, it does seem pretty bizarre. How come it’s the codeword length that our working memory/intelligence is bottlenecked on, instead of the actual computational complexity of whatever abstractions we’re thinking about? Probably this account is missing something. (Most likely, it’s not just encoding-adjustment, and the brain uses some algorithms to actually performance-optimize the abstraction-combinations we think about often, e.g., by abstracting away their internal structure… Except the 7±2 number still looks weirdly constant even in this context.)
I find the working memory question very intriguing, and this is an interesting matrix in which to explore it, particularly with the comments. My own thoughts concern not how to measure working memory but working imagination: how many ideas can we come up with in one minute (perhaps with reference to an index of “novelness”), as opposed to anxious OCD repetitions? Though knowing the rate of those anxious repetitions may also be useful.
This seems like a potentially misleading description of the situation. It seems to say that the contents of working memory could always be described in one minute of natural language, but this is not implied (as I’m sure you know based on your reasoning in this post). A 630-digit number cannot be described in one minute of natural language. 2016 bits of memory and about 2016 bits of natural language per minute really means that if our working memory was perfectly optimized for storing natural language and only natural language, it could store about one minute of it.
(And on that note, how much natural language can the best memory athletes store in their working memory? One minute seems low to me. If they can actually store more, it would show that your bit estimate is too low.)
I have in mind the related claim that if natural language were perfectly optimized for transmitting the sort of stuff we keep in our working memory, then describing the contents of our working memory would take about a minute.
I like this version of the claim, because it’s somewhat plausible that natural language is well-optimized to communicate the sort of stuff we normally think about.
However, there are some plausible exceptions, like ideas that are easy to visualize and draw but difficult to communicate in full detail in natural language (EG a fairly specific curved line).
Plausibly, working memory contains some detail that’s not normally put to fruitful use in sequence-memorization tasks, such as the voice with which the inner narrator pronounces the numerals, or the font if numerals are being imagined visually.
However, the method in the post was only ever supposed to establish a lower bound, anyway. It could take us a lot longer than a minute to explain all the sensory detail of our working memory.
Woah. I have nothing else to add. Great stuff Abram!