One Minute Every Moment
About how much information are we keeping in working memory at a given moment?
“Miller’s Law” dictates that the number of things humans can hold in working memory is “the magical number 7±2”. This idea is derived from Miller’s experiments, which tested both random-access memory (where participants must remember call-response pairs, and give the correct response when prompted with a call) and sequential memory (where participants must memorize and recall a list in order). In both cases, 7 is a good rule of thumb for the number of items people can recall reliably.[1]
Miller noticed that the number of “things” people could recall didn’t seem to depend much on the sorts of things people were being asked to recall. A random numeral contains about 3.3 bits of information, while a random letter contains about 4.7; yet people were able to recall about the same number of numerals or letters.
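The per-symbol figures above are just base-2 logarithms of the alphabet size. A minimal sketch, assuming each symbol is drawn uniformly at random (the function name `bits_per_symbol` is my own):

```python
import math

def bits_per_symbol(alphabet_size: int) -> float:
    """Information in one symbol drawn uniformly at random from the alphabet."""
    return math.log2(alphabet_size)

numeral_bits = bits_per_symbol(10)  # decimal digits 0-9
letter_bits = bits_per_symbol(26)   # letters a-z, ignoring case

print(f"numeral: {numeral_bits:.2f} bits")  # about 3.32
print(f"letter:  {letter_bits:.2f} bits")   # about 4.70
```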
Miller concluded that working memory should not be measured in bits, but rather in “chunks”; this is a word for whatever psychologically counts as a “thing”.
This idea was further reinforced by memory athletes, who gain the ability to memorize much longer strings of numbers through practice. A commonly repeated explanation is as follows: memory athletes are not increasing the size of their working memory; rather, they are increasing the size of their “chunks” when it comes to recalling strings of numbers specifically.[2] For someone who rarely needs to recall numbers, individual numerals might be “chunks”. For someone who recalls numbers often due to work or hobby, two- or three-digit numbers might be “chunks”. For a memory athlete who can keep hundreds of digits in mind, perhaps sequences of one hundred digits count as a “chunk”.[3]
However, if you’re like me, you probably aren’t quite comfortable with Miller’s rejection of bits as the information currency of the brain. The brain isn’t magic. At some level, information is being processed.
I’ll run with the idea that chunking is like Huffman codes. Data is compressed by learning a dictionary mapping from a set of “codewords” (which efficiently represent the data) to the decompressed representation. For example, if the word “the” occurs very frequently in our data, we might assign it a very short codeword like “01”, while rare words like “lit” might get much longer codewords such as “1011010”.
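Here is a minimal sketch of how such a dictionary could be built with the standard Huffman construction. The word counts are made up for illustration, and `huffman_code` is a hypothetical helper, not a claim about how the brain learns its dictionary. It expects at least one symbol.

```python
import heapq
from collections import Counter

def huffman_code(freqs):
    """Return {symbol: bitstring} for a Huffman code built from symbol counts."""
    # Heap entries: (total count, unique tie-breaker, {symbol: partial codeword}).
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:
        # Degenerate case: a lone symbol still gets a one-bit codeword.
        return {sym: "0" for sym in heap[0][2]}
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codewords.
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in left.items()}
        merged.update({s: "1" + w for s, w in right.items()})
        heapq.heappush(heap, (c1 + c2, next_id, merged))
        next_id += 1
    return heap[0][2]

# Toy corpus (invented counts): "the" is common, "lit" is rare.
words = ["the"] * 40 + ["of"] * 20 + ["cat"] * 10 + ["sat"] * 5 + ["lit"] * 1
codes = huffman_code(Counter(words))
for word, code in sorted(codes.items(), key=lambda kv: len(kv[1])):
    print(f"{word!r}: {code}")
```

With these counts, “the” ends up with a one-bit codeword and “lit” with a four-bit one; the exact bit patterns depend on how ties are broken.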
A codeword is sort of like a chunk; it’s a “thing” in terms of which we compress. However, different codewords can contain different amounts of information, suggesting that they take up different amounts of space in working memory.[4]
According to this hypothesis, when psychologists such as Miller ask people to remember letters or numbers, the codeword size is about the same, because we’re asked to recall individual letters about as often as individual numbers. We don’t suddenly adapt our codeword dictionary when we’re asked to memorize a sequence of 0s and 1s, so that our memory can store the sequence efficiently at one-bit-per-bit; instead, we use our native representation, which represents “0” and “1” via codewords which are about as long as the codewords for “5” and “j” and so on.
In effect, Miller was vastly underestimating working memory size via naive calculations of size in terms of bits. A string of seven numerals would contain 3.3 * 7 = 23.1 bits of information if stored at maximal efficiency for the number-remembering task. A string of seven letters would instead contain 4.7 * 7 = 32.9 bits, under a similar optimality assumption. But people don’t process information in a way that’s optimized for psychology experiments; they process information in a way that’s optimized for normal life. So, these two estimates of the number of bits in working memory are allowed to be very different from each other, because all we can say is that they are both lower bounds for the number of bits stored by working memory.
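The lower-bound argument can be sketched in a few lines. This assumes a span of exactly seven items for every item type, and it uses unrounded logarithms, so the numeral figure comes out slightly above the rounded one in the text:

```python
import math

SPAN = 7  # items recalled reliably, per the rule of thumb

# Alphabet sizes for three kinds of test item.
alphabets = {"binary digit": 2, "numeral": 10, "letter": 26}

# Each task yields a lower bound: the bits an ideal task-specific code would need.
lower_bounds = {name: SPAN * math.log2(size) for name, size in alphabets.items()}

for name, bits in lower_bounds.items():
    print(f"{SPAN} x {name}: at least {bits:.1f} bits")

# Every entry bounds the same quantity from below, so the largest is the most informative.
best_bound = max(lower_bounds.values())
```

The point is that all three numbers can be valid at once: none of them is an estimate of capacity, only a floor under it.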
We can get a more accurate estimate by returning to the memory athletes. Although some memory athletes might start with a biological advantage, let’s assume that it’s mostly a matter of learning to chunk numbers into larger sequences—that is, learning an encoding which prioritizes number sequences more highly.
There will always be more to a memory athlete’s life than memorizing numbers, so their chunk encodings will never become perfectly efficient for the number-memorization task. However, maybe a memory athlete who devotes about half of their time to memorizing numbers will approach a code with only 1 bit of inefficiency for the number-memorizing task (representing a 50-50 binary decision between representing sequences of numbers vs anything else). So we might expect top memory athletes to give us a fairly tight lower bound on the number of bits in working memory.[5]
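To make the “1 bit of inefficiency” figure concrete, here is a minimal sketch. It assumes (my simplification) that the athlete’s code gives a fraction `share` of its total probability mass to digit strings of a known length, spread uniformly over them. An ideal code then spends `-log2(share)` extra bits on top of the `n * log2(10)` bits the digits themselves require:

```python
import math

def overhead_bits(share: float) -> float:
    """Extra bits paid because only `share` of the code is reserved for digit strings."""
    return -math.log2(share)

def codeword_bits(n_digits: int, share: float) -> float:
    """Ideal codeword length for one specific n-digit string.

    Computed in log space so that very long strings do not overflow a float.
    """
    return n_digits * math.log2(10) + overhead_bits(share)

print(overhead_bits(0.5))    # 1.0: half of the code devoted to digits
print(overhead_bits(0.25))   # 2.0: a quarter of the code devoted to digits
print(f"{codeword_bits(630, 0.5):.1f}")
```

Under this assumption the overhead does not grow with the length of the string, which is why a heavily practiced code could in principle come close to the optimum.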
Andrea Muzii memorized a 630-digit number within five minutes. The most efficient encoding we can have for random digits takes about 3.3 bits per digit; rounding down to a conservative 3.2 bits per digit, this suggests that Andrea Muzii has at least 630 * 3.2 = 2016 bits of working memory.[6] (If Muzii was storing 7 chunks in working memory, this would suggest chunks of about 288 bits each.)
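The arithmetic, spelled out as a quick check. The 3.2 bits-per-digit figure is the one used in the text; the unrounded value log2(10) gives a slightly higher total, so 2016 remains a valid lower bound:

```python
import math

DIGITS = 630   # digits memorized, from the text
CHUNKS = 7     # rule-of-thumb chunk count

bits_rounded = DIGITS * 3.2           # figure used in the text
bits_exact = DIGITS * math.log2(10)   # unrounded bits per digit

print(round(bits_rounded))            # 2016
print(round(bits_exact))              # 2093
print(round(bits_rounded / CHUNKS))   # 288 bits per chunk
```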
How can we visualize 2016 bits of information?
Human languages have an information rate of about 39 bits per second, so if we pretend that working memory is written entirely in natural language, this corresponds to about 52 seconds of natural language. Dividing by 7±2 chunks, we get roughly six to ten seconds per chunk.
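A short sketch of the conversion from bits to seconds of speech, taking the 2016-bit and 39 bits-per-second figures from the text as given:

```python
BITS = 2016                  # estimated working-memory contents, from the text
SPEECH_BITS_PER_SECOND = 39  # information rate of natural language, from the text

total_seconds = BITS / SPEECH_BITS_PER_SECOND
print(f"total: {total_seconds:.1f} s")

# Seconds of speech per chunk across the 7±2 range.
seconds_per_chunk = {chunks: total_seconds / chunks for chunks in (5, 7, 9)}
for chunks, seconds in seconds_per_chunk.items():
    print(f"{chunks} chunks: {seconds:.1f} s per chunk")
```

This comes to a little under 52 seconds in total, and somewhere between about 5.7 and 10.3 seconds per chunk depending on the chunk count.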
So, according to this estimate, if we could freeze-frame a single moment of our working memory and then explain all of the contents in natural language, it would take about a minute to accomplish.[7]
If the sequential vs random access memories were significantly different, we would want to treat them differently. But who knows what’s going on under the hood? For the purposes of this post, I’m more-or-less pretending that working memory is a flat array. This might not be anything like the truth.
Perhaps memories are stored as link structures, so random-access can be implemented as binary trees, and sequential can be implemented as linked lists. In that case, the two would be pretty similar in nature. Or perhaps sequences are stored in the auditory modality, which can “play back” in sequence, while random-access gets stored in visual memory, which can be accessed “at a glance”. In this case, they’d be using entirely different hardware. Perhaps it depends on the person!
Epistemic status: hazy memory from psychology classes over a decade ago.
I believe the main evidence for this hypothesis is the way memory experts improve their ability to memorize stuff they practice, like sequences of numbers, without improving their ability to memorize other sorts of things.[2] If total working memory was being increased, they should be able to memorize anything.[8]
The variable length of the codewords isn’t really the important thing, here. I’m just mentioning Huffman codes because it’s a particularly common version of the dictionary-coding idea. My main point is to imagine “chunks” as codewords in a dictionary code.
However, the idea that some chunks have a higher probability than other chunks feels intuitive. Variable-length codewords imply variable-probability chunks, while fixed-length codes imply that all chunks have the same probability.
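The link between codeword length and probability can be sketched directly: in an ideal prefix code, a codeword of length L corresponds to a chunk with probability 2^-L. The example codewords below are the ones used earlier for “the” and “lit”; the function name is my own.

```python
def implied_probability(codeword: str) -> float:
    """Probability an ideal prefix code implicitly assigns to this codeword's chunk."""
    return 2.0 ** -len(codeword)

# Variable-length code: short codewords imply likely chunks, long ones rare chunks.
p_the = implied_probability("01")        # 0.25
p_lit = implied_probability("1011010")   # 0.0078125, that is 1/128

# Fixed-length code: every three-bit codeword implies the same probability.
fixed = ["000", "001", "010", "011", "100", "101", "110", "111"]
fixed_probs = {w: implied_probability(w) for w in fixed}

print(p_the, p_lit)
print(set(fixed_probs.values()))   # a single value: 0.125
```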
It’s easier to imagine variable-length codewords shifting up and down in probability as we learn; it’s harder to imagine fixed-length codewords growing and shrinking in length to shift the probabilities of sequences as we learn.
This is, obviously, wild speculation.
Several commenters have questioned my use of “working memory” here. For example, it’s plausible that Muzii was able to store significant portions of the information in long-term memory within the five minutes. So, some comments on what I mean by “working memory” in this essay.
I see three approaches to reasonable definitions of “working memory”.
The first is intuitive/informal: “working memory” refers to memories which can be “kept in mind”. If I “get distracted” by a phone call, I might “forget what I was doing”, even though I may be able to recall what I was working on by consulting “long-term memory”—this suggests there’s some kind of “short-term” or “working memory” which has to do with what I was working on.
(Short-term memory and working memory are often treated as synonyms in the literature, according to Wikipedia.)
The second approach is operational: define some tests of working memory size, such as those Miller employed. “Working memory” becomes whatever those tests measure.
The third approach is more neurological: based on our understanding of the brain, define “working memory” to point at specific elements of the information processing of the brain. Which elements? Well, we would base the selection on a combination of the first two approaches: which elements of the brain’s information processing seem to fit the informal idea of working memory? Which brain-regions are active during the experiments which form the operational definition?
I take an operational approach.
I see several reasonable approaches to operational definitions of short vs long-term memory: storage time (how long it takes to form a memory), recall time (how long it takes to retrieve a memory), and volatility (how long the memory sticks around—particularly if it’s not being actively maintained, that is, if we’re distracted by something else). All of these concepts are important, and each could individually be taken as a definition of “short-term vs long-term”.
The term “working memory” was coined by Miller. I think a reasonable operational definition, therefore, is “whatever Miller was measuring”.
However, I would like to further clarify this operationalization by explicitly indexing by the timespans involved in the experiment. A specific experimental setup can give people x seconds to memorize, y seconds between memorization and recital during which the memory must be maintained, and z seconds for recital. (z seems like the least important of these three variables, but it still matters.)
So when I use Muzii for my final calculation, I’m really talking about a specific operationalized version of working memory where x = 5 minutes. This is going to be different from x = 4 minutes or x = 6 minutes, whereas by simply calling it “working memory”, I risk conflating them.
Therefore, my punchline “one minute every moment” could more accurately be revised to “we recall about one minute of information from five minutes of trying to absorb information, if asked to recall very shortly after”.
(I’m not sure what the y-value was for Muzii’s world record that I cite.)
A core premise of this post is that it makes sense to talk about “working memory” as a relatively fungible information-processing resource within the brain. However, in reality it’s more like we have several scratchpads for different types of data: auditory working memory is called the “phonological loop” and stores a specific amount of sound; visual working memory is called the “visuospatial sketchpad”; and so on.
It’s unclear how, exactly, to connect this to the 7±2 number.
Different people vary in their conscious access to the different sensory working memories. Perhaps most people focus on one particular sensory modality for their conscious working memory. It could be that 7±2 represents what people can store in this primary working memory, with part of the variation explained by which type of working memory they use and how suitable it is to the particular memory task.
For someone who thinks in sounds, the highest bit-density would be achieved when remembering actual sound. Other types of information would have to be encoded in sound (probably as spoken language), which would be relatively inefficient.
For someone who thinks in images, the highest bit-density would be achieved when remembering images (video?), and other things would have to be encoded (as icons? written text? geometry? some combination?).
Memorization tasks might measure the lower-bit-density mode, where we’re encoding stuff more indirectly. On the other hand, maybe that’s the main thing we care to measure!
Another big caveat to this whole post is the idea that we might have multiple different levels of memory, ranging from very short-term to very long-term. When memorizing a sequence, we could store different pieces at different levels; perhaps we fill up our shortest-term memory, and then store some pieces in a medium-term memory while we keep juggling the shortest-term memory by rehearsal to prevent forgetting.
In other words, we should avoid making overly naive assumptions about the information processing architecture of the brain without justification.
It seems like this should be confounded, though, if the memory athlete finds a way to encode other information into the format they’ve practiced; e.g., memorizing random letters by encoding them as random numbers.
This is similar to the memory-palace technique, in which a memory expert memorizes arbitrary information by forming associations with some standard sequence which has already been memorized.
I expect this technique to confound the notion that memory experts only have increased memory for things they’ve practiced; memory experts who have practiced such a technique should be able to display similar feats of memory for things they haven’t practiced.
However, although it creates the appearance of a more global increase in working memory, rather than narrow gains only due to “chunking” in oft-practiced domains, this technique should not actually overcome the normal information-processing constraints of the brain. Encoding arbitrary information as strings of numbers will be more or less efficient based on the encoding we come up with. For events in everyday life, my expectation is that our existing encoding would be about as efficient as possible, so there would be no gains to memory by the act of encoding these events to numbers, even for a memory athlete.
Although, the act of performing such an encoding might signal to the brain that the information is important to recall. So, some gains might be observed anyway.