On Measuring Intellectual Performance - personal experience and several thoughts

A very interesting problem is measuring something like general intelligence. I’m not going to delve deeply into this topic but simply want to draw attention to an idea that is often implied, though rarely expressed, in the framing of such a problem: the assumption that an “intelligence level,” whatever it may be, corresponds to some inherent properties of a person and can be measured through their manifestations. Moreover, we often talk about measurements with a precision of a few percentage points, which suggests that, in theory, the measurement should be based on very stable indicators.

What’s fascinating is that this assumption receives very little scrutiny, while in cases where we talk about “mechanical” parameters of the human body (such as physical performance), we know that such parameters, aside from a person’s potential, depend heavily on numerous external factors and on what that person has been doing over the past couple of weeks.

In this text, I will discuss my experience not with measuring my intelligence level but with measuring my intellectual performance level under various circumstances.

Why Could This Be Useful?

There could be several practical benefits from this:

If there are simple ways to temporarily boost your “intellectual shape”, it is obviously good to know about them. By regularly measuring your intellectual performance based on easily adjustable factors, you might discover such methods.
It’s useful to recognize when, on a particular day, you are consistently less sharp than usual. On such days, it’s wise to avoid tackling problems that are at the limits of your intellectual abilities, or at least to be skeptical of decisions made on those days and double-check them later.
If there are identifiable factors that predict drops in intellectual performance, you can plan around them and schedule critical tasks during periods of peak performance.
There are likely applications in the field of biohacking.

Requirements for a Method of Measuring Intellectual Performance

After some thought, I arrived at the following requirements for a measurement method:

1. The result should reflect the manifestation of intellectual abilities such as pattern recognition, decision-making, prediction, optimization, and attention to detail. With all other factors remaining the same, an improvement in these abilities should correspond to an increase in the measured result, while a decline should lead to a decrease.
2. The method should be very cheap in terms of effort and time spent.
3. It should be as stable as possible: being in a certain state on different days, I should achieve similar results.
4. The result should be highly sensitive to changes in my mental state. If I subjectively feel that I am performing below my usual level on a given day, the result should reflect that decrease. Conversely, if I feel sharper, the result should improve.
5. I need to be able to repeat the measurement a very large number of times (and this is not a repetition of requirement 2).
6. It is important that, despite repeated practice, I do not significantly improve at the test-passing skill itself.

I would like to point out that traditional intelligence tests, such as IQ tests or psychometric assessments, clearly fail to meet at least requirements 2, 3, 5, and 6. As for requirement 4, while it’s less obvious, these tests are not designed to account for fluctuations in mental state. This ties back to the point I made at the beginning of this text: for the concept of an absolute intelligence level to make sense, we are forced to assume that this parameter represents a stable, inherent trait of a person, largely unaffected by situational factors.

Regarding requirement 1, a brief explanation is necessary since it is crucial. I am not concerned with measuring an absolute level of intelligence — essentially, a parameter that allows for the comparison of different individuals. What interests me is comparing the intellectual performance of the same person in different mental states and environments, not an absolute value but rather a deviation from their baseline performance when in a “neutral” state.

My Choice

The measurable parameter I settled on is the Elo rating from solving tactical chess puzzles.

For those readers who are not familiar with this topic, here is a short explanation of how it works on Lichess, the platform I use. The puzzles themselves are positions, mostly taken from real games. The task is to find a sequence of several moves (usually a small number, 2-5) that leads one of the sides to a significant advantage or victory. The rating system works similarly to the Elo rating system used in chess matches. Each puzzle is assigned a rating based on its difficulty, and players solving puzzles have their own puzzle-solving rating, which reflects their skill level in this particular activity. Each puzzle’s rating is adjusted according to the ratings of the players who solve it or fail it. A player’s rating is adjusted according to the ratings of the puzzles they solve or fail. The puzzles for each player are selected according to the current rating of that player and a short history of changes in that rating.
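
To make the mechanics concrete, here is a minimal sketch of a classic Elo-style update in which the puzzle is treated as the opponent. This is an illustration only, not Lichess’s actual implementation: Lichess uses Glicko-2, which additionally tracks how uncertain each rating is. The function names and the step size k = 32 are my assumptions.

```python
def expected_score(player_rating: float, puzzle_rating: float) -> float:
    """Probability of solving the puzzle under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((puzzle_rating - player_rating) / 400.0))


def update(player_rating: float, puzzle_rating: float, solved: bool,
           k: float = 32.0) -> tuple[float, float]:
    """Return (new_player_rating, new_puzzle_rating) after one attempt.

    k is an assumed step size; real systems tune it or, like Glicko-2,
    replace it with a per-rating uncertainty.
    """
    expected = expected_score(player_rating, puzzle_rating)
    delta = k * ((1.0 if solved else 0.0) - expected)
    # The puzzle plays the role of the opponent: it loses what the player gains.
    return player_rating + delta, puzzle_rating - delta
```

With these assumptions, solving a puzzle rated exactly at your own level has an expected score of 0.5, so you gain 16 points and the puzzle loses 16; failing it does the opposite.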

Let’s see how the properties of this measurement method relate to the requirements that were proposed in the previous section.

The first requirement is met with some limitations. It’s unlikely that solving chess puzzles can measure one’s ability to develop long-term strategies, and there are probably aspects of dealing with incomplete information that won’t show up here either. Nevertheless, it does capture some aspects of intellectual performance. I would also argue that most of the relevant abilities that are missed here would, in DnD terms, be represented not by Int but by Wis.
Price in terms of effort and time spent. In my experience, to reach a stable rating that accurately reflects your current state, you typically need to solve about 10-15 puzzles. If you don’t impose time limits (which could be useful itself but is quite uncomfortable), this takes less than ten minutes. That is, each act of measuring one’s current state is very cheap.
Stability/robustness. I conducted experiments: the rating stabilizes fairly quickly, and after about a dozen puzzles, it fluctuates around a certain value for a while before eventually dropping as fatigue sets in.
Sensitivity. Again, I can only rely on my observations, but in general, the method shows sensitivity in nearly all the cases where I’d expect it.
Repetition resource. I use the Lichess app. I haven’t found exact information about the number of puzzles available, but the number is on the order of millions, which ought to be enough for anybody.

So, all the requirements except requirement 6 are relatively well met.
Now, regarding requirement 6. This is, of course, the most questionable, given the very nature of tactical chess puzzles: if solving them didn’t help improve chess skills, and if chess skills didn’t translate into puzzle-solving ability, there wouldn’t be any point in doing them for chess players (and they think it’s a useful activity). So it’s reasonable to expect that my chosen method might be flawed due to its inconsistency with this particular requirement.

However, here’s an idea. Each person, not just in chess but in any activity, has a certain maximum level of skill that is achievable with a certain type and amount of practice/training. In any activity, you can’t make significant, long-term progress without changing your approach to practice and/or your level of commitment. And that’s enough for my purpose. If you take someone who isn’t a particularly avid chess player or puzzle solver, and who doesn’t have a strong desire to improve, this method should work fine for them almost right away. Theoretically, it should also work for someone who has been passionate about chess for many years and has reached a plateau skill level relative to their comfortable investments in improving. But based on indirect and imprecise evidence (literally, from watching chess streamers), I suspect that such people have some issues with requirement 3. In any case, I’m not one of them, and most likely neither are most people reading this. For those who are, I would venture to guess that everything I’m discussing here could be applied to Go tactical puzzles without any changes.

That said, just in case, I looked into how I could improve my puzzle-solving skills and deliberately avoided those methods. For me, these would be long sessions focused on solving puzzles in narrow categories. Lichess offers this option, and once I noticed the issue, I avoided experimenting with it further.

How I Used It

I simply registered on Lichess and started solving puzzles. For about the first week, I watched my rating fluctuate and learned a little along the way. It probably worked to my advantage that I’m already fairly old and not particularly adept at learning through simple practice. After a week, my rating stabilized around the level where it remains today, three years and approximately 25,000 solved puzzles later. During this first week, I made an effort to log into the app in a state I subjectively perceived as “I’m okay”. This is likely an important factor.

Once I noticed my rating had stabilized, I began experimenting, measuring my performance under significantly different conditions. Clearly, a rigorous, evidence-based study wasn’t possible here, so I decided to follow the high standards of Victorian-era amateur science: apply a change and observe its effects, without worrying about study blindness, placebo control, or even how previous experiments might affect the initial conditions for future ones — none of those luxuries were feasible.

For a time, I tried to record the results in numbers. Unfortunately, it turned out that the results weren’t reproducible with enough precision to make the numbers meaningful. However, the trends were consistent. If there were at least five copies of me, this could be addressed, but since there aren’t even two of me, I had to abandon such ambitions. That’s essentially it: if you’re interested in the effect of a particular factor on your overall intellectual performance, you can simply test it by checking, several times for reliability, how that factor affects changes in your Elo rating when solving puzzles.
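
For readers who do want to keep numbers, the comparison described above could be sketched roughly as follows. I did not use this code; it is a hypothetical illustration. It assumes you log your rating after every puzzle in a session, discards the first few puzzles as a warm-up (the default of 10 is an assumption based on my observation that the rating settles after about a dozen puzzles), and compares the median settled level of sessions under some factor against baseline sessions.

```python
from statistics import mean, median


def session_level(ratings_after_each_puzzle, warmup=10):
    """Average rating of one session once the warm-up puzzles are discarded."""
    settled = ratings_after_each_puzzle[warmup:]
    if not settled:
        raise ValueError("session is shorter than the warm-up period")
    return mean(settled)


def factor_effect(baseline_sessions, factor_sessions, warmup=10):
    """Median settled level under the factor minus the median baseline level.

    A negative value means the factor went along with a lower rating.
    """
    baseline = [session_level(s, warmup) for s in baseline_sessions]
    with_factor = [session_level(s, warmup) for s in factor_sessions]
    return median(with_factor) - median(baseline)
```

The median is used so that a single unusually good or bad session does not dominate the result. With only a handful of sessions per condition, the output should be read as a trend, not as a precise measurement.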

What Kind of Results Can Be Obtained

Overall, this method works well for testing stimuli that are controlled and have relatively quick effects. It’s possible to determine whether such stimuli influence intellectual performance and, if so, to what extent. Sometimes, through self-observation, you can uncover additional interesting details.

Here are examples of results I’ve confidently obtained about myself:

Lack of sleep — no surprise — greatly reduces my performance.
Caffeine (now, here’s something unexpected) has little effect on the rating itself (it lowers it, but only slightly), yet it significantly speeds up decision-making with almost no change in the quality of the decisions.
Checking my performance after being woken up by an alarm in the middle of the night gave an interesting result: a slight drop in rating (less than with sleep deprivation) but with a significant change in the decision-making process itself. Decisions were more intuitive, less analytical, and harder to explain. It felt more like an aesthetic choice than a logical one.
In a state subjectively felt as stress (I didn’t test hormone levels, so only a subjective feeling), performance drops significantly and — what is interesting — remains low for quite some time, well beyond the period of perceived stress.
Acute hunger — like skipping lunch with a regular eating schedule — reduces performance. Prolonged hunger — about a day or more — increases it. Perhaps this is more related to blood glucose levels; I haven’t had the chance to check.
Switching from another activity that requires mental effort but isn’t exhausting gives a noticeable boost. I tested this after playing a computer game and attending work meetings.
I was fortunate to experience an interesting version of post-COVID syndrome. It manifested in phases, separated by long stretches of normalcy. During flare-ups, my rating dropped significantly, but it was possible to achieve a “normal” level with much greater, subjectively felt effort than in a typical state. (This was quite unexpected: in a normal state, applying extra, above-comfort level effort provides a very small improvement.)
The result that is most important to me personally is that my intellectual performance noticeably improves after certain types of physical activity. There are two types of activities that have this effect. First, Nordic walking or some kettlebell exercises — both increase performance for several hours. Second, bodybuilding-style workouts have a positive influence as well, but the effect lasts up to 4-5 days, which was quite unexpected. I also tried to test strength workouts, but the results were inconsistent.

One more observation, which I don’t know how to use to my advantage, but which matters for keeping the data clean: my formal rating noticeably increases when, while solving puzzles, I describe my thought process out loud, actually speaking and not just running an internal monologue (similar to the “rubber ducky” method, though not quite the same).

What I Should Have Done but Didn’t

I regret not making thorough measurements, which I think wouldn’t have been too difficult, of the effects of things like a course of Pantogam, certain vitamins, or metformin.
It should be easy, and it should be done, to correlate the measurements with some simple physiological indicators: blood glucose, heart rate, etc., and perhaps CO2 levels in the surrounding air.
I should also mention a direction I excluded for myself: in general, it makes sense to include a time factor in the measurement process, limiting the time allotted for each puzzle. This doesn’t suit me, as it makes the procedure uncomfortable, meaning I probably wouldn’t have lasted that long. But it does seem reasonable, and perhaps it’s a path to more convincing results.
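
As for correlating ratings with physiological indicators, a plain Pearson correlation over paired per-session values would be a reasonable first step. The sketch below is hypothetical (I have not collected such data); it assumes one settled rating and one indicator reading, such as resting heart rate, per session.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)
```

A value near +1 or -1 would suggest the indicator moves together with the rating, while a value near 0 would suggest no linear relationship. With few sessions, even a large coefficient can easily be a coincidence.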

As an Afterword, Two Final Thoughts:

Does This Even Work?
Perhaps a good proof that my approach works would be to show what benefit I got from the results. But whether I’ve truly benefited from the information I’ve gathered is a tough question. Unfortunately, the real problems I face in everyday life — which, if solved successfully, lead to tangible benefits — are very diverse, demand different kinds of intellectual engagement, and are burdened by ever-changing external constraints. They don’t really lend themselves to measurable results. I believe I am able to use the information I’ve gathered, but I recognize my bias here. I’d say that this entire text serves as a stronger argument for the practical value of this method than my personal testimony does. It’s worth trying.
How Individual Are the Findings About External Factors’ Influence on Intellectual Performance?
I have no idea. I don’t have a second person who’s done research like mine, so I can’t compare my results with anyone else’s. I suspect that some effects — like those of caffeine — are fairly universal. But others — like the effects of different types of physical activity — are likely influenced heavily by personal history.

And that’s all. Thank you for your attention.
