Expert Iteration From the Inside
Epistemic Status: Possibly generalizing from one example, but seems simple enough to apply more broadly.
Related To: "Thinking Fast and Slow", Fake Frameworks, Dive In.
I think it’s a good idea to try out bad ideas. No, I don’t mean really bad ideas, the kind that both your System 1 and System 2 judgements warn against; I mean bad as in your System 1 makes you feel like they might be worth a try, but your System 2 can’t verify that they’re truly the optimal strategy. In most situations it’s simply not possible to fully calculate all the possible consequences of an action within the window of opportunity in which the action is feasible. And yet, society moves on. Despite a lot of people relying mainly on their System 1 intuitions, this does not result in disaster often enough to warrant a big shift in the weighting of these two systems.
System 1 often gets a bad rap. It’s viewed as something that most people rely on, to a fault, over their System 2 thinking, which results in a lot of negative biases and other failure modes. It seems very difficult to tune, and no one seems to know exactly how to change it, whereas System 2 is something we can deliberately control by learning about the world, gathering empirical data, creating new models, watching out for biases, and so on.
Unfortunately, lots of real-world problems need to be solved quickly, many human endeavors require cooperation between people with very different beliefs and incentives who are hard to model accurately, and we still need to take actions in this world. This can’t always be accomplished by sitting and thinking until we’re convinced that we have theoretical guarantees that our actions are optimal.
It seems that within the rationality community there may be a tendency to mistrust intuition and to be very risk averse, requiring a lot of theoretical guarantees before progress can be made. I argue that this is a bad thing. And while there’s obviously room for risk aversion and reliance on System 2, I also argue that it’s not strictly necessary to lean fully on System 2 in order to be a good rationalist. Yes, intuition can be the source of many of our biases and flaws in judgement, but intuition can be tuned with training, and, as I want to stress, should be. In fact, as I will argue, having a well-developed System 1 will be a source of greater contentment and satisfaction.
A few months ago, I sort of devised my own system for measuring my overall “happiness”, which I write down twice each day on a scale from 0 to 10. The majority of these numbers I never reach; they are there mainly because they are theoretically possible, and most of my time is spent between 4 and 7. The reason I put “happiness” in quotes is that it’s not really a measure of joy or pleasure, or anything directly related to the environment I’m in. Rather, it’s more like a general state of mind, on a positive-negative axis, that remains in place for long periods of time, is somewhat robust to changes in circumstances, and, most importantly, seems to affect everything I do. This scale could probably be better described as my overall level of contentment or satisfaction.
For example, I chose 5 to represent my “neutral state.” Neutral, for me, feels like I do not feel very strong motivations to do one thing or the other—I am not really excited about anything, don’t really feel strongly positive thoughts about my identity or ego, and don’t feel any need to do anything beyond what habits and current obligations dictate. I’m here roughly 30% of the time.
Six is a slightly positive state. I tend to feel pretty ok with the way things are. I may even start a conversation with someone I might not normally want to talk to. I feel a little more social, a little more comfortable with myself, and a little more willing to put effort into my daily tasks. Most things in the world are pretty tolerable in this state. I’m here (during good months) roughly 50% of the time.
Seven is what I like to call “light mania.” I feel very good about myself. I take pleasure in positive evaluations of myself—basically I feel like I’m capable of a lot and should therefore do a lot. In this state I enjoy talking to people, I am much more social, I get excited about ideas, and in fact tend to write a lot, like what I’m doing right now. I tend to feel like my thoughts and actions are worth something in the world. I’m here maybe 10% of the time. I don’t know if I’ve ever been above seven; if I have, it was only rarely. Nine and ten might only be possible during euphoric states or intense mania.
The rest of my time is spent in negative states, usually 4, but maybe even 3 at rare times. Three and below are what I’d call “depression” and 4 is a slightly negative state where I feel like I need to take it easy, slow down, do a careful re-evaluation of my abilities, and update my confidence. It is a risk-averse, slow, careful state.
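For what it’s worth, the bookkeeping behind those percentages is trivial. Here’s a minimal sketch of the kind of log and summary I have in mind; the entries shown are made up for illustration, not my actual data.

```python
from collections import Counter
from datetime import datetime

# A twice-daily contentment log as (timestamp, rating) pairs.
# The entries here are illustrative, not real data.
log = [
    (datetime(2018, 1, 1, 9), 5), (datetime(2018, 1, 1, 21), 6),
    (datetime(2018, 1, 2, 9), 6), (datetime(2018, 1, 2, 21), 4),
    (datetime(2018, 1, 3, 9), 7), (datetime(2018, 1, 3, 21), 6),
]

def record(log, rating, when=None):
    """Append one 0-10 rating, stamped with the current time by default."""
    log.append((when or datetime.now(), rating))

def distribution(log):
    """Fraction of entries spent at each rating level."""
    counts = Counter(rating for _, rating in log)
    return {level: counts[level] / len(log) for level in sorted(counts)}

print(distribution(log))  # e.g. {4: 0.17, 5: 0.17, 6: 0.5, 7: 0.17}
```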
The most interesting thing about these states of mind is that the positive states appear to be much more System 1-heavy than the negative states. At 7, my thoughts and actions are a lot more fluid. They come to me very freely, often without much deliberate effort, and they often feel safe, as in trustworthy, for the most part. At 4, it takes much more effort to carry thought processes through to completion, and actions require much more deliberate planning before they feel safe. I don’t write much at 4, I don’t interact much, but I do spend a lot of time reading and studying, trying to find my weaknesses and areas that require improvement.
Obviously, 6 and 7 are where I prefer to be most of the time, but the negative states probably have their uses. Not every risk is worth taking, and sometimes it really is a good idea to do full re-evaluations and course-changes that require not going along fully with the flow. However, it’s still not a happy place, and therefore it’s not where one wants to be most of the time. Most of what I accomplish is not done during these negative states.
I now spend a lot of my time trying to figure out what precedes the 6’s and 7’s, and whether anything can be done to increase the duration of my time in them and to return to them when I fall out. The main pattern I can detect is that I fall out of them when I take a risky action that fails, especially when I happened to have a lot of confidence in the action. It hurts to make a big mistake that you didn’t predict. And this makes sense to me. If I’ve made a major mistake (as measured by the difference between outcome and prediction), then it seems reasonable to consider a course correction or a major model update. I think there is some sort of internal system in our minds that says, “Ok, now it’s time to pause your intuitions for a moment and reflect on what you’re doing.” This is probably implemented for our own benefit.
The 6’s and 7’s return gradually after one of these events takes place, and probably stay until the next failure happens. Thus, in order to remain there as long as possible, it simply suffices to make sure I don’t experience any major failures. Well, it’s simple to understand that part, at least. It’s not simple or easy to make sure my System 1 judgements nearly always match what System 2 deliberation would tell me.
But, it’s at least clear from this train of thought that it is highly worthwhile to train System 1 as much as possible. In other words, our intuitions should match our carefully constructed System 2 models as closely as they can. My goal is to be in a position where I am using my System 1 fast-mode thinking most of the time, and this process is not resulting in any major mistakes.
We at least know that this should be possible. After all, learning certain skills, like riding a bike, playing a sport, becoming an artist or musician, playing chess or go, or even programming or doing math, requires a reliance on intuition and System 1 at the level of mastery of those abilities. As a novice, you do much more deliberating and thinking while learning those skills; there’s a thought before each action that tries to evaluate whether the proposed action is correct. As the skill develops further, many of the simpler and more frequent actions become second nature, almost instantaneous, and many of the very bad actions just aren’t even considered anymore.
That’s not to say there’s never any thinking or System 2 processing happening—it obviously still occurs—but the attention of those processes is now devoted to much more complex or hierarchical actions, and to a much narrower domain than before. System 1 guides where System 2 attention should be applied.
Interestingly, the recent AlphaGo Zero system used a training algorithm that mimics this process. The neural network learning to play the game uses its raw output to guide a tree-search process, which returns a better move than the one output by the network itself. During training, the network learns by supervised learning to predict the output of the tree-search process as well as the game outcome. Over time, the neural network’s output more closely matches the results of the tree search, which in turn guides more effective searches and allows the network to improve its overall skill. The authors of a paper describing a similar strategy called “Expert Iteration” explicitly invoke dual-process theory as a guiding principle behind their approach.
The basic idea behind Expert Iteration is this: there is a policy, such as a reinforcement learning agent, that we’d like to train on some specific task. The policy by itself is a “fast-mode” process, likened to System 1. There is also an expert policy derived from the original policy (in this case, tree search guided by the fast policy), likened to System 2. The fast policy is trained to imitate the output of the expert—in other words, System 1 is trained to mimic System 2.
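To make that loop a bit more concrete, here is a minimal sketch of what one round of Expert Iteration could look like. This is my own illustration, not code from the paper: `fast_policy`, `tree_search`, `env`, and `fit` are placeholder names standing in for a policy network, a search procedure guided by it, a game environment, and a supervised training step.

```python
import numpy as np

def expert_iteration_round(fast_policy, tree_search, env, fit, n_games=100):
    """One round of Expert Iteration (illustrative sketch, not the paper's code).

    fast_policy: "System 1" -- maps a game state to move probabilities.
    tree_search: "System 2" -- a search guided by fast_policy that returns an
                 improved move distribution for a state.
    env:         a game environment with reset() and step(move).
    fit:         a supervised training step for fast_policy.
    """
    states, search_targets, outcomes = [], [], []

    # 1. Self-play: the expert (search guided by the fast policy) picks the moves.
    for _ in range(n_games):
        state = env.reset()
        game_states, game_targets = [], []
        done, result = False, 0
        while not done:
            improved_probs = tree_search(fast_policy, state)  # System 2 output
            move = np.random.choice(len(improved_probs), p=improved_probs)
            game_states.append(state)
            game_targets.append(improved_probs)
            state, done, result = env.step(move)
        states.extend(game_states)
        search_targets.extend(game_targets)
        outcomes.extend([result] * len(game_states))

    # 2. Imitation: train the fast policy to predict the expert's move
    #    distribution (and, as in AlphaGo Zero, the eventual game outcome).
    fit(fast_policy, states, search_targets, outcomes)
    return fast_policy
```

Repeating this round lets the fast policy’s intuitions absorb what the slower search discovered, which in turn makes the next round of search stronger.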
The successful results in reinforcement learning and game-playing AI research suggest that making initially random, exploratory moves and then gradually refining the policy is a good strategy for making progress at learning a skill.
As I mentioned earlier, the willingness to initially make random or exploratory moves requires some level of risk tolerance. You have to be fine with your model being imperfect, because you probably require lots of experience and empirical data before your explicit model can become an intuition.
As Nate Soares mentions in Dive In, sometimes the best approach to starting out on something is simply to have a plan, any plan, and act on that plan rather than wait it out forever. However, he freely admits that it is very difficult to know when it’s safe to dive in rather than wait and deliberate.
In practice, my strategy has been to wait until my overall mental state has moved up towards the positive 6’s and 7’s, which is a sign that it’s a good time to start taking riskier actions. Here, my intuitions are telling me: you’re doing pretty well, things are going smoothly; if you want to make progress, you’re going to have to try something new. If I’m down at 4, my intuition is telling me: now is not the time to take risks; you need to wait a little while, observe things, learn from people who know better than you. So in some sense, we rely on our System 1 to know how to use both our System 1 and System 2 to make decisions. If System 1 is actually the primary controlling mechanism, then it makes sense to try to refine it as much as possible, rather than try to ignore it.
Ignoring System 1, or misinterpreting it, is a failure mode that I think rationalist-type personalities might fall into. This framework arose partly out of a desire to help myself power through some depressive states. If I misunderstand why these states are arising, then it’s possible for me to dig myself into a deeper hole. If I start to lack confidence in myself, it’s easy to get stuck in the trap of “I’m terrible at everything, attempt nothing.” It’s much more worthwhile to look at the situation as an opportunity to observe passively and deliberate more, both as a mechanism for not falling into even more unpleasant and unhappy states of mind, and as a way of improving myself in general.
Unless I am severely ill, it is never a good idea to do absolutely nothing. So if I don’t feel like doing much, I shouldn’t interpret this as a statement about my overall ability—that I’m a lazy or unmotivated person, or because I can’t make any useful progress—instead it should be interpreted as meaning that now is not the right time to take many active, uncertain steps.
But if I were to plot my level of contentedness (negativity-positivity) on one axis and my level of risk-aversion or risk-tolerance on the other, the observation that these appear to be correlated should be considered evidence for something. Evidence for what hypothesis? My hypothesis is that this is a mechanism for learning: an incentive system for gaining knowledge, skills, and better models, and for transferring those into System 1 intuitions. Your overall contentedness is determined by the level of risk you are comfortable taking—and the best way to get to a point where you are comfortable taking risks is to succeed, a lot.
This is basically what the exploration-exploitation trade-off feels like from the inside.
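For the outside view of that trade-off, here’s a toy epsilon-greedy sketch. The tasks, their success probabilities, and the 10% exploration rate are all made up for illustration; the point is just the tension between repeating what has worked and occasionally trying something new.

```python
import random

# Toy exploration-exploitation illustration: a few "tasks" with unknown
# success probabilities, and an agent that mostly repeats what has worked
# (exploitation) but occasionally tries something new (exploration).
true_success_prob = {"chores": 0.9, "easy project": 0.6, "risky project": 0.3}
estimates = {task: 0.5 for task in true_success_prob}  # initial guesses
counts = {task: 0 for task in true_success_prob}
epsilon = 0.1  # fraction of the time spent exploring

for step in range(1000):
    if random.random() < epsilon:
        task = random.choice(list(true_success_prob))  # explore
    else:
        task = max(estimates, key=estimates.get)       # exploit
    success = random.random() < true_success_prob[task]
    counts[task] += 1
    # Incremental average: nudge the estimate toward the observed outcome.
    estimates[task] += (success - estimates[task]) / counts[task]

print(estimates)  # estimates drift toward the true success probabilities
```

An agent that never explores stays stuck with its initial guesses about everything it hasn’t tried; one that only explores never cashes in on what it has learned.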
This hypothesis is very important to keep in mind if your goal is a greater quality of life, because it suggests that continued learning might be a key prerequisite. A good deal of happiness might be generated by the feeling of making progress at something, and this is good, because rate of progress isn’t necessarily a function of your current social status, wealth, or level of resources, and is therefore independent of many factors in life that are out of your control.
Now of course, I’ve generated all of this by using introspection alone, and therefore there’s a possibility that these insights into my own psyche fail to generalize to other people. I don’t think I have much evidence that I’m an outlier, yet, so at this moment I think there could be value to what I’ve observed here.
And I’ve found it a useful tool for fighting akrasia. Or, rather, for dealing with it, since it doesn’t always need to be fought. Sometimes you’ve been lingering around contentment 4 and 5 for a while and want to move back up, but it’s difficult to see how. Worse, you find it hard to muster the motivation to accomplish useful tasks, and therefore won’t be putting much energy into fighting the akrasia.
But if my model is correct, then it suggests that as long as you observe yourself succeeding on a variety of tasks, it may not matter much what those tasks are. In other words, maybe the best way to get out of a 4 is to start doing, say, household chores, because they’re right there, waiting to be done, simple to accomplish, and likely to be completed successfully (once started). To get yourself out of a hole, start with relatively low-risk tasks, do lots of those, and then work your way up from there. I want to emphasize: accomplishing even relatively easy tasks results in a psychological boost.
Additionally, if you want to take actions that are “high risk” on one dimension (easy to fail at) but low on another (failure has few concrete negative consequences), it might be worthwhile to try those as well. At the very least, you might gain a useful model update.
One of my worst habits, when I’m feeling low and have negative perceptions of my abilities, is to focus too much on the task I’m having trouble with, wasting many hours trying to do that one thing while being totally stuck—a negative feedback loop. This habit was formed by frequent procrastination: I would often not give myself enough time to do the task, and would then be forced to put all my focus into it at the last minute. A better strategy here is to stop and move on to something easier (reminiscent of the common test advice to skip difficult problems and return to them later).
Speaking of procrastination, if it is being caused by uncertainty about where your attention needs to be applied, then perhaps it should be taken as evidence of a need for deliberate planning. And indeed, this is often the strategy I take when I notice the tendency to procrastinate. Instead of telling myself “I need to work on only this right now” and failing to, I tell myself “Make a roadmap of what needs to be done.” Then that energy gets applied to building a model of the current situation and priorities, where I might find something easier to do first before returning to the task I was having trouble with.
But all of this is aligned with my overall strategy of improving my own quality of life. The challenge is that once I reach my “contentedness levels” of 6 or 7, and become a little more risk tolerant, I also increase the risk of falling back below 5. If my goal is to stay happy as long as possible, does it make sense to try to avoid taking risks even when I feel like I should take them?
This is a difficult question, because under this model, oscillating up and down slightly seems like what should be happening, never plateauing anywhere, and never reaching levels that are too high or too low.
If we’re talking about maximizing overall utility, but that utility is actually a proxy for some other kind of utility (like the success of the human species), then simply maximizing the proxy (individual happiness) might not be beneficial overall.
And that makes things sort of complicated for me, because I want to be happy, but maybe I’m really supposed to be fine-tuning how my overall contentedness oscillates over time, so as to steer my motivations in the correct directions. This is difficult along two dimensions: how much control I actually have over my overall contentedness, and figuring out what the “correct” pattern is to begin with.
This is a challenging problem that I plan on returning to in the future, and welcome any additional insights from others.