Throughput vs. Latency
I’ve heard the 60% capacity idea before, and this reminds me of the Critical Chain/Theory of Constraints stuff. Looking at the formal queuing results, even if the math follows, I start to wonder how applicable those models are, e.g. what counts as a distributed system, etc.
Starting with your toy models sounds good!
Sure, I’ll try to start from first principles. I’ll use discrete time because continuous time makes the probability stuff more complicated without really adding anything (Poissons instead of Bernoullis).
Imagine you’re a worker and at every time step there’s a p chance you’re assigned work and a 1/2 chance you’ll complete the current task you’re working on. So there are three possible things that can happen to your inbox size X during a timestep:
X+=1 with probability p/2
X−=1 with probability (1−p)/2
X stays the same with probability 1/2
Let’s say for simplicity we just ignore/collapse away the time steps where X stays constant, so the evolution of X becomes a classic biased random walk where X+=1 with probability p and X−=1 with probability 1−p at each step.
Caveat: X never goes below zero: once you reach the legendary Inbox Zero you just sit idly waiting for the next job. So X is really a p-biased random walk with a reflecting barrier at 0. A simple way to understand the behavior of X is to define Y to be the same random walk without the barrier, and notice that the number of times (up to time t) the barrier eats an X−=1 step is exactly how far negative Y has gone, i.e. minus the minimum value of Y up to t. So we get X_t = Y_t − min_{a≤t} Y_a.
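If it helps to see this concretely, here is a minimal simulation sketch (Python, working on the collapsed walk with the self-loop steps removed); the function name and parameters are just illustrative. It tracks X next to the unconstrained walk Y and checks the identity above at every step.

```python
import random

def simulate(p, steps, seed=0):
    rng = random.Random(seed)
    x = 0          # inbox size, with the barrier at zero
    y = 0          # the same walk without the barrier
    min_y = 0      # running minimum of Y (Y_0 = 0)
    for _ in range(steps):
        step = 1 if rng.random() < p else -1   # +1 w.p. p, -1 w.p. 1-p
        y += step
        min_y = min(min_y, y)
        x = max(x + step, 0)
        assert x == y - min_y                  # the identity X_t = Y_t - min_{a<=t} Y_a
    return x

print(simulate(p=0.5, steps=100_000))          # final inbox size after 100k steps
```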
The critical question: if you have the option of tuning your workload p, by being thoughtful about what jobs you accept, what p do you choose? There are two measures of productivity you might optimize for: throughput (the rate at which you complete tasks), and latency (the average time a task sits in your inbox before completion).
Maybe I’ll pause here and see if there are any math questions.
As far as the background math goes, that all makes sense.
If going for throughput, i.e. getting as much done as possible, then I want to never be sitting idle with no task, so I choose a high value of p so that there’s always a task to be done. In the extreme, you’re always adding tasks.
If I want to minimize latency, I want to basically have an empty task queue and be doing nothing, so that if something comes along, I’m free and ready to work on it. In the extreme, you go for no tasks added? Or very close to it?
Application will indeed determine where you want to be here. If I’m very wealthy, I might want to have my own personal doctor sitting idle, ready to see me if ever I have the need. If I have workers that do something not in connection with each other, and all work is identical, then maybe keeping them at 100% capacity is what I want.
But in distributed systems where there’s work piling up next to bottlenecks, or you need to be responsive, it’s not one extreme or the other.
Really, I feel like where you fall on throughput vs latency has to depend on some other optimization target.
The interesting part for me is the quantitative dependence here. One observation about this model: if p=1/2, you would think that you are taking on tasks at the same average rate as you complete them, but the average latency goes to infinity.
One observation I’ve made about academia is that we value throughput (i.e. number of papers on CV at the end of N years) much more than latency, so under this model one naturally expects the incentives to lead everyone to take on more work than they can comfortably handle.
Despite the fact that many papers get written, this state of affairs leads the system to be surprisingly slow at writing individual papers.
Let me repeat the key point for emphasis: even if you only accept as much work as you are capable of doing on average, because of randomness your inbox will grow to infinity.
It’s funny, I named my Trello to-do board “The Infinite To-Do”. Which goes with the associated saying of mine that I’m fond of: “the work is infinite, it’s all a matter of prioritization.”
Being lazy here, can you sketch out the math for average latency goes to infinity? Or just the formula for latency.
When it comes to my personal task list, if I’m actually populating it with everything I might like to do, then it becomes likely that most things on it will never get done (infinite latency?), but the latency that matters is latency on high-priority tasks. It matters that the queue isn’t FIFO, but rather that I can pick what seems highest value. If I abandon tasks part-way through, that’s wasted effort, and there are costs to open loops even if I do eventually complete the work. But to prevent an open loop remaining open, I might finish a priority 5 task before I switch to the new priority 7 task, thereby introducing latency on it, and how much of that I want seems like a reasonable question.
I’ve come to the same realization over time about treating todos as priority queues.
Here’s a quick sketch of how the latency grows. Let’s take p=1/2 for simplicity. Recall that X_t = Y_t − min_{a≤t} Y_a, where X_t is the random walk with the barrier at 0 and Y_t is the walk without it. Then the average value of Y_t is 0 (since p=1/2 the walk is unbiased), while the average value of min_{a≤t} Y_a will be something like −√t, as a crude estimate. So putting that together, after running this scheme for t timesteps you expect your task list to have size about √t (see this, e.g.). Equivalently, an average new task expects to enter the queue with about E[X_t] = √t tasks in the FIFO ahead of it.
Another clarification: even though the expected latency goes to infinity, at p=1/2 exactly we expect every task to eventually get done in a finite amount of time. Indeed, the random walk Y_t will attain its minimum value infinitely often, and at each of these timesteps X_t = 0, i.e. all outstanding jobs are done. Once p>1/2, however, Y_t becomes transient and one does expect most jobs to never get done.
So to refine the point made before: if you only accept as much work as you are capable of doing on average, because of randomness your inbox will grow to infinity on average, expected latency will grow to infinity, and also, surprisingly perhaps, you expect to have nothing to do infinitely often.
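If you want to see the √t growth rather than take it on faith, here is a rough numerical check on the same collapsed walk; it is a sketch, not anything rigorous. Averaging many runs at p=1/2, the mean inbox size should grow proportionally to √t (the constant works out to about 0.8, i.e. √(2/π)).

```python
import math
import random

def inbox_size(p, steps, rng):
    # Final inbox size after `steps` steps of the collapsed walk with a barrier at 0.
    x = 0
    for _ in range(steps):
        x = max(x + (1 if rng.random() < p else -1), 0)
    return x

rng = random.Random(1)
for t in (1_000, 10_000, 100_000):
    avg = sum(inbox_size(0.5, t, rng) for _ in range(100)) / 100
    print(f"t={t:>7}  average X_t ≈ {avg:6.1f}   sqrt(t) ≈ {math.sqrt(t):6.1f}")
```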
nod nod
That all makes sense and seems well and good. It’s tempting to start seeing how this does or doesn’t apply in various contexts.
You mentioned the academic publishing one; where else have your thoughts gone on this?
I suppose I might also prompt for any other models you want to share.
I think it’s fairly typical across domains for throughput to be incentivized at the individual level over latency, leading organizations to move much slower than they can in principle. Academic publishing is just one example—the same dynamic exists whether your career performance is measured by amount of papers published, code delivered, sales completed, pages written, etc.
Imagine two organizations: in the first, each worker works at half capacity (say p = 1⁄4), but tasks are completed in weeks. In the second, each worker works at full capacity (say p = 1⁄2) and twice as many tasks are completed in the long run, but individual tasks are completed in years. I think there’s an interesting tradeoff here that is not always made thoughtfully, and most orgs tend towards the latter state by default. I would hazard a poorly-informed guess that many organizations fail by landing on the wrong side of the tradeoff from where they need to be (for example if their competitive advantage is very time-sensitive, or if completing a smaller number of tasks more quickly allows them to scale faster, doing more in the long run).
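To put rough numbers on that comparison, here is a sketch simulation of the original discrete-time model, measuring throughput and FIFO latency at p=1/4 versus p just under 1/2 (I use p=0.49 as a stand-in for “full capacity”, since at p=1/2 exactly the queue never stabilizes). The parameters are illustrative, not calibrated to any real organization; the qualitative point is that throughput roughly doubles while per-task latency grows far more than proportionally.

```python
import random
from collections import deque

def run(p, steps, seed=0):
    # Original discrete-time model: each step a task arrives with probability p,
    # and (independently) the oldest task in the inbox completes with probability 1/2.
    rng = random.Random(seed)
    inbox = deque()                      # arrival times of outstanding tasks (FIFO)
    waits = []
    done = 0
    for t in range(steps):
        if rng.random() < p:
            inbox.append(t)
        if inbox and rng.random() < 0.5:
            waits.append(t - inbox.popleft())
            done += 1
    return done / steps, sum(waits) / len(waits)

for p in (0.25, 0.49):
    throughput, latency = run(p, steps=500_000)
    print(f"p={p}: throughput ≈ {throughput:.2f} tasks/step, average latency ≈ {latency:.0f} steps")
```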
I think you’re right that the incentives push against this tradeoff being made sensibly. I think this is true for how individuals are incentivized, but employers, for example, are choosing to create that incentive.
My guess is what goes wrong is psychologically “how much you get done” (which is a function of how much you work) feels very real and very obviously a thing to care about, whereas latency is a non-obvious metric to care about. Or maybe it seems like a good thing, but it’s just not as salient.
Alternatively stated, it’s hard for an employer to look at the employee doing nothing and think that’s better than them trying to do something else productive.
Thinking about Lightcone: the list of things we think might be good to do is a mile high, it really does feel inconceivable that we’d take core staff and underload them. Instead we try to prioritize, try to switch readily, and try to minimize work in progress that results from the switching we do.
This isn’t obviously wrong to me. I am interested to think where we might be making the wrong tradeoff. But I also think the toy model above doesn’t model us all that well.
leading organizations to move much slower than they can in principle.
I’m interested to think about what moving faster via less latency would look like.
My guess is that actually the thing is to model work as a priority queue, and the thing needed is not to keep reserve capacity in the sense of being idle, but instead to preserve the ability to drop what you’re doing when something higher priority comes along.
If you assume any task can be abandoned free of charge, just work on the top priority thing until something better comes along.
Then the costs of being busy doing something already come from whether beginning that task involves a costly commitment one must uphold, plus switching costs, or the costs of not abandoning stuff and instead putting it onto the backburner/icebox.
I think some commitment to continuing work that’s started is simply psychological. People dislike abandoning things they’ve gotten invested in. I wonder whether cultivating the ability to readily drop stuff is the better approach over trying to start less stuff to begin with.
I agree that prioritization makes a big difference, and that the toy model above doesn’t capture a lot of things.
Let’s see if we can say something about prioritization formally. We can imagine that the jobs are partitioned into importances 1…n, 1 being highest, and a single worker takes on jobs of priority i at rate p_i, and always completes jobs of the highest priority first. This tracks how many tech companies handle bugs, for example.
Then whenever the worker is idle at priority 1, they would work on something at priority 2, and so on, so one is rarely “truly idle.” Making this generalization likely improves throughput (especially if it’s weighted by priority), but doesn’t necessarily solve the latency problem: each worker is still incentivized to load up on as many priority 1 tasks as possible to minimize priority-1 idleness, then as many priority-2 tasks as possible, and so on by induction. Priority-1 tasks will still expect to wait arbitrarily long, if enough such tasks are floating around. So, it seems to me for a priority system to improve latency of important tasks, there must be a relatively granular division of tasks to the point that there are few enough high-priority tasks that it is rare for them to get blocked on each other. What actually happens with such priority systems is opaque to me, but (afaik) it sometimes devolves into weird signaling games where the numbers don’t always mean that much? The added complexity creates a bunch of hidden costs.
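Here is a sketch of that prioritized model with just two classes (priority 1 and priority 2), to make the claim above concrete. Completion is still a coin flip per step and memoryless, so preemption is free in this model; the specific rates are illustrative. The thing to notice is that priority-1 latency depends almost entirely on the priority-1 load p1, and it still blows up as p1 approaches 1/2, no matter how the lower class is handled.

```python
import random
from collections import deque

def mean(xs):
    return sum(xs) / len(xs) if xs else float("nan")

def run(p1, p2, steps, seed=0):
    # Two priority classes; each step's completion (probability 1/2) goes to the
    # oldest outstanding priority-1 task if there is one, else to priority 2.
    rng = random.Random(seed)
    q1, q2 = deque(), deque()            # arrival times per class
    waits1, waits2 = [], []
    for t in range(steps):
        if rng.random() < p1:
            q1.append(t)
        if rng.random() < p2:
            q2.append(t)
        if rng.random() < 0.5:           # a completion lands this step
            if q1:
                waits1.append(t - q1.popleft())
            elif q2:
                waits2.append(t - q2.popleft())
    return mean(waits1), mean(waits2)

for p1 in (0.10, 0.30, 0.45):
    w1, w2 = run(p1, p2=0.04, steps=500_000)
    print(f"p1={p1:.2f}: priority-1 latency ≈ {w1:.1f} steps, priority-2 latency ≈ {w2:.1f} steps")
```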
On-call systems (I’m imagining firefighters, emergency response, on-call SRE’s but am not that informed about any particular field) are one example where certain work is very time-urgent, and the employers pay a huge premium for immediate attention to putting out fires. A good chunk of the time one is paid for being on-call is spent idle (or doing low-priority maintenance), essentially by design.
I suppose in a small organization like Lightcone where priorities are transparent and members conscientious and knowledgeable about the big-picture, one might trust individuals to consistently sort tasks (so that there is a distinct priority level for every job), and then indeed one can expect high-priority stuff to never ever get blocked stupidly, and still no thumbs to be twiddled. I’m not sure this scales.
Priority-1 tasks will still expect to wait arbitrarily long, if enough such tasks are floating around.
Oh, this is a key and valuable insight for me. I see now why Lightcone ends up having high latency despite prioritization (and theoretical ability to drop things in favor of higher priority): we struggle to prioritize to a sufficiently granular level, and a lot of things end up in the single approximate top priority bucket, meaning delays/latency.
Making the model a bit more complex, I think maybe we do prioritize granularly, but then switching costs prevent switching. A tad more formally: grant that tasks are more like Priority 1.00, Priority 1.05, Priority 2.03, etc. We will always swap out a Priority ~2 task for a Priority ~1 task; however, what we often have is Priority 1.05 and Priority 1.10, and then you’re asking yourself: is it worth it to drop the latter for the former? They’re approximately the same priority, so you don’t switch, due to switching costs, open-loop unfinished-work costs, momentum, etc.
Another piece I’ll flag but don’t know if it makes sense to get into: we really only ever have fuzzy distributions over the priority of tasks. I think this leads us to have a bias in favor of finishing current tasks even if we think the mean of the distribution for a new task is a bit higher. So that too introduces latency.
This updates me towards a higher rate of idleness* actually being the effective way to reduce latency.
*I think actually what you can do is have Priority 1 tasks and ~Priority 3 tasks. You aim for ~60% capacity at Priority 1 tasks, filling the rest of the time with explicitly interruptible Priority 3 tasks that you’d clearly drop in favor of Priority 1. This keeps Priority 1 latency low.
That’s a cool idea! Enforce a certain amount of time being spent on explicitly low-priority work. Each worker should not take on more priority-1 work than they can immediately handle.
Another good case study is a standing army where most of the time is spent in interruptible training (afaik).
Three other points that come to mind related to this discussion:
Low latency is closely related to redundancy and reliability. Having lots of workers idle waiting to jump on the newest fire means you overhired compared to the theoretical minimum number of man-hours you actually need. So this model predicts that the way to think about reducing latency is not [force a fixed number of workers to idle half the time] but [hire twice as many workers as your throughput demands].
I notice in my job, especially collaborative work, there seems to be a sharp transition between the latency of synchronous and asynchronous conversations. That is, if I can respond to an email within an hour, it will often result in a back-and-forth that resolves the issue live, but if I miss the hour window where the other person is still in context, then the full exchange takes weeks. This is purely anecdotal but suggests that the throughput-latency tradeoff may be surprisingly nonlinear, i.e. latency doesn’t improve much until you invest in it past a high threshold.
There are probably standard terms for this that I’m missing, but it seems like a large issue causing delays is the concept of [once I claim or am assigned this job it’s my responsibility and nobody else should touch it]. Making this more adaptive in some way likely smooths out variance and latency significantly, e.g. consistently giving up on tasks when swamped and sending them back to the pool to be reassigned.
I wonder if these match your experience?
Lightcone has some operational philosophy/practice that’s aimed at addressing some of these. We’re very Slack (the app) centric with a strong expectation that you’re checking Slack frequently, responding to things, not letting things get blocked on you.
Combined with this is an extreme push for ~transparency. Everyone (from the core team) is in almost all channels, so it’s very visible what is and isn’t getting worked on, preventing things getting dropped.
I actually think we’ve gone too far and I’m now in too many channels, causing me to check them less. Arguably I could manage my channels more, but it feels like I’m constantly fighting entropy.
Hiring more people would be nice but is so incredibly hard. The overlap between people who meet our weird bar and people who want to work for us is slim, meaning we grow quite slowly.
But even when you can hire, scaling management can be a challenge, i.e., whoever is responsible for tracking what is getting dropped, assigning priorities, etc.
In terms of returning to the pool, I think how well that’s doable depends on how much context the task has and how much the person who was assigned to it built up. For a large task, the first dozen hours on it might be context-building that can’t easily be passed on to another person. Or alternatively, each person who might work on it has their own approach/philosophy, and they’re not capable of just continuing with how someone else was tackling it. I think that definitely pushes for people keeping tasks assigned to them. Plus, while we are generalists, people are not equally strong at or capable of all tasks, another factor preventing reassignment.
Ah, yes, in our models so far we’ve been assuming fungibility of workers, which I think is the exception rather than the rule.
Sure, one shouldn’t take the prescriptions of toy models too seriously without much reflection.
Certainly this is at most one consideration among many for designing hiring strategy. For what it’s worth, I expect most organizations to land on the right side of the throughput vs. latency tradeoff by default anyway: delivering 10 total projects by the end of the year is likely better than getting 5 projects done, each within a month.
The main thing it seems useful to postulate is this granularity constraint, which dictates how priority needs to be adjusted to actually get time-sensitive work done quickly in a distributed system. Priorities need to be adjusted until most workers are not filled up on high priority work.
Regarding fungibility of workers, I agree that this is a big difficulty. Intuitively, one expects certain lynchpin people (your CEOs and 10x programmers) to be sufficiently indispensable that they should basically never be idle and there will always be high-priority work in their backlog. Anecdotally, though, a ton of progress tends to be unnecessarily blocked on these folks, and in severe cases I observe the entire rest of the organization decaying into slack variables, spinning their gears in a kind of learnedly helpless daze, blocked on the few lynchpins 4 days of the week.
Intuitively, one expects certain lynchpin people (your CEOs and 10x programmers) to be sufficiently indispensable that they should basically never be idle and there will always be high-priority work in their backlog.
They should never be idle, but also there shouldn’t be inappropriate latency on them getting to key things.
My recollection is fuzzy, but in Phoenix Project (for the audience, business novel on the Theory of Constraints), the focus becomes on relieving the bottleneck of their 100x developer.
Checking in, we’ve been going for a while now and it’s Friday. Anything we want to cover before maybe wrapping up, reflecting, etc?
Happy to wrap up!
To summarize the main points from my perspective:
Throughput and latency are two competing measures of human productivity. Throughput is the default and latency is often ignored (sometimes, but not always, for good reason).
One may find surprising and nonlinear gains in latency by artificially reducing throughput (enforcing idleness).
Prioritization is a more sophisticated way of making this tradeoff only on urgent work, minimizing the throughput sacrificed, but is not a magic bullet.
Meta-comment that I find the periodic notifications of “new users interested in dialogue matching” moderately frustrating, as there’s nothing for me to do about it.
Let me end with a little puzzle that suggests latency may be harder to see and more valuable than initially appears.
One of my favorite restaurants is a popular local sushi joint. Five days a week, there are basically no wait times, but Friday and Saturday night one expects to wait an hour for a table (they don’t take reservations). To illustrate my point in the extreme, let’s imagine that on Friday and Saturday there is 10x as much traffic as every other day of the week. What is the average wait time at this restaurant?
Naively, the average wait time from the perspective of the restaurant’s owner is 1 hour 2⁄7 of the time, so 2⁄7 of an hour, just over 17 minutes. Not bad, right?
However, if by average wait time I mean the amount of time waited, averaged over customers, then for every 5 customers that arrive Sunday through Thursday, 20 arrive on Friday and Saturday. So really 20⁄25 = 80% of customers are hitting busy hours, and the average wait time is 80% of an hour, 48 minutes. Not so good.
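Here is the same arithmetic in a few lines, generalized to any busy-day fraction and traffic multiplier (the numbers plugged in are just the ones from the example):

```python
def average_waits(busy_fraction, traffic_multiplier, busy_wait_hours):
    # Owner's view: average the wait over days. Customer's view: average over customers,
    # who are disproportionately likely to show up on the busy days.
    owner = busy_fraction * busy_wait_hours
    busy_customers = busy_fraction * traffic_multiplier
    quiet_customers = (1 - busy_fraction) * 1
    customer = busy_customers / (busy_customers + quiet_customers) * busy_wait_hours
    return owner, customer

owner, customer = average_waits(busy_fraction=2/7, traffic_multiplier=10, busy_wait_hours=1.0)
print(f"owner's average wait: {owner * 60:.0f} min, customer's average wait: {customer * 60:.0f} min")
```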
Taking us back to the general model, the more nonuniform the input stream of work is, the more the worker underperceives the latency experienced by the average job. There are many reasons on top of pure randomness that work deadlines might clump together and exacerbate this problem.
Taking us back to the general model, the more nonuniform the input stream of work is, the more the worker underperceives the latency experienced by the average job.
This indeed seems like the kind of thing human perception would go wrong on.
Summarizing a few key points from my end:
This conversation was revelatory for me in seeing why Lightcone feels high latency to me despite us being able to prioritize and task-switch, which naively made me think we could do better than the toy model.
Key challenges that prevent task-switching and increase latency seem to be:
Insufficiently granular or obvious task prioritization. So if you have many tasks given the highest priority, those tasks will end up having high latency.
Even when it seems some tasks might be somewhat higher priority, switching costs, costs of accumulating work in progress, and momentum/inertia will prevent switching, meaning latency.
It seems to me that you never want complete idleness, but if you want to preserve low latency on Highest Priority tasks, you might want to go for an average of 60% capacity, filling up the remainder with tasks that you’d obviously switch away from.
The standard term for this observation is “Inspection Paradox”, see e.g. this blog post.
I’ve done a fair bit of work on scaling (and EXTREMELY related but often overlooked cost-measurement and pricing) for a high-volume mid-latency message transformation and routing system for a giant cloud infrastructure provider.
I like that you mentioned hiring, as there really are only two ways to get more done: increase resources or be more efficient in handling the work. Both are generally much longer-term choices, and don’t solve a short-term problem.
In the short term, the only real option is load-shedding (dropping, or much-longer-than-normal queuing). If you’re overloaded, someone is going to be disappointed, and you should have a policy about who it will be, in order to decide whether you drop work evenly, hurting many stakeholders a little bit, or drop in a targeted way, annoying only a few stakeholders by a larger amount. Or the reverse thinking: decide what work is most important and drop everything else. You also need a policy for intake, and whether to reject, tentatively accept, or just accept and then maybe have to shed new work.
Which gets back to the value of the work units you’re performing. How upset a stakeholder will be if their work is dropped/delayed had better be related to how much they’ll pay for you to have sufficient slack to do that work in time for it to still be useful.
And all of this is made MUCH harder by work that is variable in resources required, illegible in value, and often more time-consuming to analyze than to actually perform. At scale, this requires a somewhat complicated acceptance mechanism—submission is tentative (meaning: acceptance of request does not imply any guarantee of action), and includes a bid of how much value it has to the requestor (could be just QOS level, with a contractual rate for that bucket). Then multiple statuses of request, most commonly “complete” with no intermediate, but also “queued” to indicate a longer-than-normal delay, and “dropped”. Most notably, don’t include “in-progress” or anything—that doesn’t mean anything different than “queued”—there’s no actual guarantee until it’s done. For human-granularity systems, you STILL need the policy, though the API is just the expected granularity of Slack communication.
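To make that concrete, here is a hypothetical sketch of what such an intake interface could look like; the names (Request, Intake, the bid field) and the capacity-based shedding policy are illustrative, not any particular provider’s API. The point it encodes: acceptance is tentative, the only statuses are queued/complete/dropped, and shedding is driven by the bid.

```python
from dataclasses import dataclass, field
from enum import Enum
import heapq
import itertools

class Status(Enum):
    QUEUED = "queued"        # accepted tentatively; no guarantee of action
    COMPLETE = "complete"
    DROPPED = "dropped"      # shed under load

@dataclass(order=True)
class Request:
    neg_bid: float                                   # negated so the highest bid pops first
    seq: int                                         # arrival order breaks ties
    payload: str = field(compare=False)
    status: Status = field(default=Status.QUEUED, compare=False)

class Intake:
    def __init__(self, capacity: int):
        self.capacity = capacity                     # how much queued work we're willing to hold
        self.queue: list[Request] = []
        self._seq = itertools.count()

    def submit(self, payload: str, bid: float) -> Request:
        # Acceptance is tentative: the request may later be shed.
        req = Request(-bid, next(self._seq), payload)
        heapq.heappush(self.queue, req)
        self._shed()
        return req

    def _shed(self):
        # One possible load-shedding policy: beyond capacity, drop the lowest-bid work.
        while len(self.queue) > self.capacity:
            victim = max(self.queue)                 # largest neg_bid, i.e. lowest bid
            victim.status = Status.DROPPED
            self.queue.remove(victim)
        heapq.heapify(self.queue)                    # restore heap order after removals

    def work_one(self):
        # Serve the highest-bid queued request, if any; note there is no "in-progress" status.
        if self.queue:
            req = heapq.heappop(self.queue)
            req.status = Status.COMPLETE
            return req
        return None
```

A per-stakeholder policy (spread the pain vs. concentrate it) would just replace _shed; the structure of tentative acceptance plus explicit statuses stays the same.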
As you note, slack (the concept, not the product) doesn’t need to be idle. It can be easily-pre-empted work, which is fairly cheap to drop or pause. For retail establishments, cleaning the store is a very common use of slack. For software companies, low-priority bugfixing or refactors or prototyping can be slack (they can also be important work, which can get very confusing).