I agree with you. Unless the signal is so strong that people believe that their personal experience is not representative of the economy, it’s going to be overweighted. “I and half the people I know make less” will lead to discontent about the state of the economy. “I and half the people I know make less, but I am aware that GDP grew 40%, so the economy must be doing fine despite my personal experience” is possible, but let’s just say it’s not our prior.
Daniel V
Exactly, which is why the metric Mazlish prefers is so relevant and not bizarre, unless the premise that people judge the economy from their own experiences is incorrect.
Why is this what matters? It’s a bizarre metric. Why should we care what the median change was, instead of some form of mean change, or change in the mean or median wage?
The critique that the justification wasn’t great because the mean wage dropped a lot in the example is fair. Yet, in the proposed alternative example it remains quite likely that people will perceive the economy as having gotten worse, even if the economy is objectively so much better: 2/3 will say they’re personally worse off, insufficiently adjust for the impersonal ways of assessing the economy, and ultimately say the economy is worse.
Neither the median change nor the change in the median is a bizarre metric. The change in the median may be great for observers trying to understand general trajectories of income when you lack panel data, but since people use their own lives to assess whether they are better off and in turn overweight that when they judge the economy, the median change is actually more useful for understanding the translation from people’s lives into their perceptions.
Consider a different example (also in real terms):
T1: A makes $3, B makes $3, C makes $3, D makes $10, E makes $12
T2: A makes $2, B makes $2, C makes $3, D makes $9, E makes $16
The means show nice but not as crazy economic growth ($6.20 to $6.40), and the change in the median is $0 ($3 to $3) - “we’re not poorer!” However, the median change is -$1. And people at T2 will generally feel worse off (3/5 will say they can’t buy as much as they could before, so “this economy is tough”).
Contrast that with (still in real terms):
T1: A makes $2, B makes $2, C makes $4, D makes $10, E makes $12
T2: A makes $3, B makes $3, C makes $3, D makes $10, E makes $12
The means show nice but not as crazy economic growth ($6 to $6.20), and the change in the median is -$1 ($4 to $3) - “we’re poorer!” However, the median change is $0. And people at T2 will generally feel like things are going okay (only 1 person will feel worse off).
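To make the arithmetic explicit, here is a minimal sketch (plain Python; the function and variable names are mine) computing both metrics and the share of people who are worse off in each example:

```python
from statistics import mean, median

# Example 1: most people lose a little, one person gains a lot
t1 = {"A": 3, "B": 3, "C": 3, "D": 10, "E": 12}
t2 = {"A": 2, "B": 2, "C": 3, "D": 9, "E": 16}

# Example 2: most people gain a little or hold steady, one person loses a little
u1 = {"A": 2, "B": 2, "C": 4, "D": 10, "E": 12}
u2 = {"A": 3, "B": 3, "C": 3, "D": 10, "E": 12}

def summarize(before, after):
    changes = [after[k] - before[k] for k in before]  # each person's own change
    return {
        "mean before -> after": (mean(before.values()), mean(after.values())),
        "change in the median": median(after.values()) - median(before.values()),
        "median change": median(changes),  # the Mazlish-style metric
        "share worse off": sum(c < 0 for c in changes) / len(changes),
    }

print(summarize(t1, t2))  # change in the median = 0, median change = -1, 3/5 worse off
print(summarize(u1, u2))  # change in the median = -1, median change = 0, 1/5 worse off
```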
And these examples compare against a baseline of 0. Mazlish’s post illustrates that people will probably not compare to 0 and instead to recent trajectories (“I got a 3% raise last year, what do you mean my raise this year is 2%?!”), so #1 means people will be dissatisfied. #2, as borne out in the data, also means dissatisfaction. And #3, largely due to timing, means further dissatisfaction.
Then it is no surprise that exit polls show people who were most dissatisfied with the economy under Biden (and assumed Harris would be more of that) voted for Trump. Sure, there’s some political self-deception bias going on (see charts of economic sentiment vs. date by party affiliation), but note that the exit polls are correlational—they can indicate that partisanship is a hell of a drug or that people are rationally responding to their perceptions. It’s likely both. And if your model of those perceptions is inferior in the ways Mazlish notes, you’d wrongly think people would have been happy with the economy.
Literally macroeconomics 101. Trade surpluses aren’t shipping goods for free. There is a whole balance of payments to consider. I’m shocked EY could get that so wrong, surprised that lsusr is so ready to agree, and confused because surely I missed something huge here, right?
I guess I misunderstood you. I figured that without “regression coefficients,” the sentence would be a bit tautological: “the point of randomized controlled trial is to avoid [a] non-randomized sample,” and there were other bits that made me think you had an issue with both selection bias (agree) and regressions (disagree).
I share your overall takeaway, but at this point I am just genuinely curious why the self-selection is presumed to be such a threat to internal validity here. I think we need more attention to selection effects on the margin, but I also think there is a general tendency for people to believe that once they’ve identified a selection issue the results are totally undermined. What is the alternative explanation for why semaglutide would disincline people who would have had small change scores from participating, or incline people who would have had large change scores to participate, in the alcohol self-administration experiment (remember, this is within-subjects)? Maybe those who had the most reduced cravings wanted to see more of what these researchers could do? But that process would also occur among the placebo group, so it’d work via the share of people with large change scores being greater in the semaglutide group, which is...efficacy. There’s nuance there, but hard to square with a lack of efficacy.
That said, still agree that the results are no slam dunk. Very specific population, very specific outcomes affected, and probably practically small effects too.
I appreciate this kind of detailed inspection and science writing, we need more of this in the world!
I’m writing this comment because of the expressed disdain for regressions. I do share the disappointment about how the randomization and results turned out. But for both, my refrain will be: “that’s what the regression’s for!”
This contains the same data, but stratified by if people were obese or not:
Now it looks like semaglutide isn’t doing anything.
The beauty of exploratory analyses like these is that you can find something interesting. The risk is that you can also read into noise. Unfortunately, they only plotted these results rather than reporting the regression alongside them, which could tell us whether there is any effect beyond the lower baseline. eTable 3 confirms that the interaction between condition and week is non-significant for most outcomes, which the authors correctly characterized. That’s what the regression’s for!
This means the results are non-randomized.
Yes and no. People were still randomized to condition and it appears to be pretty even attrition. Yes, there is an element of self-selection, which can constrain the generalizability (i.e., external validity) of the results (I’d say most of the constraint is actually just due to studying people with AUD rather than the general population, but you can see why they’d do such a thing), but that does not necessarily mean it broke the randomization, which would reduce the ability to interpret differences as a result of the treatment (i.e., internal validity). To the extent that you want to control for differences that happen to occur or have been introduced between the conditions, you’ll need to run a model to covary those out. That’s what the regression’s for!
the point of RCTs is to avoid resorting to regression coefficients on non-randomized samples
My biggest critique is this. If you take condition A and B and compute/plot mean outcomes, you’d presumably be happy that it’s data. But computing/plotting predicted values from a regression of outcome on condition would directly recover those means. And from what we’ve seen above, adjustment is often desirable. Sometimes the raw means are not as useful as the adjusted/estimated means—to your worry about baseline differences, the regression allows us to adjust for that (i.e., provide statistical control where experimental control was not sufficient). And, instead of eyeballing plots, the regressions help tell you if something is reliable. The point of RCTs is not to avoid resorting to regression coefficients. You’ll run regressions in any case! The point of RCTs is to reduce the load your statistical controls will be expected to lift by utilizing experimental controls. You’ll still need to analyze the data and implement appropriate statistical controls. That’s what the regression’s for!
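For concreteness, here’s a minimal sketch of what “the regression” amounts to in practice (statsmodels; the data frame and column names are hypothetical stand-ins, not the trial’s actual variables):

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per participant; hypothetical columns:
#   outcome   - post-treatment drinking outcome
#   condition - "semaglutide" or "placebo" (randomly assigned)
#   baseline  - the same outcome measured before treatment
#   obese     - obesity status (the stratification variable from the plots)
df = pd.read_csv("trial_data.csv")

# Unadjusted model: its predicted values simply recover the raw condition means
raw = smf.ols("outcome ~ C(condition)", data=df).fit()

# Adjusted model: statistical control for baseline differences that randomization
# did not fully equalize; the condition coefficient is the adjusted treatment effect
adj = smf.ols("outcome ~ C(condition) + baseline + C(obese)", data=df).fit()

print(raw.summary())
print(adj.summary())
```

The point is that the estimated means from `raw` are just the plotted means, and `adj` is the same machinery with the covariates doing the statistical control.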
I really like this succinct post.
I intuitively want to endorse the two growth rates (if it “looks” linear right now, it might just be early exponential), but surely this is not that simple, right? My top question is “What are examples of linear growth in nature and what do they tell us about this perception that all growth is around zero or exponential?”
A separate thing that sticks out is that having two growth rates does not necessarily imply generally two subjective levels.
This can be effectively implemented by the government accumulating tax revenues (largely from the rich) in good times and spending them on disaster relief (largely on the poor) in bad times. It lets price remain a signal while also expanding supply.
Taxation is better than a ban, but in this case it remains an attempt at price control. “Documented” cost increases is doing a lot of work. Better than “vibes about price,” but it is the same deal: the government “knows better” what prices should be than what is revealed by the market. I’d argue that if the government doesn’t like what the market is yielding, it can get involved in the market and help expand supply itself, which we see governments attempt during disaster relief already.
Agree: we’re not so shy about pursuing a good vibe, and bad vibes are also informative.
Thanks, you had mentioned the short- vs. long-run before, but after this discussion it is more foregrounded and the “racing” explanation makes sense. :) Though I appreciated the references to marginal value and marginal cost.
You’re assuming that the economy will produce new jobs faster than the factories will produce new chips and robots to fill those jobs.
Well, the assumptions are primarily that the supply and demand for AI labor will vary across markets and secondarily that labor can flow across markets. This is an important layer separate from just seeing who (S or D) wins the race. If there is only one homogeneous market, then the price trajectory for AI labor (produced through the racing dynamics) tells you all you’ll need to know about the price trajectory for its human substitute. So the question is just which is faster.
But if there are heterogeneous markets, “which is faster” is informative only for that market and the price of human labor as a substitute in that market. The price trajectory for AI labor in other markets might be subject to different “which is faster” racing dynamics. Then, because of composition effects, the trajectory for the average price of AI labor that is performed may diverge from the trajectory for the average price of human labor that is performed.
This is true even if you assume the economy has no vacancies and will not produce new jobs (i.e., labor cannot flow across markets). For example, average hourly earnings spiked during COVID because the work that was still being performed was high-cost/value labor, an increase seemingly entirely due to composition [BLS]. Although I am alleging that predicting the price trajectory remains difficult even if you take a stance on the racing dynamics (because you need to know what the alternative human jobs are), in that world where jobs are simply destroyed, the total value accruing to human laborers certainly goes down. This is why I think the labor flows could be considered a secondary assumption for the left side, depending on how much you think that side would be arguing—labor flows are not dispositive of what the price changes will be (the focus of the post was on price), but they definitely affect whether human labor commands the same total value.
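A toy version of the composition point, with made-up wages: no individual’s wage changes, but the average wage of the work still being performed jumps because the low-wage work stops happening.

```python
from statistics import mean

# Hypothetical hourly wages; everyone is working at first
wages = {"server": 15, "cashier": 14, "nurse": 40, "engineer": 60}
print(mean(wages.values()))  # 32.25

# Shock: the low-wage jobs stop being performed; nobody's wage changed,
# yet "average hourly earnings" of the remaining work rises sharply
still_working = {job: w for job, w in wages.items() if w >= 30}
print(mean(still_working.values()))  # 50.0
```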
I like that this post lays out the dilemma in principles A (marginal value dominates) and B (marginal cost dominates). One quibble is that the effects are on the supply and demand curves, not on the quantities supplied and demanded, i.e., it’s not about the slopes of the curves but the location of the new equilibrium as the curves shift left or right. It’s not about which part “equilibrates” faster (with what?) but about the relative strength of the shifts.
If AGI shifts the demand for AI labor to the right, under constant supply, we’d expect a price increase and more AI labor created and consumed. If AGI shifts the supply for AI labor to the right, under constant demand, we’d expect a price decrease and more AI labor created and consumed. Both of these things would happen, so there is a wide range of possible price changes (even no change in price) consistent with more AI labor created and consumed, but what happens to the price depends on which shift is “stronger.”
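As a minimal sketch of that “which shift is stronger” point (linear curves, every number here hypothetical): quantity rises in both scenarios, but price can go either way.

```python
# Inverse linear curves: demand P = a_d - b_d*Q, supply P = a_s + b_s*Q.
# A rightward demand shift raises a_d; a rightward supply shift lowers a_s.

def equilibrium(a_d, b_d, a_s, b_s):
    q = (a_d - a_s) / (b_d + b_s)  # quantity where demand price equals supply price
    return a_d - b_d * q, q        # (price, quantity)

base       = equilibrium(a_d=100, b_d=1.0, a_s=10,  b_s=1.0)  # (55.0, 45.0)
demand_led = equilibrium(a_d=140, b_d=1.0, a_s=5,   b_s=1.0)  # (72.5, 67.5): both shift right, demand dominates, price rises
supply_led = equilibrium(a_d=110, b_d=1.0, a_s=-40, b_s=1.0)  # (35.0, 75.0): supply shift dominates, price falls

print(base, demand_led, supply_led)
```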
Still, with the quantity of AGI labor created and consumed increasing, you might wonder how the experience curve impacts it—that’s just more right-shift in the supply curve, so maybe we don’t have to wonder after all. What about the effect on substitutes like human labor? Well, if the economy has a set number of jobs, you’d expect a lot of human labor displaced, but if the economy can find other useful work for those people, they will do those other jobs, which might be lower-paying (no more coding tasks for you—enjoy 7-Eleven), reducing the average price of human labor, or might be higher-paying (no more coding tasks for you—enjoy this support role for AGI that, because of its importance, commands a premium), increasing the average price of human labor.
Can those niches exist? Yes, the supply and demand curves are curves of heterogeneous values and production functions. And markets are imperfect. Won’t those niches eventually disappear? Well, rinse and repeat. See ATMs and bank tellers; also see building luxury housing and its effects on rents throughout the housing stock.
I don’t think it’s only talking past each other—it’s a genuine ton of uncertainty.
I’m here to say that this is not some property specific to p-values; it’s about the credibility of the communicator.
If researchers make a bunch of errors all the time, especially errors that change their conclusions, indeed you can’t trust them. It turns out (BW11) that researchers are more credible than the headline error rates suggest: the errors they make tend not to change the conclusions of the test (i.e., the chance of drawing a wrong conclusion from their data (“gross error” in BW11) was much lower than the headline rate), and (admittedly I’m going out on a limb here) it is very possible that errors which change the conclusion of a particular test do not change the overall conclusion about the general theory. For example, if theory says X, Y, and Z should happen, and you find support for X and Y and marginal-support-that-is-no-longer-significant for Z, the theory is still pretty intact unless you really care about using p-values in a binary fashion. If theory says X, Y, and Z should happen, and you find support for X and Y but the support for Z is no longer significant at all, that’s more of an issue. And given how many tests are in a paper, it’s also possible theory says X, Y, and Z should happen, you find support for X, Y, and Z, but it turns out your conclusion about W reverses, which may or may not really say something about your theory.
I don’t think it is wise to throw the baby out with the bathwater.
Supply side: Price approaches the minimum average total cost, not marginal cost. Maybe if people accounted for it more finely (e.g., charging themselves “wages” and “rent”), cooking at home would be in the ballpark (assuming equal quality of inputs and outputs across venues...), but that just illustrates how real costs can explain a lot of the differential without having to jump to regulation and barriers to entry (yes, those are nonzero too!).
Demand side: Complaints in the OP about the uninformativeness of ratings also highlight how far we are from perfect competition (also, e.g., heterogeneous products), so you can expect nonzero markups. We aren’t in equilibrium and in the long run we’re all dead, etc.
I’m a big proponent of starting with the textbook economic analysis, but I was surprised by the surprise. Let’s even assume perfect accounting and competition:
Draw a restaurant supply curve in the middle of the graph. In the upper right corner, draw a restaurant demand curve (high demand given all the benefits I listed). Equilibrium price is P_r*. Now draw a home supply curve to the far left, indicating an inefficient supply relative to restaurants (for the same quantity, restaurants do it “cheaper”). In the bottom left corner, draw a home demand curve (again the point is I demand eating out more than eating at home). Equilibrium price for those is P_h*. It’s very easy to draw where P_h* < P_r*.
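A numeric version of that picture (all the curves are hypothetical and linear just for illustration): restaurants have the more efficient supply but face much higher demand, and the home market still clears at a lower price.

```python
# Inverse linear curves: demand P = a_d - b_d*Q, supply P = a_s + b_s*Q

def equilibrium_price(a_d, b_d, a_s, b_s):
    q = (a_d - a_s) / (b_d + b_s)  # quantity where the curves cross
    return a_d - b_d * q

p_r = equilibrium_price(a_d=60, b_d=0.5, a_s=5,  b_s=0.5)  # restaurants: high demand, cheap supply
p_h = equilibrium_price(a_d=20, b_d=1.0, a_s=10, b_s=2.0)  # home: low demand, costlier supply
print(p_r, p_h)  # P_r* = 32.5 > P_h* ~ 16.7, even though home production costs more per plate
```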
Cooking at Home Being Cheaper is Weird
I like the argument that the scaling should make the average marginal cost per plate lower in restaurants than at home, but I find cooking at home being cheaper not weird at all. First, there are also real fixed costs to account for, not just regulatory costs.
More importantly, the average price per plate is not just a function of costs, it’s a function of the value that people receive. Cooking at home does give some nice benefits, but eating out gives some huge ones: essentially leisure, time savings (a lot of things get prepped before service), no dishes, and possibly lower search costs (“what’s for dinner tonight?”).
A classic that seemingly will have to be re-argued until the end of time. Other allocation methods are not clearly more egalitarian and are less efficient (it depends on the correlation matrix of WTP, need, time budget, etc., plus one’s own judgment of fairness, but money prices come out looking great a lot of the time). In some cases, even prices don’t perform great (addressed in some comments on this post), but they’re better than the alternatives.
For more reading: https://www.lesswrong.com/posts/gNodQGNoPDjztasbh/lies-damn-lies-and-fabricated-options?commentId=nG2X7x3n55cb3p7yB
To get Robin worried about AI doom, I’d need to convince him that there’s a different metric he needs to be tracking
That, or explain the factors for why Robin should update his timeline for AI/computer automation taking “most” of the jobs.
AI Doom Scenario
Robin’s take here strikes me both as an uncooperative thought-experiment participant and as a decently considered position. It’s like he hasn’t actually skimmed the top doom scenarios discussed in this space (and that’s coming from me...someone who has probably thought less about this space than Robin) (also see his equating corporations with superintelligence—he’s not keyed into the doomer use of the term and not paying attention to the range of values it could take).
On the other hand, I find some affinity with my own skepticism of AI doom; my vibe is that the common ground lies in the notion that authorization lines will be important.
On the other other hand, once the authorization bailey is under siege by the superhuman intelligence aspect of the scenario, Robin retreats to the motte that there will be billions of AIs and (I guess unlike humans?) they can’t coordinate. Sure, corporations haven’t taken over the government and there isn’t one world government, but in many cases, tens of millions of people coordinate to form a polity, so why would we assume all AI agents will counteract each other?
It was definitely a fun section and I appreciate Robin making these points, but I’m finding myself about as unassuaged by Robin’s thoughts here as I am by my own.
Robin: We have this abstract conception of what it might eventually become, but we can’t use that abstract conception to do very much now about the problems that might arise. We’ll need to wait until they are realized more.
When talking about doom, I think a pretty natural comparison is nuclear weapon development. And I believe that analogy highlights how much more right Robin is here than doomers might give him credit for. Obviously a lot of abstract thinking and scenario consideration went into developing the atomic bomb, but a lot of safeguards were also developed as they built prototypes and encountered snags. If Robin is so correct that no prototype or abstraction will allow us to address safety concerns, so we need to be dealing with the real thing to understand it, then I think a biosafety analogy still helps his point. If you’re dealing with GPT-10 before public release, train it, give it no authorization lines, and train the people (plural) studying it not to follow its directions. In line with Robin’s competition views, use GPT-9 agents to help out on assessments if need be. But again, Robin’s perspective here falls flat and is of little assurance if it just devolves into “let it into the wild, then deal with it.”
A great debate and post, thanks!
Paper from the Federal Reserve Bank of Dallas estimates 150%-300% returns to government nondefense R&D over the postwar period on business sector productivity growth. They say this implies underfunding of nondefense R&D, but that is not right. One should assume decreasing marginal returns, so this is entirely compatible with the level of spending being too high. I also would not assume conditions are unchanged and spending remains similarly effective.
At low returns, you might question whether it’s good enough to invest more compared to other options (e.g., at a 5% return, simply not incurring the added deficit to be financed at 5% is arguably preferable; at 7%, maybe your value function is such that not incurring that deficit is still preferable). But at such high returns, unless you think the private sector is achieving a ballpark level of marginal returns, invest, baby, invest! The marginal returns would have to be insanely diminishing for it not to make sense to invest more, and even that would imply we’re investing at just about the optimal level (if the marginal return of the next $1 were 0%, we shouldn’t invest more, but we shouldn’t invest less either, because the current marginal return is 150%). Holding skepticism about the estimated return itself would be a different story.
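Just to put rough numbers on “insanely diminishing” (the decay rate here is a made-up assumption, not anything from the paper): even if the marginal return were cut in half every time spending doubled, it would take roughly five doublings of spending before the next dollar stopped beating a 5% financing cost.

```python
import math

current_marginal_return = 1.50  # the paper's low-end 150% estimate
financing_cost = 0.05           # say, 5% cost of the added deficit
halving_per_doubling = 0.5      # assumed: marginal return halves each time spending doubles

doublings = math.log(current_marginal_return / financing_cost) / math.log(1 / halving_per_doubling)
print(doublings)  # ~4.9 doublings (~30x current spending) before investing more stops paying
```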
That is an additional 15% of kids not sleeping seven hours
I was not aware of the concomitant huge drop in sleep (though it’s obvious in retrospect). Maybe it’s more important to limit screen time at night, when you’re alone in your room not sleeping. Being constantly lethargic as a result may also contribute to depressive symptoms (and be one itself). It will be very important to figure out the mechanism(s) by which smartphone use hurts kids.
Really enjoyed the post, but in the interest of rationality,
This question rests on the false premise(s) (i.e., model misspecification(s)) that homosexuality is only a function of birth order and that the Chelsea nightclub probability doesn’t stem from heavy selection. Relatedly, gwern notes that “surely homosexuality is not the primary trait the Catholic Church hierarchy is trying to select for.” Maybe this was supposed to be more tongue-in-cheek. But identifying a cause does not require that it fully explain something on its own.