Josh You

Karma: 248

data analyst at Epoch AI

@justjoshinyou13 on twitter

Josh You Jun 10, 2025, 6:49 PM
12 points
5
on: Ghiblification for Privacy
Why not generate them with a more generic art style?

Josh You Jun 3, 2025, 4:18 PM
1 point
0
in reply to: ryan_greenblatt’s comment on: ryan_greenblatt’s Shortform
One possibility I’ve wondered about is whether AI can automate this learning work: start from a transcript of someone trying to do things with AI, including mistakes and subsequent feedback, and then curate data from it that works well for RL fine-tuning. Or even distill it into examples for in-context learning (which probably works somewhat well, sometimes, today).

Josh You May 25, 2025, 4:32 PM
10 points
−3
on: We’re Not Advertising Enough (Post 3 of 6 on AI Governance)
What do you think about the success of AI 2027? It was very widely read, including by the vice president. That’s partly because it’s a striking narrative and, I presume, due to a decent amount of comms and press work. But it’s also backed by a lot of research which took up the majority of the effort, and I think that was instrumental to its success.
More generally, good research builds credibility, especially among other experts who also have a lot of credibility and can help amplify your message. Someone like Yoshua Bengio has a lot more influence for AI safety than a large number of dedicated AI safety advocates. And high-quality research could persuade the next Bengio.

Josh You May 20, 2025, 2:50 AM
2 points
0
on: Five Hinge‑Questions That Decide Whether AGI Is Five Years Away or Twenty
I don’t understand Hinge #2. Wall clock time for equally large training runs 18 months apart could easily shrink by ³⁄₄ for banal reasons like larger training clusters. Why would this be evidence of software-only acceleration?

Josh You May 3, 2025, 12:37 AM
5 points
2
on: Why does METR score o3 as effective for such a long time duration despite overall poor scores?
The RE-bench result is just for five tasks, while the second graph is for a broader suite of almost 200 tasks. I wouldn’t read much into o3 doing worse than other models at RE-bench because of the small sample.

Josh You Apr 25, 2025, 8:46 PM
5 points
0
on: Why would AI companies use human-level AI to do alignment research?
One factor is how compute-constrained capabilities research is compared to safety research.
There’s been lots of ink spilled recently on the degree to which capabilities research is compute-bottlenecked in a scenario where you have a surge of AI research labor caused by automated AI researchers, but everyone agrees that capabilities research requires a ton of compute. For safety research, I don’t feel well informed about how important compute is, but I think a lot of safety research being done right now is not very compute-intensive. I’m not aware off the top of my head of safety work that involves doing experiments approaching the scale of training a frontier model. So you could have a scenario where the large majority of resources is going towards capabilities, but the big increase in safety labor in absolute terms differentially helps safety.

Josh You Apr 24, 2025, 6:18 PM
24 points
3
on: To Understand History, Keep Former Population Distributions In Mind
Other examples:
- In 1914, Germany had a population of 67 million, about 70% of the US’s population of 98 million. In 1939, Germany’s population was still over half of the US’s. Today Germany has about 25% as many people.
- The Soviet Union had a higher population than the US throughout the Cold War. Russia, which only contains about half of the population of the former Soviet Union, has about 42% of the US population today.

Josh You Apr 18, 2025, 12:23 PM
14 points
1
in reply to: Chris_Leong’s comment on: Chris_Leong’s Shortform
But this only works if those less worried about AI risks who join such a collaboration don’t use the knowledge they gain to cash in on the AI boom in an acceleratory way. Doing so undermines the very point of such a project, namely, to try to make AI go well. It is incredibly damaging to trust within the community.
...This is less about attacking those three folks and more just noting that we need to strive to avoid situations where things like this happen in the first place.
(note: I work at Epoch) This attitude feels like a recipe for creating an intellectual bubble. Of course people will use the knowledge they gain in collaboration with you for the purposes that they think are best. I think it would be pretty bad for the AI safety community if it just relied on forecasting work from card-carrying AI safety advocates.

Josh You Mar 27, 2025, 3:39 PM
1 point
0
on: METR: Measuring AI Ability to Complete Long Tasks
I think there are two models that you measured time horizon for, Claude 3 Opus and GPT-4 Turbo, that didn’t make it onto the main figure. Is that right? There are 13 models in Figure 5, which shows the time horizon curves for a bunch of models across the full test suite, and only 11 dots on Figure 1.

Josh You Mar 18, 2025, 3:56 PM
1 point
0
in reply to: DAL’s comment on: DAL’s Shortform
AI has probably increased valuations for Big Tech (particularly Nvidia) by at least a few trillion over the past two years. So part of this is that investors think OpenAI/Anthropic will only capture around 10% of total AI profits.

Josh You Mar 2, 2025, 2:42 AM
3 points
0
in reply to: Vladimir_Nesov’s comment on: OpenAI releases GPT-4.5
65T tokens doesn’t get you to 1e26 FLOP with 100B active params? You’d need well over 100T tokens: 6 * 100 billion * 65 trillion is 3.9e25 FLOP.
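As a rough check of the arithmetic above, here is a minimal sketch using the standard approximation that training compute is about 6 * active parameters * tokens. The function names are made up for illustration; the inputs are just the figures quoted in this comment.

```python
def training_flop(active_params: float, tokens: float) -> float:
    """Approximate training compute as 6 * active params * tokens."""
    return 6.0 * active_params * tokens


def tokens_needed(target_flop: float, active_params: float) -> float:
    """Invert the same approximation to get the token count for a FLOP target."""
    return target_flop / (6.0 * active_params)


ACTIVE_PARAMS = 100e9   # 100B active params (figure from the comment)
TOKENS = 65e12          # 65T tokens (figure from the comment)
TARGET_FLOP = 1e26      # threshold under discussion

flop = training_flop(ACTIVE_PARAMS, TOKENS)
needed = tokens_needed(TARGET_FLOP, ACTIVE_PARAMS)

print(f"Compute at 65T tokens: {flop:.2e} FLOP")
print(f"Tokens needed for 1e26 FLOP: {needed / 1e12:.0f}T")
```

Under this approximation, 65T tokens gives 3.9e25 FLOP, and reaching 1e26 FLOP with 100B active params takes roughly 167T tokens, which is what "well over 100T" refers to.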
GPT-4.5 being trained on fewer tokens than GPT-4o doesn’t really make sense. GPT-4.5 only having 5x more active params than GPT-4o doesn’t quite make sense either, though I’m not as confident that’s wrong.
1e26 FLOP would have had a significant opportunity cost. Remember that OpenAI was and is very GPU constrained and may have valued GPU hours in a large-scale cluster a lot more than $2/hour. It would be worth it to make your flagship model good, but not worth it if it barely has any effect on your flagship model. I don’t think it’s a good idea to reason backwards from some alleged compute budget that OpenAI might have had at X date to the training FLOP of a model trained then.

Josh You Mar 1, 2025, 11:55 PM
1 point
0
in reply to: Vladimir_Nesov’s comment on: OpenAI releases GPT-4.5
I don’t think GPT-4o was trained on 1e26 FLOP or particularly close to it. Overtraining is common, but GPT-4o being overtrained by 10x for 1e26 FLOP is kind of a strong and surprising claim (some models like Llama 3 8B are extremely overtrained, but they’re small so this overtraining is cheap). I think a more natural explanation is that it improves on GPT-4 because of superior post-training and other innovations.

Josh You Mar 1, 2025, 11:40 PM
4 points
0
in reply to: Daniel Kokotajlo’s comment on: Daniel Kokotajlo’s Shortform
The high cost and slow speed of GPT-4.5 seem like a sign OpenAI is facing data constraints, though we don’t actually know the parameters and OpenAI might be charging a bigger margin than usual (it’s a “research preview”, not a flagship commercial product). If data were more abundant, wouldn’t GPT-4.5 be more overtrained and have fewer parameters?
edit: FWIW Artificial Analysis measures GPT-4.5 at a not-that-bad 50 tokens per second, whereas I’ve been experiencing a painfully slow 10-20 tokens/second in the chat app. So it may just be growing pains until they get more inference GPUs online. But OpenAI does call it a “chonky” model, implying significant parameter scaling.

Josh You Feb 21, 2025, 11:08 PM
10 points
8
in reply to: Vladimir_Nesov’s comment on: Vladimir_Nesov’s Shortform
if OpenAI follows the usual naming convention of roughly 100x in raw compute.
I doubt this is a real convention. I think OpenAI wanted to call Orion GPT-5 if they thought it was good enough to deserve the name.

Josh You Jan 15, 2025, 3:52 PM
4 points
3
on: Implications of the inference scaling paradigm for AI safety
In Holden Karnofsky’s “AI Could Defeat All Of Us Combined” a plausible existential risk threat model is described, in which a swarm of human-level AIs outmanoeuvre humans due to AI’s faster cognitive speeds and improved coordination, rather than qualitative superintelligence capabilities. This scenario is predicated on the belief that “once the first human-level AI system is created, whoever created it could use the same computing power it took to create it in order to run several hundred million copies for about a year each.” If the first AGIs are as expensive to run as o3-high (costing ~$3k/task), this threat model seems much less plausible.
I wonder how different the reasoning paradigm is, actually, from the picture presented here. After all, running a huge number of AI copies in parallel is… scaling up test-time compute.
The overhang argument is a rough analogy anyway. I think you are invoking the intuition of replacing the AI equivalent of a very large group of typical humans with the AI equivalent of a small number of ponderous geniuses, but those analogies are going to be highly imperfect in practice.

Josh You Dec 23, 2024, 5:59 PM
5 points
0
in reply to: Orpheus16’s comment on: Daniel Tan’s Shortform
By several reports (e.g. here and here), OpenAI is throwing enormous amounts of training compute at o-series models. And if the new RL paradigm involves more decentralized training compute than the pretraining paradigm, that could lead to more consolidation into a few players, not less, because pretraining* is bottlenecked by the size of the largest cluster. E.g. OpenAI’s biggest single compute cluster is similar in size to xAI’s, even though OpenAI has access to much more compute overall. But if it’s just about who has the most compute then the biggest players will win.
*though pretraining will probably shift to distributed training eventually

Josh You Sep 25, 2024, 8:41 PM
2 points
0
on: The Great Data Integration Schlep
AI systems can presumably be given at least as much access to company data as human employees at that company. So if rapidly scaling up the number and quality of human workers at a given company would be transformative, AI agents with >=human-level intelligence can also be transformative.

Josh You Aug 30, 2024, 2:40 PM
6 points
2
in reply to: Nathan Helm-Burger’s comment on: things that confuse me about the current AI market.
I think a little more explanation is required on why there isn’t already a model with 5-10x* more compute than GPT-4 (which would be “4.5 level” given that GPT version numbers have historically gone up by 1 for every two OOMs, though I think the model literally called GPT-5 will only be a roughly 10x scale-up).
You’d need around 100,000 H100s (or maybe somewhat fewer; Llama 3.1 was 2x GPT-4 and trained using 16,000 H100s) to train a model at 10x GPT-4. This has been available to the biggest hyperscalers since sometime last year. Naively it might take ~9 months from taking delivery of chips to releasing a model (perhaps 3 months to set up the cluster, 3 months for pre-training, 3 months of post-training, evaluations, etc). But most likely the engineering challenges in building a cluster that big, which is unprecedented, and perhaps high demand for inference, have prevented them from concentrating that much compute into one training run in time to release a model by now.
*I’m not totally sure the 5x threshold (1e26 FLOP) hasn’t been breached but most people think it hasn’t.
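For the "maybe somewhat fewer" estimate above, here is a back-of-the-envelope sketch. It assumes GPU count scales linearly with training compute at a fixed training duration and per-GPU utilization, which is a simplification; the function name is made up for illustration, and the reference point is the Llama 3.1 figure quoted in this comment.

```python
def gpus_needed(ref_gpus: float, ref_scale: float, target_scale: float) -> float:
    """Scale a reference GPU count linearly with compute.

    Assumes the same training duration and per-GPU utilization as the
    reference run, which will not hold exactly in practice.
    """
    return ref_gpus * target_scale / ref_scale


REF_GPUS = 16_000   # H100s used for Llama 3.1 (figure from the comment)
REF_SCALE = 2.0     # Llama 3.1 compute, in multiples of GPT-4 (from the comment)

gpus_for_5x = gpus_needed(REF_GPUS, REF_SCALE, 5.0)
gpus_for_10x = gpus_needed(REF_GPUS, REF_SCALE, 10.0)

print(f"H100s for 5x GPT-4: {gpus_for_5x:,.0f}")
print(f"H100s for 10x GPT-4: {gpus_for_10x:,.0f}")
```

Under those assumptions, 10x GPT-4 works out to about 80,000 H100s, a bit below the round 100,000 figure, and the 5x threshold to about 40,000.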

Josh You Aug 30, 2024, 2:28 PM
17 points
8
in reply to: habryka’s comment on: things that confuse me about the current AI market.
Llama 405B was trained on a bunch of synthetic data in post-training for coding, long-context prompts, and tool use (see section 4.3 of the paper).

Josh You Jun 30, 2024, 11:25 PM
4 points
0
in reply to: johnswentworth’s comment on: johnswentworth’s Shortform
AI that can rewrite CUDA is a ways off. It’s possible that it won’t be that far away in calendar time, but it is far away in terms of AI market growth and hype cycles. If GPT-5 does well, Nvidia will reap the gains more than AMD or Google.