I work primarily on AI Alignment. Scroll down to my pinned Shortform for an idea of my current work and who I’d like to collaborate with.
Website: https://jacquesthibodeau.com
Twitter: https://twitter.com/JacquesThibs
GitHub: https://github.com/JayThibs
Do we expect future model architectures to be biased toward out-of-context reasoning (reasoning internally rather than in a chain-of-thought)? As in, what kinds of capabilities would lead companies to build models that reason less and less in token-space?
I mean, the first obvious incentive is that training the model to internalize some of the reasoning means you don't have to pay for the extra tokens every time you want it to do complex reasoning.
The thing is, I expect we’ll eventually move away from just relying on transformers with scale. And so I’m trying to refine my understanding of the capabilities that are simply bottlenecked in this paradigm, and that model builders will need to resolve through architectural and algorithmic improvements. (Of course, based on my previous posts, I still think data is a big deal.)
Anyway, this kind of thinking eventually leads to the infohazardous area of, “okay then, what does the true AGI setup look like?” This is really annoying because it has alignment implications. If we start to move increasingly towards models that are reasoning outside of token-space, then alignment becomes harder. So, are there capability bottlenecks that eventually get resolved through something that requires out-of-context reasoning?
So far, it seems like the current paradigm will not be an issue on this front. Keep scaling transformers, and you don’t really get any big changes in the model’s likelihood of using out-of-context reasoning.
This is not limited to out-of-context reasoning. I'm trying to get a better understanding of the (dangerous) properties future models may develop simply as a result of needing to break a capability bottleneck. My worry is that many people end up over-indexing on the current transformer+scale paradigm (which may prove insufficient for ASI), and so they don't work on the right kinds of alignment or governance projects.
---
I’m unsure how big of a deal this architecture will end up being, but the rumoured xLSTM just dropped. According to the paper’s benchmarks, it outperforms other models at the same parameter count.
Maybe it ends up just being another drop in the bucket, but I think we will see more attempts in this direction.
Claude summary:
The key points of the paper are:
The authors introduce exponential gating with memory mixing in the new sLSTM variant. This allows the model to revise storage decisions and solve state tracking problems, which transformers and state space models without memory mixing cannot do.
They equip the mLSTM variant with a matrix memory and covariance update rule, greatly enhancing the storage capacity compared to the scalar memory cell of vanilla LSTMs. Experiments show this matrix memory provides a major boost.
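To make the matrix-memory idea concrete, here is a rough numpy sketch of one mLSTM memory step as I understand the paper's covariance update rule: the memory stores a value vector under a key via an outer product, and a query later retrieves it. Function and variable names are my own, and this omits the layer's projections and gate pre-activation details.

```python
import numpy as np

def mlstm_step(C, n, k, v, q, i_gate, f_gate):
    """One mLSTM memory step: covariance-style update of a matrix memory.

    C: (d, d) matrix memory; n: (d,) normalizer state;
    k, v, q: key/value/query vectors; i_gate/f_gate: scalar gates.
    Sketch only -- names and simplifications are mine, not the paper's code.
    """
    d = k.shape[0]
    k = k / np.sqrt(d)                        # scale keys, as in attention
    C = f_gate * C + i_gate * np.outer(v, k)  # covariance update: store v under key k
    n = f_gate * n + i_gate * k               # normalizer tracks gate-weighted keys
    h = C @ q / max(abs(n @ q), 1.0)          # retrieve the value addressed by q
    return C, n, h
```

The contrast with a vanilla LSTM is the memory's shape: a d-by-d matrix per head instead of a scalar per cell, which is where the claimed storage-capacity boost comes from.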
The sLSTM and mLSTM are integrated into residual blocks to form xLSTM blocks, which are then stacked into deep xLSTM architectures.
Extensive experiments demonstrate that xLSTMs outperform state-of-the-art transformers, state space models, and other LSTMs/RNNs on language modeling tasks, while also exhibiting strong scaling behavior to larger model sizes.
This work is important because it presents a path forward for scaling LSTMs to billions of parameters and beyond. By overcoming key limitations of vanilla LSTMs—the inability to revise storage, limited storage capacity, and lack of parallelizability—xLSTMs are positioned as a compelling alternative to transformers for large language modeling.
Instead of doing all computation step-by-step as tokens are processed, advanced models might need to store and manipulate information in a compressed latent space, and then “reason” over those latent representations in a non-sequential way.
The exponential gating with memory mixing introduced in the xLSTM paper directly addresses this need. Here’s how:
Exponential gating allows the model to strongly update or forget the contents of each memory cell based on the input. This is more powerful than the sigmoid gating in vanilla LSTMs, whose gates are bounded between 0 and 1. It means the model can decisively revise its stored knowledge as needed, rather than being constrained to incremental changes. This flexibility is crucial for reasoning, as it allows the model to rapidly adapt its latent state based on new information.
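Because a raw exp() gate can overflow, the paper stabilizes the gates by tracking a running maximum in log-space and rescaling both gates by it. A small numpy sketch of that trick (variable names are mine):

```python
import numpy as np

def stabilized_exp_gates(i_pre, f_pre, m_prev):
    """Exponential input/forget gates with log-space stabilization.

    i_pre, f_pre: gate pre-activations; m_prev: previous stabilizer state.
    Dividing both gates by exp(m) keeps values in a safe numeric range
    without changing the normalized cell output. (Sketch; names are mine.)
    """
    m = np.maximum(f_pre + m_prev, i_pre)   # new stabilizer state (running max)
    i_gate = np.exp(i_pre - m)              # stabilized input gate, <= 1
    f_gate = np.exp(f_pre + m_prev - m)     # stabilized forget gate, <= 1
    return i_gate, f_gate, m
```

Even with a pre-activation of 100, which would overflow a naive exp(), the stabilized gates stay finite.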
Memory mixing means that each memory cell is updated using a weighted combination of the previous values of all cells. This allows information to flow and be integrated between cells in a non-sequential way. Essentially, it relaxes the sequential constraint of traditional RNNs and allows for a more flexible, graph-like computation over the latent space.
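Concretely, the mixing lives in the recurrent matrices: each gate's pre-activation depends on the previous hidden states of all cells, not just its own. A minimal numpy sketch of one sLSTM step, assuming dense mixing matrices (the paper restricts them to block-diagonal per head; shapes and names here are my own, and gates are left unstabilized for brevity):

```python
import numpy as np

def slstm_step(c, n, h, x, W, R, b):
    """One sLSTM step with memory mixing.

    c, n, h: cell, normalizer, and hidden states, each (d,); x: input (d_in,).
    W, R, b: per-gate input weights, recurrent (mixing) weights, and biases.
    The R[g] @ h terms are the memory mixing: every cell's gates see every
    cell's previous hidden state. (Sketch; unstabilized exponential gates.)
    """
    pre = {g: W[g] @ x + R[g] @ h + b[g] for g in ("i", "f", "z", "o")}
    i = np.exp(pre["i"])             # exponential input gate
    f = np.exp(pre["f"])             # exponential forget gate
    z = np.tanh(pre["z"])            # candidate cell input
    o = 1 / (1 + np.exp(-pre["o"]))  # sigmoid output gate
    c = f * c + i * z                # cell update
    n = f * n + i                    # normalizer update
    h = o * (c / n)                  # normalized hidden state
    return c, n, h
```

Setting every R[g] to zero recovers an LSTM variant with no mixing, which makes the structural role of those matrices easy to see.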
Together, these two components endow the xLSTM with a dynamic, updateable memory that can be accessed and manipulated “outside” the main token-by-token processing flow. The model can compress information into this memory, “reason” over it by mixing and gating cells, then produce outputs guided by the updated memory state.
In this way, the xLSTM takes a significant step towards the kind of “reasoning outside token-space” that I suggested would be important for highly capable models. The memory acts as a workspace for flexible computation that isn’t strictly tied to the input token sequence.
Now, this doesn’t mean the xLSTM is doing all the kinds of reasoning we might eventually want from an advanced AI system. But it demonstrates a powerful architecture for models to store and manipulate information in a latent space, at a more abstract level than individual tokens. As we scale up this approach, we can expect models to perform more and more “reasoning” in this compressed space rather than via explicit token-level computation.