I’m doing research and other work focused on AI safety/security, governance and risk reduction. Currently my top projects are (last updated Feb 26, 2025):
Technical researcher at UC Berkeley's AI Security Initiative, part of the Center for Long-Term Cybersecurity (CLTC)
Serving on the board of directors for AI Governance & Safety Canada
My general areas of interest include AI safety strategy, comparative AI alignment research, prioritizing technical alignment work, analyzing the published alignment plans of major AI labs, interpretability, deconfusion research, and other AI safety-related topics.
Research that I’ve authored or co-authored:
Steering Behaviour: Testing for (Non-)Myopia in Language Models
Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios
(Scroll down to read other posts and comments I’ve written)
Before getting into AI safety, I was a software engineer for 11 years at Google and various startups. You can find details about my previous work on my LinkedIn.
While I’m not always great at responding, I’m happy to connect with other researchers or people interested in AI alignment and effective altruism. Feel free to send me a private message!
Wow, this seems like a really important breakthrough.
Are defection probes also a solution to the undetectable backdoor problem from Goldwasser et al. 2022?
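(For context, my rough understanding of a defection probe is a linear classifier trained on a model's internal activations to flag when it is about to defect. Below is a minimal toy sketch of that idea — synthetic activations and a plain logistic regression, not the actual setup from the post or the Goldwasser et al. construction.)

```python
# Toy sketch of a linear "defection probe": train a logistic regression on
# (synthetic) hidden-state activations labelled as defecting vs. benign.
# Purely illustrative; the activations and dimensions here are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 512          # hypothetical hidden-state dimension
n_per_class = 200      # examples per class

# Synthetic activations: "defection" examples are shifted along one direction.
defection_direction = rng.normal(size=d_model)
defection_direction /= np.linalg.norm(defection_direction)

benign = rng.normal(size=(n_per_class, d_model))
defect = rng.normal(size=(n_per_class, d_model)) + 2.0 * defection_direction

X = np.vstack([benign, defect])
y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", probe.score(X, y))
```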