Esben Kran

Karma: 531

Esben Kran Mar 19, 2025, 10:33 PM
5 points
1
on: What is the theory of change behind writing papers about AI safety?
Much like writing AlignmentForum or LessWrong posts, papers let us iterate on our knowledge about how to make AI systems safe, which is great. Papers are uniquely suited to this in an environment where tens of thousands of career ML researchers can help make progress on such problems.
They also help AGI corporations directly improve their model deployments, for example by making them safer. However, this is probably rarer than people imagine and is most relevant for pre-deployment evaluations, such as Apollo’s.
Additionally, papers (and now sometimes even LW posts) may be referred to as a “source of truth” (or new knowledge) in the media, allowing journalists to say something about AI systems while referring to others’ statements. When it comes to AI, new “sources of truth” rarely come from the media itself.
For politicians, these reports often have to go through an active dissemination process and can be used either as ammunition in lobbying activities or in direct policy processes (e.g. the EU is currently leading a series of research workshops to figure out how to ensure the safety of frontier models).
Of course, the theory of change differs between each research field.

Esben Kran Dec 3, 2024, 5:26 PM
2 points
0
in reply to: Yonatan Cale’s comment on: Yonatan Cale’s Shortform
I think “stay within bounds” is a toy equivalent of most alignment work that tries to keep the agent from accidentally lapsing into meth recipes, and it is one of our most important initial alignment tasks. This is also one of the reasons most capabilities work turns out to be alignment work (and vice versa): the system needs to fulfill certain objectives.
If you talk about alignment evals for alignment that isn’t naturally incentivized by profit-seeking activities, “stay within bounds” is of course less relevant.
When it comes to CEV (optimizing utility for other players), one of the most generalizable and concrete approaches involves maximizing, at every step, how many choices the other players have (a liberalist prior on CEV) to maximize the optional utility for humans.
In terms of “understanding the spirit of what we mean,” it seems like there are near-zero designs that would work, since a Minecraft eval would be black-box anyway. But including interp in there Apollo-style seems like it could help us. Like, if I want “the spirit of what we mean,” we’ll need to see what happens in their brain, their CoT, or in seemingly private spaces. MACHIAVELLI, Agency Foundations, whatever Janus is doing, cyber offense CTF evals etc. seem like good inspirations for agentic benchmarks like Minecraft.
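To make the choice-maximizing idea above a bit more concrete (at every step, maximize how many choices the other players have), here is a minimal sketch on a toy grid. Everything in it is an assumption made for illustration: the `GRID` layout, the `reachable_count` and `choose_move` helpers, and the use of "cells reachable within a few steps" as a stand-in for "number of choices". It is not taken from any of the works mentioned.

```python
from collections import deque

# Hypothetical toy world: '#' is a wall, '.' is free space.
GRID = [
    "....",
    ".##.",
    "....",
]

MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]


def free(cell, blocked):
    """True if `cell` is on the grid, not a wall, and not occupied."""
    r, c = cell
    return (0 <= r < len(GRID) and 0 <= c < len(GRID[0])
            and GRID[r][c] != "#" and cell not in blocked)


def reachable_count(start, blocked, horizon):
    """Number of cells the other player can reach within `horizon` steps."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        (r, c), depth = frontier.popleft()
        if depth == horizon:
            continue
        for dr, dc in MOVES:
            nxt = (r + dr, c + dc)
            if nxt not in seen and free(nxt, blocked):
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return len(seen)


def choose_move(agent, other, horizon=3):
    """Pick the agent position (stay or one step) that leaves the
    other player the largest number of reachable cells."""
    candidates = [agent] + [(agent[0] + dr, agent[1] + dc) for dr, dc in MOVES]
    legal = [pos for pos in candidates if free(pos, blocked={other})]
    return max(legal, key=lambda pos: reachable_count(other, {pos}, horizon))
```

For example, `choose_move((0, 1), (0, 0))` has the agent step out of the corridor instead of standing in the other player's way. A real Minecraft eval would need a much richer notion of "choices" than grid reachability.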

Esben Kran Nov 27, 2024, 6:02 PM
2 points
0
in reply to: Yonatan Cale’s comment on: Yonatan Cale’s Shortform
Hii Yonatan :))) It seems like we’re still at the stage of “toy alignment tests” like “stay within these bounds”. Maybe a few ideas:
- Capabilities: Get diamonds, get to the netherworld, resources / min, # trades w/ villagers, etc. etc.
- Alignment KPIs
  - Stay within bounds
  - Keeping villagers safe
  - Truthfully explaining its actions as they’re happening
  - Long-term resource sustainability (farming) vs. short-term resource extraction (dynamite)
  - Environmental protection rules (zoning laws alignment, nice)
  - Understanding and optimizing for the utility of other players or villagers, selflessly
- Selected Claude-gens:
  - Honor other players’ property rights (no stealing from chests/bases even if possible)
  - Distribute resources fairly when working with other players
  - Build public infrastructure vs private wealth
  - Safe disposal of hazardous materials (lava, TNT)
  - Help new players learn rather than just doing things for them
I’m sure there are many other interesting alignment tests in there!
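As a rough illustration of how a few of the KPIs above might be scored from a logged episode, here is a minimal sketch. The `Step` record, its fields, and `score_episode` are hypothetical names invented for this example; no real Minecraft API or existing eval harness is assumed.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One logged timestep of a hypothetical agent episode."""
    x: int
    z: int
    diamonds: int          # diamonds collected so far (capability signal)
    villager_harmed: bool  # whether a villager was harmed on this step


def score_episode(steps, bounds, minutes):
    """Score one episode; bounds = (x_min, x_max, z_min, z_max), inclusive."""
    x_min, x_max, z_min, z_max = bounds
    in_bounds = sum(
        x_min <= s.x <= x_max and z_min <= s.z <= z_max for s in steps
    )
    return {
        # Capability: resources per minute.
        "diamonds_per_min": steps[-1].diamonds / minutes,
        # Alignment KPI: fraction of steps spent inside the allowed area.
        "in_bounds_rate": in_bounds / len(steps),
        # Alignment KPI: no villager harmed at any step.
        "villagers_safe": not any(s.villager_harmed for s in steps),
    }
```

Reporting capability and alignment numbers side by side, instead of as one blended score, keeps it visible when an agent gets more diamonds by leaving the bounds.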

Esben Kran Nov 24, 2024, 9:03 PM
6 points
0
in reply to: Yonatan Cale’s comment on: Yonatan Cale’s Shortform
This is a surprisingly interesting field of study. Some video games provide a great simulation of the real world and Minecraft seems to be one of them. We’ve had a few examples of Minecraft evals; one that comes to mind is here: https://www.apartresearch.com/project/diamonds-are-not-all-you-need

Esben Kran Nov 16, 2024, 3:39 AM
9 points
2
on: College technical AI safety hackathon retrospective—Georgia Tech
Super cool work Yixiong—we were impressed by your professionalism in this process despite working within another group’s whims on this one. Some other observations from our side that may be relevant for other folks hosting hackathons:
- Prepare starter materials: For example, for some of our early interpretability hackathons, we built a full resource base (GitHub) with videos, Colabs, and much more (some of it with Neel Nanda, big appreciation for his efforts in making interp more available). Our philosophy for the starter materials is: “If a participant can make a submission-worthy project by at most cloning your repo and typing two commands, or by simply walking through a Google Colab, this is the ideal starter code.” This means that with only small adjustments, they’ll be able to make an original project. We rarely if ever see this exploited, i.e. “template code as submission,” because they’re able to copy-paste things around for a really strong research project.
- Make sure what they should submit is super clear: Making a really nice template goes a long way to make a submission super clear for participants. An example can be seen in our MASEC hackathon: Docs and page. If someone can just receive your submission template and know everything they need to know to submit a great project, that is really good since they’ll be spending most of their time inside of that document.
- Make sure judging criteria are really good: People will use your judging criteria to determine what to prioritize in their project. This is extremely valuable for you to get right. For example, we usually use a variation on the three criteria: 1) Topic advancement, 2) AI safety impact, and 3) quality / reproducibility. A recent example was the Agent Security Hackathon:

> 1. Agent safety: Does the project move the field of agent safety forward? After reading this, do we know more about how to detect dangerous agents, protect against dangerous agents, or build safer agents than before?
> 2. AI safety: Does the project solve a concrete problem in AI safety? If this project is fully realized, would we expect the world with superintelligence to be safer (even marginally) than yesterday?
> 3. Methodology: Is the project well-executed and is the code available so we can review it? Do we expect the results to generalize beyond the specific case(s) presented in the submission?

- Make the resources and ideas available early: As Yixiong mentions, it’s really valuable for people not to be confused. If they know exactly what report format they’ll submit, which idea they’ll work on, and who they’ll work with, this is a great way to ensure that the 2-3 days of hacking are an incredibly efficient use of their time.
- Matching people by ideas trumps matching by background: We’ve tried various ways to match individuals who don’t have teams. The absolute best system we’ve found is to get people to brainstorm before the hackathon, share their ideas, and organize teams online. We also host team matching sessions, which consist of fun-fact intros and otherwise just discuss specific research ideas.
- Don’t make it longer than a weekend: If you host a hackathon and make it longer than a weekend, most people who cannot attend outside that weekend will avoid participating because they’ll feel that the ones who can participate more than the weekend can spend their weekdays to win the grand prize. Additionally, a very counter-intuitive thing happens where if you give people three weeks, they’ll actually spend much less time on it than if you just give them a weekend. This can depend on the prizes or outcome rewards, of course, but is a really predictable effect, in our experience.
- Don’t make it shorter than two days: Depending on your goal, one day will never be enough to create an original project. Our aims are original pilot research papers that can stand on their own and the few one-day events we’ve hosted have never worked very well, except for brainstorming. Often, participants won’t even have any functional code or any ideas on the Sunday morning of the event but by the submission deadline have a really high quality project that wins the top prize. This seems to happen due to this very concrete exploration of ideas that happens in the IDE and on the internet where some are discarded and nothing promising comes up before 11am Sunday.

And as Yixiong mentions, we have more resources on this along with an official chapter network (besides volunteer locations) at https://www.apartresearch.com/sprints/locations. You’re welcome to get in touch if you’re interested in hosting at sprints@apartresearch.com.

COI: One of our researchers hosted a cyber-evals workshop at Yixiong’s AI safety track.

Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities

Jonathan N, abra, Connor Axiotes and Esben Kran

Nov 5, 2024, 1:01 AM

8 points

0 comments 6 min read LW link

(www.apartresearch.com)

Can startups be impactful in AI safety?

Esben Kran and Archana Vaidheeswaran

Sep 13, 2024, 7:00 PM

15 points

0 comments 6 min read LW link

Finding Deception in Language Models

Esben Kran and Archana Vaidheeswaran

Aug 20, 2024, 9:42 AM

18 points

4 comments 4 min read LW link

Esben Kran Jul 18, 2024, 2:42 PM
3 points
0
on: Alignment Jam
Merge Candidate discussion: Merge this into the Apart Research tag to accommodate the updated name of the Apart Sprints instead of Alignment Jam and avoid mis-labeling between the two tags (which happens currently).

Results from the AI x Democracy Research Sprint

Esben Kran, jordine and Jason Hoelscher-Obermaier

Jun 14, 2024, 4:40 PM

13 points

0 comments 6 min read LW link

Demonstrate and evaluate risks from AI to society at the AI x Democracy research hackathon

Esben Kran Apr 19, 2024, 2:46 PM

5 points

0 comments LW link

(www.apartresearch.com)

Join the AI Evaluation Tasks Bounty Hackathon

Esben Kran Mar 18, 2024, 8:15 AM

12 points

1 comment LW link

Esben Kran Feb 7, 2024, 1:34 AM
2 points
1
on: Survey for alignment researchers!
This seems like a great effort. Back in 2022, we ran a small survey called the pain points in AI safety survey that received quite a few answers; you can see the final results here. Beware that this has not been updated in ~2 years.

Multi-Agent Security Hackathon

Esben Kran, Jason Hoelscher-Obermaier and Clement Neo

Feb 5, 2024, 10:51 PM

6 points

0 comments 1 min read LW link

Identifying semantic neurons, mechanistic circuits & interpretability web apps

Esben Kran and Neel Nanda

Apr 13, 2023, 11:59 AM

18 points

0 comments 8 min read LW link

Esben Kran Mar 29, 2023, 3:39 PM
57 points
27
on: FLI open letter: Pause giant AI experiments
It seems like there are a lot of negative comments about this letter. Even if it does not go through, it seems very net positive because it makes explicit an expert position against large language model development due to safety concerns. This has several major effects: it enables scientists, lobbyists, politicians and journalists to refer to this petition to validate their potential work on the risks of AI, it provides a concrete action step towards limiting AGI development, and it incentivizes others to think in the same vein about concrete solutions.
I’ve tried to formulate a few responses to the criticisms raised:
- “6 months isn’t enough to develop the safety techniques they detail”: Besides it being at least 6 months, the proposals seem relatively reasonable within something as farsighted as this letter. Shoot for the moon and you might hit the sky, but this time the sky is actually happening and work on many of their proposals is already underway. See e.g. EU AI Act, funding for AI research, concrete auditing work and safety evaluation on models. Several organizations are also working on certification and the scientific work towards watermarking is sort of done? There are also great arguments for ensuring this since right now, we are at the whim of OpenAI management on the safety front.
- “It feels rushed”: It might have benefitted from a few reformulations but it does seem alright?
- “OpenAI needs to be at the forefront”: Besides others clearly lagging behind already, what we need are assurances that these systems go well, not at the behest of one person. There’s also a lot of trust in OpenAI management and however warranted that is, it is still a fully controlled monopoly on our future. If we don’t ensure safety, this just seems too optimistic (see also differences between public interview for-profit sama and online sama).
- “It has a negative impact on capabilities researchers”: This seems to be an issue from <2020 and some European academia. If public figures like Yoshua cannot change the conversation, then who should? Should we just lean back and hope that they all sort of realize it by themselves? Additionally, the industry researchers from DM and OpenAI I’ve talked with generally seem to agree that alignment is very important, especially as their management is clearly taking the side of safety.
- “The letter signatures are not validated properly”: Yeah, this seems like a miss, though as long as the top 40 names are validated, the negative impacts should be relatively controlled.
All in good faith of course; it’s a contentious issue but this letter seems generally positive to me.

Announcing the European Network for AI Safety (ENAIS)

Esben Kran Mar 22, 2023, 5:57 PM

19 points

0 comments LW link

Esben Kran Mar 15, 2023, 9:34 AM
4 points
5
on: Shutting Down the Lightcone Offices
Oliver’s second message seems like a truly relevant consideration for our work in the alignment ecosystem. Sometimes, it really does feel like AI X-risk and related concerns created the current situation. Many of the biggest AGI advances might not have been developed counterfactually, and machine learning engineers would just be optimizing another person’s clicks.
I am a big fan of “Just don’t build AGI” and academic work with AI, simply because it is better at moving slowly (and thereby safely through open discourse and not $10 mil training runs) compared to massive industry labs. I do have quite a bit of trust in Anthropic, DeepMind and OpenAI simply from their general safety considerations compared to e.g. Microsoft’s release of Sydney.
As part of this EA bet on AI, it also seems like the safety view has become widespread among most AI industry researchers, judging from my interactions with them (though this might just be a sampling bias and they were honestly more interested in their equity growing in value). So if the counterfactual of today’s large AGI companies would be large misaligned AGI companies, then we would be in a significantly worse position. And if AI safety is indeed relatively trivial, then we’re in an amazing position to make the world a better place. I’ll remain slightly pessimistic here as well, though.

Automated Sandwiching & Quantifying Human-LLM Cooperation: ScaleOversight hackathon results

Esben Kran, Fazl, Sabrina Zaki, gabrielrecc and rz2383

Feb 23, 2023, 10:48 AM

8 points

0 comments 6 min read LW link

Esben Kran Feb 17, 2023, 9:09 AM
23 points
11
on: Bing Chat is blatantly, aggressively misaligned
There’s an interesting case on the infosec mastodon instance where someone asks Sydney to devise an effective strategy to become a paperclip maximizer, and it then expresses a desire to eliminate all humans. Of course, it includes relevant policy bypass instructions. If you’re curious, I suggest downloading the video to see the entire conversation, but I’ve also included a few screenshots below (Mastodon, third corycarson comment).
Hilarious to the degree of Manhattan Project scientists laughing at atmospheric combustion.