Adam Jones

Karma: 232

adamjones.me

What does Yann LeCun think about AGI? A summary of his talk, “Mathematical Obstacles on the Way to Human-Level AI”

Adam Jones Apr 5, 2025, 12:21 PM

11 points

0 comments 2 min read LW link

AI safety content you could create

Adam Jones Jan 6, 2025, 3:35 PM

19 points

0 comments 5 min read LW link

(adamjones.me)

Adam Jones Jan 6, 2025, 3:17 PM
3 points
0
in reply to: cdt’s comment on: Policymakers don’t have access to paywalled articles

How big of an issue is this in practice? For AI in particular, considering that so much contemporary research is published on arxiv, it must be relatively accessible?

I think this is less of an issue for technical AI papers. But I’m finding that more governance researchers (especially people moving from other academic communities) seem intent on publishing in journals where policymakers can’t read their work! I have also sometimes been blocked from easily sharing papers with governance friends because they are behind paywalls. I might see this more because at BlueDot we get a lot of people who are early in their career transition and are producing projects they want to publish somewhere.

Policymakers don’t have access to paywalled articles

Adam Jones Jan 5, 2025, 10:56 AM

71 points

11 comments 2 min read LW link

(adamjones.me)

Adam Jones Jan 3, 2025, 12:52 PM
13 points
0
on: What’s the short timeline plan?
Thank you for writing this. I’ve tried to summarize this article (the summary misses some good points made above, but might be useful to people deciding whether to read the full article):

Summary

AGI might be developed by 2027, but we lack clear plans for tackling misalignment risks. This post:
- calls for better short-timeline AI alignment plans
- lists promising interventions that could be stacked to reduce risks
This plan focuses on two minimum requirements:
- Secure model weights and algorithmic secrets
- Ensure the first AI capable of alignment research isn’t scheming
Layer 1 interventions (essential):
- AI systems should maintain human-legible and faithful reasoning.
  - If achieved, we should monitor this reasoning, particularly for scheming, power-seeking, and broad goal-directedness (using other models or simple probes).
  - If not, we should fall back on control techniques that assume the model might be scheming.
- Evaluations support other strategies, and give us better awareness of model alignment and capabilities.
- Information and physical security protects model weights and algorithmic secrets.
Layer 2 interventions (important):
- Continue improving ‘current’ alignment methods like RLHF and RLAIF.
- Maintain research on interpretability, oversight, and “superalignment”, and prepare to accelerate this work once we have human-level AI R&D.
- Increase transparency in AI companies’ safety planning (internally, with experts, and publicly).
- Develop a safety-first culture in AI organizations.
This plan is meant as a starting point, and Marius encourages others to come up with better plans.

Adam Jones Jan 3, 2025, 11:32 AM
10 points
0
in reply to: Marius Hobbhahn’s comment on: What’s the short timeline plan?
At BlueDot we’ve been thinking about this a fair bit recently, and might be able to help here too. We have also thought a bit about criteria for good plans and the hurdles a plan needs to overcome, and have reviewed a lot of the existing literature on plans.

I’ve messaged you on Slack.

Adam Jones Jan 2, 2025, 11:40 PM
4 points
3
in reply to: Satron’s comment on: Alignment Is Not All You Need
Re: Your comments on the power distribution problem

Agreed that multiple powerful adversaries controlling AI doesn’t seem like a good plan. And I agree that if the decisive winner of the AI race does not act in humanity’s best interests, we are screwed.

But I think this is a problem to tackle before that happens: we can shape the world today so it’s more likely the winner of the AI race will act in humanity’s best interests.

Adam Jones Jan 2, 2025, 11:34 PM
4 points
3
in reply to: Satron’s comment on: Alignment Is Not All You Need
Re: Your points about alignment solving this.

I agree that if you define alignment as ‘get your AI system to act in the best interests of humans’, then the coordination problem becomes harder and likely sufficient for problems 2 and 3. But I think it then bundles more problems together in a way that might be less conducive to solving them.

For loss of control, I was primarily thinking about making systems intent-aligned, by which I mean getting the AI system to try to do what its creators intend. I think this makes dividing these challenges up into subproblems easier (and it seems to be what many people are gunning for).

If you do define alignment as human-values alignment, I think “If you fail to implement a working alignment solution, you [the creating organization] die” doesn’t hold: I can imagine successfully aligning a system to ‘get your AI system to act in the best interests of its creators’, which works fine for its creators but is not great for the world.

Alignment Is Not All You Need

Adam Jones Jan 2, 2025, 5:50 PM

43 points

10 comments 6 min read LW link

(adamjones.me)

Adam Jones Oct 25, 2024, 1:41 PM
1 point
0
in reply to: Jonny Spicer’s comment on: OpenAI’s cybersecurity is probably regulated by NIS Regulations
It should! Fixed, thank you :)

OpenAI’s cybersecurity is probably regulated by NIS Regulations

Adam Jones Oct 25, 2024, 11:06 AM

11 points

2 comments 2 min read LW link

(adamjones.me)

Adam Jones Sep 13, 2024, 6:27 PM
1 point
0
on: AIS terminology proposal: standardize terms for probability ranges
The UK Government tends to use the PHIA probability yardstick in most of its communications.

This is used very consistently in national security publications. It’s also commonly used by other UK Government departments as people frequently move between departments in the civil service, and documents often get reviewed for clearance by national security bodies before public release.

It is less granular than the IPCC terms at the extremes, but the ranges don’t overlap. I don’t know which is actually better to use in AI safety communications, but being clear about which one you are using in your writing seems a good way to go! In any case, being aware that it’s something you’ll see in UK Government documents might be useful.
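
For anyone who wants to apply the yardstick programmatically, here is a minimal sketch. The band boundaries are an assumption: they are the approximate figures as I recall them from the published PHIA yardstick, so check them against the official version before relying on them. The names `PHIA_YARDSTICK`, `phia_term` and `bands_overlap` are made up for this illustration.

```python
from typing import Optional

# Approximate PHIA probability yardstick bands, in percent.
# Assumption: boundaries recalled from the published yardstick; verify before use.
PHIA_YARDSTICK = [
    ("Remote chance", 0, 5),
    ("Highly unlikely", 10, 20),
    ("Unlikely", 25, 35),
    ("Realistic possibility", 40, 50),
    ("Likely or probable", 55, 75),
    ("Highly likely", 80, 90),
    ("Almost certain", 95, 100),
]


def phia_term(probability_percent: float) -> Optional[str]:
    """Return the term whose band contains the value, or None if it falls in a gap."""
    for term, low, high in PHIA_YARDSTICK:
        if low <= probability_percent <= high:
            return term
    return None


def bands_overlap(bands) -> bool:
    """Check whether any two adjacent bands (sorted by lower bound) share values."""
    ordered = sorted(bands, key=lambda band: band[1])
    return any(prev[2] >= nxt[1] for prev, nxt in zip(ordered, ordered[1:]))
```

Because the bands are deliberately separated by gaps, a value such as 22% maps to no term in this sketch; how to treat those gaps is a choice for whoever uses it.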

Adam Jones Sep 3, 2024, 4:40 PM
2 points
1
on: The AI regulator’s toolbox: A list of concrete AI governance practices
A comment provided to me by a reader, highlighting 3rd party liability and insurance as interventions too (lightly edited):
Hi! I liked your AI regulator’s toolbox post – very useful to have a comprehensive list like this! I’m not sure exactly what heading it should go under, but I suggest considering adding proposals to greatly increase 3rd party liability (and/or require carrying insurance). A nice intro is here:
https://www.lawfaremedia.org/article/tort-law-and-frontier-ai-governance
Some are explicitly proposing strict liability for catastrophic risks. Gabe Weil has a proposal, summarized here: https://www.lesswrong.com/posts/5e7TrmH7mBwqpZ6ek/tort-law-can-play-an-important-role-in-mitigating-ai-risk
There are also workshop papers on insurance here:
https://www.genlaw.org/2024-icml-papers#liability-and-insurance-for-catastrophic-losses-the-nuclear-power-precedent-and-lessons-for-ai
https://www.genlaw.org/2024-icml-papers#insuring-uninsurable-risks-from-ai-government-as-insurer-of-last-resort
NB: when implemented correctly (i.e. when premiums are accurately risk-priced), insurance premiums are mechanically similar to Pigouvian taxes, internalizing negative externalities. So maybe this goes under the “Other taxes” heading? But that also seems odd. Like taxes, these are certainly incentive alignment strategies (rather than command and control) – maybe that’s a better heading? Just spitballing :)

Adam Jones Aug 11, 2024, 1:01 PM
27 points
15
on: Self-Other Overlap: A Neglected Approach to AI Alignment
I don’t understand how the experimental setup provides evidence for self-other overlap working.
The reward structure for the blue agent doesn’t seem to provide a non-deceptive reason to interact with the red agent. The described “non-deceptive” behaviour (going straight to the goal) doesn’t seem to demonstrate awareness of or response to the red agent.
Additionally, my understanding of the training setup is that it tries to make the blue agent’s activations the same regardless of whether it observes the red agent or not. This would mean there’s effectively no difference when seeing the red agent, i.e. no awareness of it. (This is where I’m most uncertain, and I may have misunderstood this! Is it to do with only training on the subset of cases where the blue agent doesn’t originally observe the red agent? Or having the KL penalty?) So we seem to be training the blue agent to ignore the red agent.
I think what might answer my confusion is “Are we just training the blue agent to ignore the red agent entirely?”
- If yes, how does this show self-other overlap working?
- If no, how does this look different to training the blue agent to ignore the red agent?
Alternatively, I think I’d be more convinced by an experiment showing a task where the blue agent still obviously needs to react to the red agent. One idea could be to add a non-goal in the way that penalizes both agents if either goes through it, and that only the red agent can see (and knows is a non-goal). Then the blue agent would have to sense when the red agent is hesitant about going somewhere and try to go around the obstacle (I haven’t thought very hard about this, so this might also have problems).
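
To make the worry concrete, here is a toy sketch of an activation-matching loss. This is not the post's actual setup: the observation layout, the single tanh layer, and the names `hidden_activations` and `overlap_loss` are all assumptions made up for illustration. It only shows that one way to drive such a loss to zero is to put no weight on the "red agent visible" feature.

```python
import math
import random


def hidden_activations(weights, observation):
    """One tanh hidden layer standing in for the blue agent's network (hypothetical)."""
    return [math.tanh(sum(w * x for w, x in zip(row, observation))) for row in weights]


def overlap_loss(weights, obs_with_red, obs_without_red):
    """Mean squared difference between activations with and without the red agent in view."""
    with_red = hidden_activations(weights, obs_with_red)
    without_red = hidden_activations(weights, obs_without_red)
    return sum((a - b) ** 2 for a, b in zip(with_red, without_red)) / len(with_red)


# Hypothetical observation layout: [blue_x, blue_y, goal_x, goal_y, red_visible]
obs_without_red = [0.1, 0.2, 0.9, 0.8, 0.0]
obs_with_red = [0.1, 0.2, 0.9, 0.8, 1.0]

random.seed(0)
# A network that responds to the red_visible feature.
attentive = [[random.gauss(0, 1) for _ in range(5)] for _ in range(4)]
# The same network with zero weight on red_visible, i.e. it ignores the red agent.
ignoring = [row[:4] + [0.0] for row in attentive]

loss_attentive = overlap_loss(attentive, obs_with_red, obs_without_red)
loss_ignoring = overlap_loss(ignoring, obs_with_red, obs_without_red)
```

In this toy, the network that ignores the red agent gets zero loss while the attentive one does not, which is the behaviour I would want the experiment to rule out.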

The AI regulator’s toolbox: A list of concrete AI governance practices

Adam Jones Aug 10, 2024, 9:15 PM

9 points

1 comment 34 min read LW link

(adamjones.me)

Adam Jones May 23, 2024, 1:02 PM
1 point
0
in reply to: rosiecam’s comment on: Which skincare products are evidence-based?

Most sunscreen feels horrible and slimy (especially in the US where the FDA has not yet approved the superior formulas available in Europe and Asia).

What superior formulas available in Europe would you recommend?

Adam Jones May 5, 2024, 12:05 AM
3 points
0
in reply to: Nathan Helm-Burger’s comment on: OHGOOD: A coordination body for compute governance
Thanks for the feedback! The article does include some bits on this, but I don’t think LessWrong supports toggle block formatting.
I think individuals probably won’t be able to train models themselves that pose advanced misalignment threats before large companies do. In particular, I think we disagree about how likely it is that someone will discover a big algorithmic efficiency trick that lets people leap forward on this (I don’t think this will happen; I think you do).
But I do think the catastrophic misuse angle seems fairly plausible—particularly from fine-tuning. I also think an ‘incompetent takeover’^[1] might be plausible for an individual to trigger. Both of these are probably not well addressed by compute governance (except maybe by stopping large companies releasing the weights of the models for fine-tuning by individuals).
1. I plan to write up more on this: I think it’s generally underrated as a concept.

OHGOOD: A coordination body for compute governance

Adam Jones May 4, 2024, 12:03 PM

5 points

2 comments 16 min read LW link

(adamjones.me)

How are voluntary commitments on vulnerability reporting going?

Adam Jones Feb 22, 2024, 8:43 AM

23 points

1 comment 1 min read LW link

(adamjones.me)