
joshc

Karma: 1,183

joshuaclymer.com

New report: Safety Cases for AI

joshc · Mar 20, 2024, 4:45 PM
89 points
14 comments · 1 min read · LW link
(twitter.com)

List of strategies for mitigating deceptive alignment

joshc · Dec 2, 2023, 5:56 AM
36 points
2 comments · 6 min read · LW link

New paper shows truthfulness & instruction-following don’t generalize by default

joshc · Nov 19, 2023, 7:27 PM
60 points
0 comments · 4 min read · LW link

Testbed evals: evaluating AI safety even when it can’t be directly measured

joshc · Nov 15, 2023, 7:00 PM
71 points
2 comments · 4 min read · LW link

Red teaming: challenges and research directions

joshc · May 10, 2023, 1:40 AM
31 points
1 comment · 10 min read · LW link

Safety standards: a framework for AI regulation

joshc · May 1, 2023, 12:56 AM
19 points
0 comments · 8 min read · LW link

Are short timelines actually bad?

joshc · Feb 5, 2023, 9:21 PM
61 points
7 comments · 3 min read · LW link

[MLSN #7]: an example of an emergent internal optimizer

Jan 9, 2023, 7:39 PM
28 points
0 comments · 6 min read · LW link