How are you currently determining which words to highlight? You say “terms that readers might not know” but this varies a lot based on the reader (as you mention in the long-term vision section).
FWIW I think it’s not uncommon for people to not use LLMs daily (e.g. I don’t).
FWIW I think the actual person with responsibility is the author if the author approves it, and you if the author doesn’t.
I believe I’m seeing Gill Sans? But when I google “Calibri” I see text that looks like it’s in Calibri, so that’s confusing.
Since people have reported not being able to see the tweet thread, I will reproduce it in this comment (with pictures replaced by my descriptions of them):
If developers had to prove to regulators that powerful AI systems are safe to deploy, what are the best arguments they could use?
Our new report tackles the (very big!) question of how to make a ‘safety case’ for AI.
[image of the start of the paper]
We define a safety case as a rationale developers provide to regulators to show that their AI systems are unlikely to cause a catastrophe.
The term ‘safety case’ is not new. In many industries (e.g. aviation), products are ‘put on trial’ before they are released.
[cartoon of a trial: regulator is a judge, one side is a developer advancing a safety case, other side is a red team advancing a risk case]
We simplify the process of making a safety case by breaking it into six steps.
1. Specify the macrosystem (all AI systems) and the deployment setting.
2. Concretize ‘AI systems cause a catastrophe’ into specific unacceptable outcomes (e.g. the AI systems build a bioweapon).
3. Justify claims about the deployment setting.
4. Carve up the collection of AI systems into smaller groups (subsystems) that can be analyzed in isolation.
5. Assess risk from subsystems acting unilaterally.
6. Assess risk from subsystems cooperating together.
[The above, but in picture format]
We first focus on step 5: how would developers argue that individual subsystems are safe? These arguments can be grouped into four categories.
Arguments could assert AI systems are
too weak to be dangerous even if open-sourced (inability)
unable to cause disaster due to restrictions (control)
safe despite ability to cause harm (trustworthiness)
judged to be safe by credible AI advisors (deference)
[The above, but in picture format and slightly reworded]
Consider the analogy of a robber and a diamond. How might one justify that the robber won’t steal the diamond?
[Image with heading: “Why won’t the robber steal the diamond?”.
Row 1 says “Inability The robber is an infant. Analogy: e.g. GPT-3 is clearly too weak to be dangerous”, next to an image of a baby near a diamond on a pedestal.
Row 2 says “Control The robber is in jail. Analogy: e.g. AI systems are sandboxed and can’t escape.”, next to an image of a prisoner in a box with bars, near a diamond on a pedestal.
Row 3 says “Trustworthiness The robber does not steal. Analogy: e.g. AI systems follow developer instructions.”, next to an image of a person sitting down cross-legged and wearing an orange robe, next to a diamond on a pedestal.
Row 4 says “Deference The security service CEO says the diamond is secure. Analogy: AI advisors say that AI systems are safe.”, next to an image of a figure with grey hair saying “All good” next to a diamond on a pedestal with lines coming out of it, connecting it to small grey boxes (sensors?).]
The ordering of these categories is intentional. As AI systems become more powerful, developers will likely rely mostly on inability, then control, then trustworthiness, and finally, deference to AI advisors.
[Image of graph where the horizontal axis is “Increasingly powerful AI” and the vertical axis is “Primary safety argument”. Inability, Control, Trustworthiness, and Deference are shown in order from bottom-left to top-right. An arrow connects the words “We are here” to Inability.]
Next, we give examples of arguments in each category. Arguments are ranked on three axes:
Practicality
Strength
Scalability
No argument received full marks! Research will be needed to justify the safety of advanced AI systems.
[A complicated diagram showing a variety of arguments under the Inability, Control, Trustworthiness, and Deference categories, together with ratings for their Practicality, Maximum Strength, and Scalability.]
The arguments in the previous step pertain to small groups of AI systems. It would be difficult to directly apply them to large groups. We also explain how to justify that the actions of many AI systems won’t cause a catastrophe (step 6 in our framework).
[Image titled “Large-scale AI misbehavior”. Below are 3 rows, with 2 columns. The left column is labelled “Causes” and the right is labelled “Strategies”.
Row 1: Cause: Widespread alignment faking. Strategy: Blitzkrieg: overwhelm controls.
Row 2: Cause: Infectious jailbreaks. Strategy: Strike: disable infrastructure.
Row 3: Cause: Rapid memetic value drift. Strategy: Hivemind: combine intelligence.
Dots are shown below, likely to indicate that there are more causes and strategies not shown.]
We are hoping this report will:
Motivate research that further clarifies the assumptions behind safety arguments.
Inform the design of hard safety standards.
More in the paper: https://bit.ly/3IJ5N95 Many thanks to my coauthors! @NickGabs01, @DavidSKrueger, and @thlarsen.
Might be of interest to @bshlgrs, @RogerGrosse, @DavidDuvenaud, @EvanHub, @aleks_madry, @ancadianadragan, @rohinmshah, @jackclarkSF, @Manderljung, @RichardMCNgo
Update: I have already gotten over it.
It looks kinda small to me, someone who uses Firefox on Ubuntu.
A thing you are maybe missing is that the discussion groups are now in the past.
You should be sure to point out that many of the readings are dumb and wrong.
The hope is that the scholars notice this on their own.
Week 3 title should maybe say “How could we safely train AIs…”? I think there are other training options if you don’t care about safety.
Lol nice catch.
We included a summary of Situational Awareness as an optional reading! I guess I thought the full thing was a bit too long to ask people to read. Thanks for the other recs!
to simplify, we ask that for every expression and set of arguments
Here and in the next dot point, should the inner heuristic estimate be conditioning on a larger set of arguments (perhaps chosen by an unknown method)? Otherwise it seems like you’re just expressing some sort of self-knowledge.
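To spell out what I mean (this is just my reading, and I'm assuming notation: write $G(Y \mid \pi)$ for the heuristic estimate of an expression $Y$ given a set of arguments $\pi$, treating the inner estimate as itself an expression the estimator can be asked about). The property as stated seems to have the shape

$$G\big(G(Y \mid \pi) \,\big|\, \pi\big) = G(Y \mid \pi),$$

which just says the estimator can predict its own output on $(Y, \pi)$. The version I'd find more interesting is something like

$$G\big(G(Y \mid \pi') \,\big|\, \pi\big) = G(Y \mid \pi) \quad \text{for } \pi' \supseteq \pi,$$

where $\pi'$ is a larger argument set, perhaps produced by a method the estimator doesn't know, i.e. an analogue of the law of iterated expectations $\mathbb{E}[\mathbb{E}[Y \mid \mathcal{F}'] \mid \mathcal{F}] = \mathbb{E}[Y \mid \mathcal{F}]$ for $\mathcal{F} \subseteq \mathcal{F}'$.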
OP doesn’t emphasize liability insurance enough but part of the hope is that you can mandate that companies be insured up to $X00 billion, which costs them less than $X00 billion assuming that they’re not likely to be held liable for that much. Then the hope is the insurance company can say “please don’t do extremely risky stuff or your premium goes up”.
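As a rough illustration with made-up numbers (nothing here is from the OP): if the mandated coverage is $L = \$300\text{B}$ and the insurer prices the annual probability of a payout that large at $p = 0.1\%$, then the actuarially fair annual premium is about

$$p \cdot L = 0.001 \times \$300\text{B} = \$300\text{M},$$

far less than the coverage amount. And since the premium scales with the insurer's assessment of the lab's risk, that's where the "please don't do extremely risky stuff or your premium goes up" leverage comes from.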
On the other hand, there's no clear criterion for when we would pause again after, say, a six-month pause in scaling.
Realized that I didn’t respond to this—PauseAI’s proposal is for a pause until safety can be guaranteed, rather than just for 6 months.
I believe AI pauses by governments would absolutely be more serious and longer, preventing overhangs from building up too much.
Are you saying that overhangs wouldn’t build up too much under pauses because the government wouldn’t let it happen, or that RSPs would have less overhang because they’d pause for less long so less overhang would build up? I can’t quite tell.
I’m not saying there’s no reason to think that RSPs are better or worse than pause, just that if overhang is a relevant consideration for pause, it’s also a relevant consideration for RSPs.
I’d imagine that RSP proponents think that if we execute them properly, we will simply not build dangerous models beyond our control, period.
I think pause proponents think similarly!
Do you see the same people invoking overhang as an argument against pauses and also talking about RSPs as though they are not also impacted?
I guess I'm not tracking this closely enough. I'm not really that focussed on any one arguer's individual priorities, but more on the discourse in general. Basically, I think that overhang is a consideration for unconditional pauses if and only if it's a consideration for RSPs, so it's a bad thing if overhang is brought up as an argument against unconditional pauses but not against RSPs, because that distorts the world's ability to figure out the costs and benefits of each kind of policy.
Also, to be clear, it’s not impossible that RSPs are all things considered better than unconditional pauses, and better than nothing, despite overhang. But if so, I’d hope someone somewhere would have written a piece saying “RSPs have the cost of causing overhang, but on net are worth it”.
I’m not saying that RSPs are or aren’t better than a pause. But I would think that if overhang is a relevant consideration for pauses, it’s also a relevant consideration for RSPs.
A complaint about AI pause: if we pause AI and then unpause, progress will then be really quick, because there’s a backlog of improvements in compute and algorithmic efficiency that can be immediately applied.
One definition of what an RSP is: if a lab makes observation O, then they pause scaling until they implement protection P.
Doesn’t this sort of RSP have the same problem with fast progress after pausing? Why have I never heard anyone make this complaint about RSPs? Possibilities:
They do and I just haven’t seen it
People expect “AI pause” to produce longer / more serious pauses than RSPs (but this seems incidental to the core structure of RSPs)
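To make the structural parallel concrete, here's a toy sketch (my framing, with made-up predicate names; it's not drawn from any actual RSP or pause proposal). Both policies have the shape "don't scale until some condition holds", and compute and algorithmic progress keep accumulating while scaling is disallowed:

```python
# Toy model of the two policy shapes (illustrative only; names are made up).

def pause_allows_scaling(months_since_pause_started: float,
                         pause_length_months: float = 6.0) -> bool:
    """Unconditional pause: resume scaling after a fixed period."""
    return months_since_pause_started >= pause_length_months


def rsp_allows_scaling(observation_o_triggered: bool,
                       protection_p_implemented: bool) -> bool:
    """RSP-style: once observation O fires, don't scale until protection P is in place."""
    return (not observation_o_triggered) or protection_p_implemented


def overhang_proxy(months_paused: float, monthly_efficiency_gain: float = 0.05) -> float:
    """Crude proxy for overhang: compounded efficiency gains not yet applied to frontier training."""
    return (1 + monthly_efficiency_gain) ** months_paused - 1
```

The overhang term only depends on how long scaling is disallowed, not on which condition triggered the pause, which is why it seems like a consideration for both.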
I continue to think that agent foundations research is kind of underrated. Like, we’re supposed to do mechinterp to understand the algorithm models implement—but how do we know what algorithms are good?
Also, my feedback is that some of the definitions seem kind of vague. Like, apparently an ultracontribution is “a mathematical object representing uncertainty over probability”—this tells me what it’s supposed to be, but doesn’t actually tell me what it is. The ones that actually show up in the text don’t seem too vague, partially because they’re not terms that are super precise.