My website is here.
bilalchughtai
Detecting Strategic Deception Using Linear Probes
Paper: Open Problems in Mechanistic Interpretability
What is the error message?
Yep, this sounds interesting! My suggestion for anyone wanting to run this experiment would be to start with SAD-mini, a subset of SAD containing the five most intuitive and simple tasks. It should be fairly easy to adapt our codebase to call the Goodfire API. Feel free to reach out to me or @L Rudolf L if you want assistance or guidance.
How do you know what “ideal behaviour” is after you steer or project out your feature? How would you differentiate between a feature with sufficiently high cosine similarity to a “true model feature” and the “true model feature” itself? I agree you can get some signal on whether a feature is causal, but would argue this is not ambitious enough.
Yes, that’s right—see footnote 10. We think that transcoders and crosscoders are directionally correct, in the sense that they leverage more of the model’s functional structure via activations from several sites, but agree that their vanilla versions suffer from similar problems to regular SAEs.
Also related to the idea that the best linear SAE encoder is not the transpose of the decoder.
A LW feature that I would find helpful is an easy-to-access list of all links cited by a given post.
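If anyone wanted to prototype this, a minimal sketch of the link-extraction step using only the Python standard library might look like the following (function and variable names here are hypothetical, and this assumes you already have the post body as HTML):

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every anchor tag in a post's HTML body."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def links_in_post(html: str) -> list[str]:
    """Return all hrefs cited in a post's HTML, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Example usage on a hypothetical post body:
post_html = (
    '<p>See <a href="https://example.com/a">this</a> '
    'and <a href="https://example.com/b">that</a>.</p>'
)
print(links_in_post(post_html))
```

A real implementation would presumably run against the post HTML returned by the site's API and deduplicate or group the links, but the core extraction is this simple.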
Activation space interpretability may be doomed
Agreed that this post presents the altruistic case.
I discuss both the money and status points in the “career capital” paragraph (though perhaps should have factored them out).
your image of a man with a huge monitor doesn’t quite scream “government policymaker” to me
Reasons for and against working on technical AI safety at a frontier AI lab
In fact, this mindset gave me burnout earlier this year.
I relate pretty strongly to this. I think almost all junior researchers are incentivised to ‘paper grind’ for longer than is correct. I do think there are pretty strong returns to having one good paper for credibility reasons; it signals that you are capable of doing AI safety research, and thus makes it easier to apply for subsequent opportunities.
Over the past 6 months I’ve dropped the paper-grind mindset and am much happier for it. Notably, were it not for short-term grants, where needing to visibly make progress is important, I would have made this update sooner. Another take I have is that if you have the flexibility to do so (e.g. by already having stable funding, perhaps via being a PhD student), front-loading learning seems good. See here for a related take by Rohin. Making progress on hard problems requires understanding things deeply, in a way that making progress on easy problems that you could complete during e.g. MATS might not.
You might want to stop using the Honey extension. Here are some shady things they do, beyond the usual:
Steal affiliate marketing revenue from influencers (whom they also often sponsor), by replacing the genuine affiliate referral cookie with their own affiliate referral cookie.
Deceive customers by deliberately withholding the best coupon codes while claiming to have found the best coupon codes on the internet; partner businesses control which coupon codes Honey shows consumers.
Book Summary: Zero to One
any update on this?
Remap your caps lock key
thanks! added to post
As a general rule, I try to minimise my phone screen time and maximise my laptop screen time. I can do every “productive” task faster on a laptop than on my phone.
Here are some object-level things I do that I find helpful and that I haven’t yet seen discussed.
Use a very minimalist app launcher on my phone, that makes searching for apps a conscious decision.
Use a greyscale filter on my phone (which is hard to turn off), as this makes doing most things on my phone harder.
Every time I get a notification I didn’t need to get, I instantly disable it. This also generalises to unsubscribing from emails I don’t need to receive.