Vikrant Varma

Karma: 819

Research Engineer at DeepMind.

Publications

MONA: Three Month Later—Updates and Steganography Without Optimization Pressure

David Lindner and Vikrant Varma

Apr 12, 2025, 11:15 PM

27 points

0 comments · 5 min read · LW link

Vikrant Varma Jan 27, 2025, 10:34 AM
1 point
in reply to: mattmacdermott’s comment on: MONA: Managed Myopia with Approval Feedback
We won’t be able to release the dataset directly but can make it easy to reproduce, and are looking into options now. Ping me in a week if I haven’t commented!

JumpReLU SAEs + Early Access to Gemma 2 SAEs

Senthooran Rajamanoharan, Tom Lieberum, nps29, Arthur Conmy, Vikrant Varma, János Kramár and Neel Nanda

Jul 19, 2024, 4:10 PM

48 points

10 comments · 1 min read · LW link

(storage.googleapis.com)

Improving Dictionary Learning with Gated Sparse Autoencoders

Senthooran Rajamanoharan, Arthur Conmy, lewis smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah and Neel Nanda

Apr 25, 2024, 6:43 PM

63 points

38 comments · 1 min read · LW link

(arxiv.org)

[Full Post] Progress Update #1 from the GDM Mech Interp Team

Neel Nanda, Arthur Conmy, lewis smith, Senthooran Rajamanoharan, Tom Lieberum, János Kramár and Vikrant Varma

Apr 19, 2024, 7:06 PM

79 points

10 comments · 8 min read · LW link

[Summary] Progress Update #1 from the GDM Mech Interp Team

Neel Nanda, Arthur Conmy, lewis smith, Senthooran Rajamanoharan, Tom Lieberum, János Kramár and Vikrant Varma

Apr 19, 2024, 7:06 PM

72 points

0 comments · 3 min read · LW link

Discussion: Challenges with Unsupervised LLM Knowledge Discovery

Seb Farquhar, Vikrant Varma, zac_kenton, gasteigerjo, Vlad Mikulik and Rohin Shah

Dec 18, 2023, 11:58 AM

147 points

21 comments · 10 min read · LW link

Explaining grokking through circuit efficiency

Vikrant Varma and Rohin Shah

Sep 8, 2023, 2:39 PM

101 points

11 comments · 3 min read · LW link

(arxiv.org)

Vikrant Varma Nov 28, 2022, 4:18 PM
LW: 6 AF: 2
AF
in reply to: Ramana Kumar’s comment on: Mechanistic anomaly detection and ELK
To add some more concrete counter-examples:
- deceptive reasoning is causally upstream of train output variance (e.g. because the model has read ARC’s post on anomaly detection), so is included in π.
- alien philosophy explains train output variance; unfortunately it also has a notion of object permanence we wouldn’t agree with, which the (AGI) robber exploits.

Refining the Sharp Left Turn threat model, part 2: applying alignment techniques

Vika, Vikrant Varma, Ramana Kumar and Rohin Shah

Nov 25, 2022, 2:36 PM

39 points

9 comments · 6 min read · LW link

(vkrakovna.wordpress.com)

Threat Model Literature Review

zac_kenton, Rohin Shah, David Lindner, Vikrant Varma, Vika, Mary Phuong, Ramana Kumar and Elliot Catt

Nov 1, 2022, 11:03 AM

78 points

4 comments · 25 min read · LW link

Clarifying AI X-risk

zac_kenton, Rohin Shah, David Lindner, Vikrant Varma, Vika, Mary Phuong, Ramana Kumar and Elliot Catt

Nov 1, 2022, 11:03 AM

127 points

24 comments · 4 min read · LW link · 1 review

More examples of goal misgeneralization

Rohin Shah and Vikrant Varma

Oct 7, 2022, 2:38 PM

56 points

8 comments · 2 min read · LW link

(deepmindsafetyresearch.medium.com)

Refining the Sharp Left Turn threat model, part 1: claims and mechanisms

Vika, Vikrant Varma, Ramana Kumar and Mary Phuong

Aug 12, 2022, 3:17 PM

86 points

4 comments · 3 min read · LW link · 1 review

(vkrakovna.wordpress.com)

Vikrant Varma May 10, 2022, 4:36 PM
1 point
AF
on: Knowledge is not just mutual information
Thanks for this sequence!
I don’t understand why the computer case is a counterexample for mutual information. Doesn’t it depend on your priors (which don’t know anything about the other background noise interacting with photons)?
Taking the example of a one-time pad, given two random bit strings A and B, if C = A ⊕ B, learning C doesn’t tell you anything about A unless you already have some information about B. So I(C; A) = 0 when B is uniform and independent of A.
“Over time, the photons bouncing off the object being sought and striking other objects will leave an imprint in every one of those objects that will have high mutual information with the position of the object being sought.”
If our prior was very certain about any factors that could interact with photons, then indeed the resulting imprints would have high mutual information, but it seems like you can rescue mutual information here by saying that our prior is uncertain about these other factors so the resulting imprints are noisy as well.
On the other hand, it seems correct that an entity that did have a more certain prior over interacting factors would see photon imprints as accumulating knowledge (for example photographic film).
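
The one-time pad claim above can be checked numerically. The following is a minimal sketch, assuming A and B are single independent uniform bits (the comment speaks of bit strings; one bit is enough to show the effect). The helper name `mutual_information` and the variable names are illustrative and do not come from the original comment.

```python
from collections import Counter
from itertools import product
from math import log2


def mutual_information(pairs):
    """Return I(X; Y) in bits for a list of equally likely (x, y) outcomes."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


# Assumption: A and B are independent uniform bits, and C = A XOR B.
outcomes = [(a, b, a ^ b) for a, b in product((0, 1), repeat=2)]

# Learning C alone: how much does it say about A?
i_c_a = mutual_information([(c, a) for a, b, c in outcomes])

# Learning C together with B: treat (C, B) as a single joint variable.
i_cb_a = mutual_information([((c, b), a) for a, b, c in outcomes])

print(f"I(C; A)    = {i_c_a:.1f} bits")
print(f"I(C, B; A) = {i_cb_a:.1f} bits")
```

Under these assumptions the first quantity is zero and the second is one full bit, which matches the point that C only carries information about A for an observer who already knows something about B.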

ELK contest submission: route understanding through the human ontology

Vika, Ramana Kumar and Vikrant Varma

Mar 14, 2022, 9:42 PM

21 points

2 comments · 2 min read · LW link