RobertKirk

Karma: 319

PhD student at UCL DARK doing RL, OOD robustness and safety. Interested in self-improvement.

A Sober Look at Steering Vectors for LLMs

Joschka Braun, Dmitrii Krasheninnikov, Usman Anwar, RobertKirk, Daniel Tan and David Scott Krueger (formerly: capybaralet)

Nov 23, 2024, 5:30 PM

38 points

0 comments, 5 min read, LW link

Speculative inferences about path dependence in LLM supervised fine-tuning from results on linear mode connectivity and model souping

RobertKirk

Jul 20, 2023, 9:56 AM

39 points

2 comments, 5 min read, LW link

Causal confusion as an argument against the scaling hypothesis

RobertKirk and David Scott Krueger (formerly: capybaralet)

Jun 20, 2022, 10:54 AM

86 points

30 comments, 15 min read, LW link

Sparsity and interpretability?

Ada Böhm, RobertKirk and Tomáš Gavenčiak

Jun 1, 2020, 1:25 PM

41 points

3 comments, 7 min read, LW link

How can Interpretability help Alignment?

RobertKirk and Tomáš Gavenčiak

May 23, 2020, 4:16 PM

37 points

3 comments, 9 min read, LW link

What is Interpretability?

RobertKirk, Tomáš Gavenčiak and Ada Böhm

Mar 17, 2020, 8:23 PM

39 points

1 comment, 11 min read, LW link