Thomas Kwa

Karma: 4,166

Was on Vivek Hebbar’s team at MIRI, now working with Adrià Garriga-Alonso on various empirical alignment projects.

I’m looking for projects in interpretability, activation engineering, and control/oversight; DM me if you’re interested in working with me.

Goodhart in RL with KL: Appendix

Thomas Kwa · 18 May 2024 0:40 UTC
9 points
0 comments · 6 min read · LW link

Catastrophic Goodhart in RL with KL penalty

15 May 2024 0:58 UTC
46 points
7 comments · 7 min read · LW link