For my own clarity: What is the difference between mathematical approaches to alignment and other technical approaches like mechanistic interpretability work?
I imagine the focus is on in-principle arguments or proofs regarding the capabilities of a given system rather than empirical or behavioural analysis, but you mention RL, so I just wanted to get some colour on this.
Any clarification here would be helpful!
You are more or less right. By “mathematical approaches”, we mean approaches focused on building mathematical models relevant to alignment/agency/learning and finding non-trivial theorems (or at least conjectures) about these models. I’m not sure what the word “but” is doing in “but you mention RL”: there is a rich literature of mathematical inquiry into RL. For a few examples, see everything under the bullet “reinforcement learning theory” in the LTA reading list.
Thanks for the pointer! Yes, RL has a lot of research of this kind; as an empirical researcher I sometimes get stuck in translation.