Yeah it’s really cool! It’s David Scott Krueger, who’s doing a lot of work bringing theories from the LW alignment community into mainstream ML. This preference shift argument seems similar to the concept of gradient hacking, though it doesn’t require the presence of a mesa optimizer. I’d love to write a post summarizing this recent work and discussing its relevance to long-term safety if you’d be interested in working on it together.
Flattered you ask, but I estimate that I’ll be either very busy with my own projects or on mental-health vacation until the end of the year. But unless you’re completely saturated with connections, I’d be happy to have a 1:1 conversation sometime after October 25th? Just for exploration purposes, not for working on a particular project.