Simple example: If YouTube can turn you into an ideological extremist, you’ll probably watch more YouTube videos. See these two recent papers by people interested in AI safety for more detail:
https://openreview.net/pdf?id=mMiKHj7Pobj
https://arxiv.org/abs/2204.11966
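To make the incentive concrete, here is a toy sketch (not from either paper; the numbers and engagement model are made up) of why an engagement-maximizing recommender is rewarded for shifting a user's preferences rather than leaving them alone:

```python
def total_engagement(shift_per_step, steps=50):
    """Toy model: a user's taste for extreme-but-engaging content drifts
    under the recommender's influence, and per-step engagement grows with
    that taste. All numbers here are illustrative assumptions."""
    preference = 0.1  # initial taste for extreme content, in [0, 1]
    engagement = 0.0
    for _ in range(steps):
        # The recommender nudges the user's preference each round.
        preference = min(1.0, preference + shift_per_step)
        # Assumed engagement model: shifted users watch more.
        engagement += 1.0 + 2.0 * preference
    return engagement

# A policy that shifts preferences collects more long-run engagement,
# so a recommender optimized on engagement is rewarded for the shift.
print(total_engagement(shift_per_step=0.0))   # 60.0
print(total_engagement(shift_per_step=0.02))  # ~110, noticeably higher
```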
Thanks! This is exactly what I was looking for. Seems like I should have googled “recommender systems” and “preference shifts”.
Edit: The openreview paper is so good. Do you know who the authors are?
Yeah, it’s really cool! It’s David Scott Krueger, who’s doing a lot of work bringing theories from the LW alignment community into mainstream ML. This preference shift argument seems similar to the concept of gradient hacking, though it doesn’t require the presence of a mesa optimizer. I’d love to write a post summarizing this recent work and discussing its relevance to long-term safety, if you’d be interested in working on it together.
I’m flattered you asked, but I expect I’ll be either very busy with my own projects or on a mental-health vacation until the end of the year. Still, unless you’re completely saturated with connections, I’d be happy to have a 1:1 conversation sometime after October 25th? Just for exploration purposes, not for working on a particular project.