I’m confused that this idea is framed as an alternative to impact measures, because I thought the main point of impact measures is “prevent catastrophe” and this doesn’t aim to do that.
I didn’t mean to frame it as an alternative to impact measures, though it does achieve some of the things that impact measures achieve. Partly I wrote this post to explicitly say that I don’t imagine RLSP being a drop-in replacement for impact measures, even though it might seem like that could be true. I guess I didn’t communicate that effectively.
In the AI that RLSP might be a component of, what is doing the “prevent catastrophe” part?
That depends more on the AI part than on RLSP. I think the actual contribution here is the observation that the state of the world tells us a lot about what humans care about, and the RLSP algorithm is meant to demonstrate that it is in principle possible to extract those preferences.
If I were forced to give an answer to this question, it would be that RLSP would form a part of a norm-following AI, and that because the AI was following norms it wouldn’t do anything too crazy. However, RLSP doesn’t solve any of the theoretical problems with norm-following AI.
But the real answer is that this is an observation that seems important, but I don’t have a story for how it leads to us solving AI safety.
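To make that concrete, here’s a toy sketch of the underlying idea (nothing like the actual RLSP implementation, which doesn’t enumerate candidates like this): assume a human acted roughly optimally for a while before we showed up, simulate that past behaviour under each candidate reward, and see which candidates make the observed state likely. The tiny MDP, the candidate rewards, and all the constants below are made up purely for illustration.

```python
# Toy illustration: which candidate rewards make the observed state likely,
# assuming a human acted Boltzmann-rationally for T steps before we arrived?
# The MDP, candidate rewards, and constants are all invented for this sketch.
import numpy as np

n_states, n_actions, T, beta = 3, 2, 5, 2.0

# Transitions P[s, a, s']: action 0 stays put, action 1 moves to the next state.
P = np.zeros((n_states, n_actions, n_states))
for s in range(n_states):
    P[s, 0, s] = 1.0
    P[s, 1, (s + 1) % n_states] = 1.0

def boltzmann_policy(reward, gamma=0.9, iters=50):
    """Soft value iteration -> Boltzmann-rational policy pi[s, a]."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        V = (1.0 / beta) * np.log(np.exp(beta * Q).sum(axis=1))  # soft max over actions
        Q = reward[:, None] + gamma * (P @ V)
    pi = np.exp(beta * Q)
    return pi / pi.sum(axis=1, keepdims=True)

def likelihood_of_observed_state(reward, s_obs):
    """P(state at time T is s_obs | the human followed the Boltzmann policy for `reward`)."""
    pi = boltzmann_policy(reward)
    step = np.einsum('sa,sap->sp', pi, P)           # state-to-state transitions under pi
    state_dist = np.full(n_states, 1.0 / n_states)  # uniform over where the past started
    for _ in range(T):
        state_dist = state_dist @ step
    return state_dist[s_obs]

# Hypothetical candidate rewards: candidate i says "state i is the good state".
candidates = [np.eye(n_states)[i] for i in range(n_states)]
prior = np.ones(len(candidates)) / len(candidates)

s_obs = 2  # the state we find the world in
posterior = prior * np.array([likelihood_of_observed_state(r, s_obs) for r in candidates])
posterior /= posterior.sum()
print(posterior)  # should put most of its mass on the candidate that favours state 2
```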
Can you also compare the pros and cons of this idea with other related ideas, for example large-scale IRL? (I’m imagining attaching recording devices to lots of people, recording their behavior over, say, months or years, and feeding that to IRL.)
Any scenario I construct with RLSP has clear problems, and large-scale IRL has clear problems too. If you provide particular scenarios, I could analyze those.
For example, if you think just of literally running RLSP with a time horizon of a year vs. large-scale IRL over a year, and then optimizing the resulting utility function, large-scale IRL should do better because it has way more data to work with.
It seems like there’s gotta be a principled way to combine this idea with inverse reward design. Is that something you’ve thought about?
Yeah, I agree they feel very composable. The main issue is that the observation model in IRD requires a notion of a “training environment” that’s separate from the real world, whereas RLSP assumes that there is one complex environment in which you are acting.
Certainly if you first trained your AI system in some training environments and then deployed it in the real world, you could use IRD during training to get a distribution over reward functions, and then use that distribution as your prior when running RLSP. It’s maybe plausible that if you did this you could simply optimize the resulting reward function rather than doing risk-averse planning (which is how IRD gets the robot to avoid lava); that would be cool. It’s hard to test because none of the IRD environments satisfy the key assumption of RLSP (that humans have optimized the environment for their preferences).
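As a rough sketch of what that composition could look like (continuing the toy example above, with a made-up IRD output), the IRD distribution would simply replace the uniform prior in the update:

```python
# Hypothetical composition: IRD on training environments has produced a
# distribution over the same candidate rewards; use it as the RLSP prior.
# The numbers in ird_posterior are invented for illustration.
ird_posterior = np.array([0.1, 0.3, 0.6])
likelihoods = np.array([likelihood_of_observed_state(r, s_obs) for r in candidates])

combined = ird_posterior * likelihoods  # IRD output acting as the prior for RLSP
combined /= combined.sum()
print(combined)
```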
> Partly I wrote this post to explicitly say that I don’t imagine RLSP being a drop-in replacement for impact measures, even though it might seem like that could be true.
Ah ok, it didn’t occur to me in the first place that RLSP could be a replacement for impact measures (it seemed more closely related to IRL), so the comparisons you did with impact measures made me think you were trying to frame it that way.
> For example, if you think just of literally running RLSP with a time horizon of a year vs. large-scale IRL over a year, and then optimizing the resulting utility function, large-scale IRL should do better because it has way more data to work with.
I guess I wasn’t thinking in terms of a fixed time horizon, but more like: given similar budgets, can you do more with RLSP or with IRL? For example, it seems like increasing the time horizon of RLSP might be cheaper than doing large-scale IRL over a longer period of time, so maybe RLSP could actually work with more data at a similar budget.
Also, more data has diminishing returns after a while, so I was also asking: if you had the budget to do large-scale IRL over, say, a year, and you could do RLSP over a longer time horizon and get more data as a result, do you think that would give RLSP a significant advantage over IRL?
(But maybe these questions aren’t very important if the main point here isn’t offering RLSP as a concrete technique for people to use but more that “state of the world tells us a lot about what humans care about”.)
> (But maybe these questions aren’t very important if the main point here isn’t offering RLSP as a concrete technique for people to use but more that “state of the world tells us a lot about what humans care about”.)
Yeah, I think that’s basically my position.
But to try to give an answer anyway: I suspect that the benefits of having a lot of data via large-scale IRL will make it significantly outperform RLSP, even if you could get a longer time horizon with RLSP. There might be weird effects where the RLSP reward is less Goodhart-able (since it tends to prioritize keeping the state the same), which would make it better to maximize even though it captures fewer aspects of “what humans care about”. On the other hand, RLSP is much more fragile: slight errors in the dynamics, features, or action space will lead to big errors in the inferred reward. I would guess this is less true of large-scale IRL, so in practice large-scale IRL would probably still be better. But both would be bad.