Partly I wrote this post to explicitly say that I don’t imagine RLSP being a drop-in replacement for impact measures, even though it might seem like that could be true.
Ah ok, it didn't occur to me in the first place that RLSP could be a replacement for impact measures (it seemed more closely related to IRL), so the comparisons you made with impact measures made me think you were trying to frame it as a possible replacement.
For example, if you literally think just of running RLSP with a time horizon of a year vs. large-scale IRL over a year and optimizing the resulting utility function, large-scale IRL should do better because it has way more data to work with.
I guess I wasn't thinking in terms of a fixed time horizon, but more like: given similar budgets, can you do more with RLSP or with IRL? For example, it seems like increasing the time horizon of RLSP might be cheaper than doing large-scale IRL over a longer period of time, so maybe RLSP could actually work with more data at a similar budget.
Also, more data has diminishing returns after a while, so I was also asking: if you had the budget to do large-scale IRL over, say, a year, and you could do RLSP over a longer time horizon and get more data as a result, do you think that would give RLSP a significant advantage over IRL?
(But maybe these questions aren’t very important if the main point here isn’t offering RLSP as a concrete technique for people to use but more that “state of the world tells us a lot about what humans care about”.)
Yeah, I think that’s basically my position.
But to try to give an answer anyway: I suspect that the benefits of having a lot of data via large-scale IRL will make it significantly outperform RLSP, even if you could get a longer time horizon for RLSP. There might be weird effects where the RLSP reward is less Goodhart-able (since it tends to prioritize keeping the state the same), which could make the RLSP reward better to maximize even though it captures fewer aspects of "what humans care about". On the other hand, RLSP is much more fragile: slight errors in the dynamics, features, or action space will lead to big errors in the inferred reward. I would guess this is less true of large-scale IRL, so in practice I'd expect large-scale IRL to still be better. But both would be bad.
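To make the "state of the world tells us a lot about what humans care about" point concrete, here is a rough sketch of the simulating-the-past idea. It is not the algorithm from the paper (which does gradient ascent on the likelihood of the observed state under a maximum-causal-entropy model and marginalizes over possible initial states); the gridworld, the known start state, the two hand-picked candidate reward weightings, and the Boltzmann model of the human are all assumptions made purely for illustration.

```python
# Toy sketch of the core RLSP idea: infer what a human cared about from a
# single observed state, by simulating how they could have acted in the past.
# Setup: a 2x5 gridworld. The human walked from a door at (0, 0) to a desk at
# (0, 4); walking through (0, 2) would have broken a vase. We observe the
# final state (human at the desk, vase intact) and ask which candidate reward
# explains it better. All of this is illustrative, not the paper's algorithm.
import numpy as np

ROWS, COLS = 2, 5
VASE_CELL = (0, 2)            # stepping onto this cell breaks the vase
GOAL = (0, COLS - 1)          # the desk
START = (0, 0, 1)             # (row, col, vase_intact)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]  # up, down, left, right, stay
HORIZON = 10                  # how long the human acted before we saw the state
BETA = 5.0                    # Boltzmann rationality of the simulated human
STEP_COST = 0.1               # small cost per move, so detours are not free

STATES = [(r, c, v) for r in range(ROWS) for c in range(COLS) for v in (0, 1)]

def step(state, action):
    r, c, v = state
    dr, dc = action
    nr = min(max(r + dr, 0), ROWS - 1)
    nc = min(max(c + dc, 0), COLS - 1)
    if (nr, nc) == VASE_CELL:
        v = 0                 # the vase breaks and stays broken
    return (nr, nc, v)

def reward(state, action, theta):
    w_goal, w_vase_broken = theta
    r, c, v = state
    rew = 0.0 if action == (0, 0) else -STEP_COST
    rew += w_goal * ((r, c) == GOAL)    # per-step bonus for being at the desk
    rew += w_vase_broken * (1 - v)      # per-step penalty while the vase is broken
    return rew

def boltzmann_policies(theta):
    """Finite-horizon value iteration with a Boltzmann policy at each timestep."""
    V = {s: 0.0 for s in STATES}
    policies = []
    for _ in range(HORIZON):
        Q = {s: np.array([reward(s, a, theta) + V[step(s, a)] for a in ACTIONS])
             for s in STATES}
        pi = {}
        for s in STATES:
            p = np.exp(BETA * (Q[s] - Q[s].max()))
            pi[s] = p / p.sum()
        V = {s: float(np.dot(pi[s], Q[s])) for s in STATES}
        policies.append(pi)
    return list(reversed(policies))     # policies[t] = policy used at timestep t

def state_likelihood(observed, theta):
    """P(observed state after HORIZON steps | the human was optimizing theta)."""
    policies = boltzmann_policies(theta)
    dist = {START: 1.0}
    for t in range(HORIZON):
        nxt = {}
        for s, p in dist.items():
            for a, pa in zip(ACTIONS, policies[t][s]):
                ns = step(s, a)
                nxt[ns] = nxt.get(ns, 0.0) + p * pa
        dist = nxt
    return dist.get(observed, 0.0)

# Observed state: the human is at the desk and the vase is still intact.
observed = (GOAL[0], GOAL[1], 1)
candidates = {"cares only about the desk": (1.0, 0.0),
              "cares about desk and vase": (1.0, -2.0)}
for name, theta in candidates.items():
    print(f"{name:30s}  P(observed state) = {state_likelihood(observed, theta):.3f}")
```

Running it should assign the observed intact-vase state a much higher likelihood under the reward that penalizes breaking the vase, since a goal-only human would mostly have taken the shorter path through it. It also hints at the fragility point above: the inference only works to the extent that the simulated dynamics, features, and action space match how the human could actually have behaved.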