Given, as in the naive reinforcement learning framework (and in the more complex notions of value it can approximate), that the value is in the environment,
I’m confused about what this means.
It means that the agent maximizes the cumulative sum of a function of the environment's states, and that function is revealed to the agent only for the states it actually visits.
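A minimal sketch of that setup might look like the following (all names and the toy dynamics are hypothetical, just to illustrate "reward revealed only on visit" in a tiny MDP):

```python
import random

# Toy MDP: states 0..4 on a line, with a reward function the agent
# cannot see until it actually visits a state.
HIDDEN_REWARD = {0: 0.0, 1: 1.0, 2: -1.0, 3: 5.0, 4: 0.5}

def step(state, action):
    """Deterministic toy transition: move left (-1) or right (+1)."""
    return max(0, min(4, state + action))

def run_episode(policy, start=0, horizon=10):
    """Accumulate reward; each state's reward is learned only at the
    moment the agent arrives there."""
    seen_rewards = {}              # the agent's knowledge so far
    state, total = start, 0.0
    for _ in range(horizon):
        action = policy(state, seen_rewards)
        state = step(state, action)
        reward = HIDDEN_REWARD[state]   # revealed only on visit
        seen_rewards[state] = reward
        total += reward
    return total, seen_rewards

def greedy(state, seen):
    """Head toward the best reward seen so far; explore randomly
    when nothing better is known."""
    if seen:
        best = max(seen, key=seen.get)
        if best != state:
            return 1 if best > state else -1
    return random.choice([-1, 1])

total, seen = run_episode(greedy)
```

The point of the sketch is just the information structure: `HIDDEN_REWARD` exists in the environment the whole time, but the agent's `seen_rewards` only ever contains entries for states it has passed through.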
I’m afraid this didn’t clarify anything for me. Sorry! Pretend you’re explaining it to someone stupid.
Actually, I think I understand it, but only with moderate confidence, and I have no prior experience with this particular terminology. So let me take this opportunity to put my thoughts on the firing line. :-) Whether they get shot down or confirmed, either way I’ll have learned something about my ability to figure out things at first glance.
So: we’re agents and we live in an environment.
We value certain things in the environment, and try to make decisions so that the environment arrives at the states we like, the states that have more value. Furthermore, as time passes the environment will continue to change state whether we like it or not. So we don’t just want the environment to arrive at a given high-value state, we want to think longer term: we want the environment to keep on going through highly valuable states forever.
However, there’s a problem: our prediction abilities are crappy. We can estimate how much value a given environmental state will have, but we won’t really know until it’s arrived. We know even less about states that are farther away in the future.
So the OP’s argument is that we need to be careful about setting loose a superintelligent (but still not perfectly intelligent) FAI that rewrites the universe from scratch, because it might accidentally exclude a path farther down the line that’s even more valuable than the near, fairly-high-value path it can predict. There are similar problems with less superpowerful FAIs, since they’ll also guide humanity into the best path they can predict, which might not be as good as a weirder path farther out that it cannot.
How close am I?
You are spot on, though you provided more context than can be traced directly from the cited sentence. When I referred to naive RL, I had in mind (PO)MDPs with an unknown reward function. The reward of an unseen state can be predicted only in the sense of Occam's-Razor-type induction.
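One way to picture "Occam's-Razor-type induction" over rewards (a toy sketch, not the formal Solomonoff construction; the hypothesis list and the crude description-length prior are invented for illustration): weight candidate reward functions by simplicity, discard those contradicted by observed rewards, and predict an unseen state's reward as the weighted average over the survivors.

```python
# Toy Occam-style induction over reward functions.
# Each hypothesis is (name, rule, crude description length); the
# simplicity prior is 2 ** -length.
HYPOTHESES = [
    ("zero",   lambda s: 0.0,           1),
    ("id",     lambda s: float(s),      2),
    ("parity", lambda s: float(s % 2),  3),
    ("square", lambda s: float(s * s),  4),
]

def predict_reward(observations, state):
    """Prior-weighted average of predictions from every hypothesis
    consistent with all observed (state, reward) pairs."""
    weights, preds = [], []
    for _name, rule, length in HYPOTHESES:
        if all(rule(s) == r for s, r in observations.items()):
            weights.append(2.0 ** -length)
            preds.append(rule(state))
    if not weights:
        return 0.0  # no surviving hypothesis: fall back to a default
    return sum(w * p for w, p in zip(weights, preds)) / sum(weights)

# After seeing reward 1.0 at state 1 and 2.0 at state 2, only the
# "id" hypothesis survives, so state 4 is predicted to be worth 4.0.
prediction = predict_reward({1: 1.0, 2: 2.0}, 4)
```

With no observations at all, every hypothesis survives and the simplest ones dominate the prediction, which is the Occam flavor of the inference.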