The technical appendix felt more difficult than the previous posts, but I had the advantage of having tried to read the paper from the preceding post yesterday, and I managed to reconstruct the graph & gamma correctly.
The early part is slightly confusing, though. I thought AU was something that belongs to an agent’s goal, but the picture made it look as if it’s a property of the object (“how fertile is the soil?”). Is the idea here that the soil-AU is slang for “AU of goal ‘plant stuff here’”?
I did interpret the first exercise as “you planned to go to the moon” and came up with stuff like “how valuable are the stones I can take home” and “how pleasant will it be to hang around.”
One thing I noticed is that the formal policies don’t allow for all possible “strategies.” In the graph we had to reconstruct, I can’t start at s1, stay at s1 for one step, and then go to s3, because a policy has to pick the same action every time it is in s1. So you could think of the larger set Π_L where the policies are allowed to depend on the time step. But I assume there’s no point unless the reward function also depends on the time step. (I don’t know anything about MDPs.)
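To make the distinction concrete, here is a minimal sketch of what I mean (hypothetical state and action names, nothing from the post):

```python
# A stationary policy maps each state to one action, so it must act the
# same way on every visit to s1. A time-dependent policy from the larger
# set Pi_L can stay at s1 on step 0 and leave on step 1.

stationary_policy = {"s1": "stay", "s3": "stay"}  # one fixed action per state

def time_dependent_policy(t, state):
    # The action may depend on the time step t, not just on the state.
    if state == "s1":
        return "stay" if t == 0 else "move_to_s3"
    return "stay"
```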
Am I correct that a deterministic transition function is a function T:S×A→S and a non-deterministic one is a function T:S×A×S→[0,1]?
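In case it helps, the toy picture I have in mind (a made-up two-state example, not from the post):

```python
# Deterministic T: S x A -> S, i.e. each (state, action) pair has exactly one successor.
def T_det(state, action):
    table = {("s1", "stay"): "s1", ("s1", "move"): "s3", ("s3", "stay"): "s3"}
    return table[(state, action)]

# Non-deterministic T: S x A x S -> [0, 1], i.e. a probability for each possible successor.
def T_stoch(state, action, next_state):
    dist = {("s1", "move"): {"s3": 0.9, "s1": 0.1}}
    return dist.get((state, action), {}).get(next_state, 0.0)

assert T_det("s1", "move") == "s3"
assert T_stoch("s1", "move", "s3") == 0.9
```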
Is the idea here that the soil-AU is slang for “AU of goal ‘plant stuff here’”?
yes
One thing I noticed is that the formal policies don’t allow for all possible “strategies.”
yeah, this is because those are “nonstationary” policies—you change your mind about what to do at a given state. A classic result in MDP theory is that you never need these policies to find an optimal policy.
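If it helps intuition, here is a toy illustration (my own made-up numbers, not from the paper): value iteration assigns one value per state, and the greedy policy read off those values picks a single action per state, so it is stationary by construction.

```python
# Toy value iteration on a made-up 2-state, 2-action MDP.
GAMMA = 0.9
STATES = ["s1", "s3"]
ACTIONS = ["stay", "move"]
transition = {("s1", "stay"): "s1", ("s1", "move"): "s3",
              ("s3", "stay"): "s3", ("s3", "move"): "s1"}
reward = {("s1", "stay"): 0.0, ("s1", "move"): 1.0,
          ("s3", "stay"): 2.0, ("s3", "move"): 0.0}

V = {s: 0.0 for s in STATES}
for _ in range(1000):
    V = {s: max(reward[(s, a)] + GAMMA * V[transition[(s, a)]] for a in ACTIONS)
         for s in STATES}

# The greedy policy is a plain state -> action map: the same action at a
# state on every visit, yet it is optimal for this (stationary) reward.
policy = {s: max(ACTIONS, key=lambda a: reward[(s, a)] + GAMMA * V[transition[(s, a)]])
          for s in STATES}
print(policy)  # {'s1': 'move', 's3': 'stay'}
```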
Am I correct that a deterministic transition function is a function T:S×A→S and a non-deterministic one is a function T:S×A×S→[0,1]?
yup!