Seems like you can have a yet-simpler policy by factoring the fixed “simple objective(s)” into implicit, modular elements that compress many different objectives useful across many environments. Then at runtime, you feed the environmental state into your factored representation of possible objectives and produce a mix of objectives tailored to your current environment, which steers towards behaviors that achieved high reward on training runs similar to the current environment.
Can you explain why this policy is yet-simpler? It sounds more complicated to me.
I’m saying that it’s simpler to have a goal generator that can be conditioned on the current environment, rather than memorizing each goal individually.
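To make the compression intuition concrete, here is a minimal sketch (in Python; all names, shapes, and numbers are hypothetical, chosen only for illustration) of the two policies being compared: one that memorizes a separate goal for every training environment, and one that conditions a small goal generator on environment features to mix a shared bank of objective components. The simplicity claim is just that the generator's parameter count stays fixed while the lookup table grows with the number of environments.

```python
# Purely illustrative sketch: memorized per-environment goals vs. a goal
# generator that mixes a small set of shared objective components,
# conditioned on the current environment's features.

import numpy as np

N_ENVS = 1000          # distinct training environments (hypothetical)
GOAL_DIM = 32          # dimensionality of a goal representation
N_COMPONENTS = 8       # shared "factored" objective components
ENV_FEATURES = 16      # dimensionality of environment features

rng = np.random.default_rng(0)

# Option A: memorize a separate goal vector for every environment.
memorized_goals = rng.normal(size=(N_ENVS, GOAL_DIM))
params_memorized = memorized_goals.size            # grows linearly with N_ENVS

# Option B: a goal generator. A small conditioning map turns environment
# features into mixing weights over a fixed bank of objective components.
components = rng.normal(size=(N_COMPONENTS, GOAL_DIM))       # shared basis
conditioner = rng.normal(size=(ENV_FEATURES, N_COMPONENTS))  # env -> weights
params_generator = components.size + conditioner.size        # independent of N_ENVS

def generate_goal(env_features: np.ndarray) -> np.ndarray:
    """Produce an environment-tailored goal as a softmax mix of shared components."""
    weights = env_features @ conditioner          # (N_COMPONENTS,)
    weights = np.exp(weights - weights.max())
    weights /= weights.sum()                      # normalize mixing weights
    return weights @ components                   # (GOAL_DIM,)

print("memorized params:", params_memorized)      # 32000 in this toy setup
print("generator params:", params_generator)      # 384 in this toy setup
```

On a simplicity (description-length) prior, Option B is favored once the environments share enough structure: its cost is roughly constant in the number of environments, whereas Option A pays for each goal separately.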