Would this be equivalent to an RL environment that scales down the per-wedding reward for repeated weddings?
What bothers me about this: suppose we have a different set of two RL choices:
Life saved +10
Murder −10
In this case we want the agent to choose policies that result in lives saved, with total mode collapse away from committing murder. The same holds for less edgy, more practical cases, such as:
box shelved correctly +0.1
human coworker potentially injured −10
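
To make the contrast concrete, here is a rough sketch of the two reward schemes as I understand them. Everything in it is my own assumption for illustration: the function name `total_reward`, the `diminishing` set, the geometric `decay` factor, and the wedding reward of 1.0 do not come from the original proposal. Only the box and injury values are the ones above.

```python
from collections import Counter


def total_reward(events, base_rewards, diminishing=(), decay=0.5):
    """Sum the rewards for an episode.

    Events named in `diminishing` pay less each time they repeat
    (base * decay**k on the k-th repeat). All other events always pay
    their full base reward. This is a hypothetical scheme, not a
    reference implementation of any particular RL environment.
    """
    seen = Counter()
    total = 0.0
    for event in events:
        base = base_rewards[event]
        # Scale down only the events we explicitly marked as diminishing.
        scale = decay ** seen[event] if event in diminishing else 1.0
        total += base * scale
        seen[event] += 1
    return total


# Assumed reward of 1.0 per wedding, halved on each repeat: 1 + 0.5 + 0.25.
weddings = total_reward(["wedding"] * 3, {"wedding": 1.0}, diminishing={"wedding"})

# Warehouse case from above: no scaling, so the penalty always dominates.
warehouse_rewards = {"box_shelved": 0.1, "coworker_injured": -10.0}
safe_shift = total_reward(["box_shelved"] * 3, warehouse_rewards)
unsafe_shift = total_reward(
    ["box_shelved", "box_shelved", "coworker_injured"], warehouse_rewards
)
```

Under these assumptions, the wedding reward flattens out with repetition, while in the warehouse case we never want the safe action to lose value relative to the unsafe one, which is the collapse toward one behavior that we actually want.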