Makes sense, I was thinking about rewards as a function of the next state rather than the current one.
I can still imagine that things would work if we replaced the difference in Q-values with the difference in the values of the autoencoded next state. If that were true, it would a) affect my interpretation of the results and b) potentially make it easier to answer your open questions by providing a simplified version of the problem.
Edit: I guess the “Chaos unfolds over time” property of the SafeLife environment makes it unlikely that this would work?
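To make the comparison concrete, here is a rough sketch of what I have in mind, next to the Q-value-difference penalty. All names (`q_aux`, `v_aux`, `encoder`) are hypothetical, and copying the env just stands in for a one-step model; this is only meant to pin down the proposal, not how the penalty is actually computed in the paper.

```python
import copy

def aup_penalty(q_aux, state, action, noop=0):
    """Penalty based on auxiliary Q-values: how much does taking `action`
    change the auxiliary Q-values relative to doing nothing?"""
    return sum(abs(q(state, action) - q(state, noop)) for q in q_aux)

def next_state_penalty(env, encoder, v_aux, action, noop=0):
    """Proposed simplification: compare auxiliary values of the *autoencoded
    next state* reached by `action` vs. by the no-op, ignoring long-run
    consequences. The env copies act as a crude one-step model."""
    env_a, env_noop = copy.deepcopy(env), copy.deepcopy(env)
    s_a, *_ = env_a.step(action)      # next observation under the action
    s_noop, *_ = env_noop.step(noop)  # counterfactual next observation under no-op
    z_a, z_noop = encoder(s_a), encoder(s_noop)
    return sum(abs(v(z_a) - v(z_noop)) for v in v_aux)
```

The second version only looks one step ahead, which is exactly why the “chaos unfolds over time” worry above applies: an action whose immediate encoded state looks harmless could still set off large downstream changes that only the Q-values would pick up.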
I went into the idea of evaluating on future state representations here: https://www.lesswrong.com/posts/5kurn5W62C5CpSWq6/avoiding-side-effects-in-complex-environments#bFLrwnpjq6wY3E39S (Not sure it is wise, though.)