I like this, it’s an explicit construction that demonstrates how you can play with the explore-exploit tradeoff in multiagent settings. Note that when α is set to be very high (the condition in which we get near-optimal outcomes in the limit), there is very little exploration, and so it will take a long time before we actually find the optimal outcome in the first place. It seems like this would make it hard to use in practice. Are you planning to use reflective oracles to replace the exploration here with reasoning about other agents? Or perhaps you think the slow exploration is not a problem for some reason?
The approach most similar to this that is actually practical is MCMC optimization, which is practical for, among other things, program synthesis. I did not use that for this post since it would require agents to compute counterfactuals, but it is significantly more efficient in practice.
I’m currently trying to solve the theoretical decision theory problem rather than create usable algorithms. It would be pretty great to have a mathematical formalization of the correct decision theory even if it takes exponential time to run.
The problem with using reflective oracles is that this means the agents’ actions are independent, but, based on this post, the agents’ actions need to be correlated with each other to get a high score, if each policy is a continuous functions of the utility function. Correlated reflective oracles seem like they might help, but they are essentially a superset of ordinary reflective oracles, so using them can’t ensure that actions are correlated.
There might be some variant of reflective oracles that does work; a problem with this is that the algorithm in this post is “sticky”, which seems like an obstacle to deriving an agent’s policy based on their beliefs about others.
I like this, it’s an explicit construction that demonstrates how you can play with the explore-exploit tradeoff in multiagent settings. Note that when α is set to be very high (the condition in which we get near-optimal outcomes in the limit), there is very little exploration, and so it will take a long time before we actually find the optimal outcome in the first place. It seems like this would make it hard to use in practice. Are you planning to use reflective oracles to replace the exploration here with reasoning about other agents? Or perhaps you think the slow exploration is not a problem for some reason?
The approach most similar to this that is actually practical is MCMC optimization, which is practical for, among other things, program synthesis. I did not use that for this post since it would require agents to compute counterfactuals, but it is significantly more efficient in practice.
I’m currently trying to solve the theoretical decision theory problem rather than create usable algorithms. It would be pretty great to have a mathematical formalization of the correct decision theory even if it takes exponential time to run.
The problem with using reflective oracles is that this means the agents’ actions are independent, but, based on this post, the agents’ actions need to be correlated with each other to get a high score, if each policy is a continuous functions of the utility function. Correlated reflective oracles seem like they might help, but they are essentially a superset of ordinary reflective oracles, so using them can’t ensure that actions are correlated.
There might be some variant of reflective oracles that does work; a problem with this is that the algorithm in this post is “sticky”, which seems like an obstacle to deriving an agent’s policy based on their beliefs about others.