Here is my attempt at a summary of (a standalone part of) the reasoning in this post.
An agent trying to get a lot of reward can get stuck (or at least waste data) when the actions that seem good don’t plug into the parts of the world/data stream that contain information about which actions are in fact good. That is, an agent that restricts its information about the reward+dynamics of the world to only its reward feedback will get less reward than it otherwise could.
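As a toy sketch of my own to illustrate this point (the specific setup, a bandit with occasional free observations of unchosen arms, is my invention, not from the post): an agent that learns only from the reward of the arm it pulls explores more slowly than one that also reads the extra information in its data stream, and so tends to collect less reward.

import random

random.seed(0)
TRUE_MEANS = [0.2, 0.5, 0.8]          # hidden Bernoulli reward probabilities
HORIZON = 2000

def pull(arm):
    return 1.0 if random.random() < TRUE_MEANS[arm] else 0.0

def run(use_side_info):
    counts = [1e-9] * 3               # tiny prior count to avoid divide-by-zero
    sums = [0.0] * 3
    total = 0.0
    for _ in range(HORIZON):
        # epsilon-greedy on current reward estimates
        if random.random() < 0.05:
            arm = random.randrange(3)
        else:
            arm = max(range(3), key=lambda a: sums[a] / counts[a])
        r = pull(arm)
        total += r
        counts[arm] += 1
        sums[arm] += r
        if use_side_info:
            # the data stream also reveals another arm's outcome this step;
            # the reward-only agent simply ignores this information
            other = random.randrange(3)
            counts[other] += 1
            sums[other] += pull(other)
    return total

print("reward-only agent:    ", run(use_side_info=False))
print("uses side information:", run(use_side_info=True))

The exact gap depends on the random seed, but the reward-only agent has to pay for its information with its own pulls, which is the “wasted data” above.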
One way an agent can try to get additional information is by deductive reasoning from propositions (provided it can relate sense data to world models to propositions). Sometimes the deductive reasoning it needs to do will only become apparent shortly before the result is required, so the reasoning needs to be fast.
The nice thing about logic is that you don’t need fresh data to produce test cases: you can make up puzzles! Since an agent will need fast deductive reasoning strategies, it may want to test how good those strategies are on puzzles it invents, to make sure they are fast and reliable (if it hasn’t already proved their reliability).
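Here is a small sketch of what I have in mind (again my own construction, not from the post): invent random propositional puzzles, get ground truth from a slow brute-force check, and use the invented puzzles to grade a faster but possibly unreliable heuristic.

import itertools, random

random.seed(1)
VARS = ["p", "q", "r"]

def random_clause():
    # a random 2-literal disjunction over VARS, e.g. (p or not r)
    a, b = random.sample(VARS, 2)
    return [(a, random.random() < 0.5), (b, random.random() < 0.5)]

def random_puzzle(n_clauses=4):
    return [random_clause() for _ in range(n_clauses)]

def satisfies(assignment, puzzle):
    return all(any(assignment[v] == sign for v, sign in clause) for clause in puzzle)

def brute_force_sat(puzzle):
    # slow but trustworthy: try every assignment
    for values in itertools.product([True, False], repeat=len(VARS)):
        if satisfies(dict(zip(VARS, values)), puzzle):
            return True
    return False

def fast_guess_sat(puzzle, tries=5):
    # fast, possibly unreliable heuristic: a few random assignments
    for _ in range(tries):
        assignment = {v: random.random() < 0.5 for v in VARS}
        if satisfies(assignment, puzzle):
            return True
    return False          # may wrongly report "unsatisfiable"

# grade the fast strategy on invented puzzles: no fresh data needed
disagreements = sum(
    brute_force_sat(p) != fast_guess_sat(p)
    for p in (random_puzzle() for _ in range(200))
)
print(f"fast heuristic disagreed with ground truth on {disagreements}/200 puzzles")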
In general, we should model the things we think agents are going to do, because that gives us a handle on reasoning about advanced agents. It is good to establish what we can about the behaviour of advanced boundedly-rational agents so that we can make progress on the alignment problem, etc.
To the extent this is a correct summary, I note that it’s not obvious to me that agents would sharpen their reasoning skills via test cases rather than by proving bounds on their performance and so on. Though I suppose either way they are using logic, so it doesn’t affect the claims of the post.