I was notified that I didn’t win a prize, so I figured I’d discuss what I proposed here in case it sparks any other ideas. The short version is that I proposed adding a new head that would be an intentional human simulator. During training, it would be penalized for telling the truth that the diamond was gone whenever there existed a lie that the humans would have believed instead. The result would hopefully be a head that acts like a human simulator. The actual reporter would then be penalized for using a similar amount of compute as the intentional human simulator, or for looking at similar nodes or node regions. The hope is that penalizing the reporter for acting like the intentional human simulator would make it more likely to do direct translation instead of human simulation.
This does have at least one counterexample, which I also proposed: the reporter could simply waste compute doing nothing to avoid matching the intentional human simulator’s compute, and could look at additional random nodes it doesn’t care about to avoid appearing to look at the same nodes. That said, I thought there was some possibility that having to do these things might end up incentivizing the reporter to act like a direct translator instead of a human simulator.
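A toy sketch of the penalty and of the counterexample, under assumptions that are not in the submission: node access is modeled as a set of node ids, similarity of node usage is measured with Jaccard overlap, similarity of compute is a simple ratio, and the function names and weights are all hypothetical.

```python
def node_overlap(reporter_nodes, simulator_nodes):
    """Jaccard overlap between the sets of predictor nodes each head reads."""
    union = reporter_nodes | simulator_nodes
    if not union:
        return 0.0
    return len(reporter_nodes & simulator_nodes) / len(union)


def compute_closeness(reporter_compute, simulator_compute):
    """1.0 when both heads use the same compute, shrinking toward 0.0 as they diverge."""
    larger = max(reporter_compute, simulator_compute)
    if larger == 0:
        return 1.0
    return 1.0 - abs(reporter_compute - simulator_compute) / larger


def similarity_penalty(reporter_nodes, reporter_compute,
                       simulator_nodes, simulator_compute,
                       node_weight=1.0, compute_weight=1.0):
    """Hypothetical term added to the reporter's loss; the weights are assumptions."""
    return (node_weight * node_overlap(reporter_nodes, simulator_nodes)
            + compute_weight * compute_closeness(reporter_compute, simulator_compute))


# Assumed toy numbers: the intentional human simulator reads nodes 0-9
# and uses 100 units of compute.
simulator_nodes = set(range(10))
simulator_compute = 100

# A reporter that is secretly the same human simulator gets the full penalty.
plain_copy = similarity_penalty(simulator_nodes, simulator_compute,
                                simulator_nodes, simulator_compute)

# The counterexample: the same algorithm, plus 30 irrelevant nodes
# and 300 units of wasted compute.
padded_nodes = simulator_nodes | set(range(100, 130))
padded_compute = simulator_compute + 300
disguised_copy = similarity_penalty(padded_nodes, padded_compute,
                                    simulator_nodes, simulator_compute)

print(plain_copy, disguised_copy)  # prints: 2.0 0.5
```

In this toy setup the disguised human simulator cuts its penalty from 2.0 to 0.5 without changing what it reports, which is the weakness the counterexample points at. Whether paying for that padding would ever be costly enough to favor direct translation depends on details this sketch does not model.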
Although I’m not sure why this wasn’t very promising, my guess is one of the following: the counterexample is too obvious and my proposal doesn’t gain much ground in keeping the reporter from acting like a human simulator; someone else has already thought of this approach; or my counterexample is too similar to the counterexample to “penalize reporters that work with many different predictors,” where the reporter could just pretend not to work with other predictors (it’s similar in that the reporter could pretend not to look like the intentional human simulator).
Here’s my full submission in Google Docs, with more description: https://docs.google.com/document/d/1Xa4CDLNJ-VPT7hqEUIHlqCsXVeFCYDB5h7Vn3QJ_qpA/edit?usp=sharing
Thanks for sharing your idea!