You need to ensure substantial probability on exploring good strategies, which is a much stronger property than just avoiding mode collapse. (Literally avoiding mode collapse and assigning some probability to all good actions is easy: just add a small weight on a uniform prior over tokens, like they did in old-school Atari RL.)
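For concreteness, here is a minimal sketch of the "small weight on a uniform prior" trick mentioned above, assuming the policy's output is a plain list of token probabilities. The function name `mix_with_uniform` and the `epsilon` parameter are illustrative choices, not taken from any particular RL codebase.

```python
def mix_with_uniform(probs, epsilon=0.01):
    """Blend a policy distribution with a uniform distribution.

    probs:   list of probabilities over tokens/actions, summing to 1.
    epsilon: hypothetical mixing weight placed on the uniform prior.
    """
    n = len(probs)
    # Every token gets at least epsilon / n probability mass.
    return [(1.0 - epsilon) * p + epsilon / n for p in probs]


# A fully collapsed policy: all mass on a single token.
collapsed = [1.0, 0.0, 0.0, 0.0]
mixed = mix_with_uniform(collapsed, epsilon=0.04)
```

After mixing, every token has nonzero probability, so mode collapse in the literal sense is gone. But each non-preferred token only gets `epsilon / n`, which is tiny for a large vocabulary and shrinks geometrically over multi-token strategies, so this is far from putting substantial probability on good strategies.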
Yeah, what I really had in mind with “avoiding mode collapse” was something more complex, but it seems tricky to spell out precisely.
Even if the model is superhuman, if we do a good job with exploration hacking countermeasures, then the model might need to be extremely confident that humans wouldn’t be able to do something in order to avoid exploring it.
It’s an interesting point, but where does the “extremely” come from? Seems like if it thinks there’s a 5% chance humans explored X, but (if not, then) exploring X would force it to give up its current values, it could be a very worthwhile gamble. Maybe I’m unclear on the rules of the game as you’re imagining them.
There are a lot of replies here, so I’m not sure whether someone already mentioned this, but: I have heard anecdotally that homosexual men often have relationships which maintain the level of sex over the long term, while homosexual women often have long-term relationships which very gradually decline in frequency of sex, with barely any sex after many decades have passed (but still happily in a relationship).
This mainly argues against your model here:
It suggests instead that female sex drive naturally falls off in long-term relationships in a way that male sex drive doesn’t, with sexual attraction to a partner being a smaller factor.