> since there’s no obvious reason why they’d be biased in a particular direction
No, I’m saying there are obvious reasons why we’d be biased toward truth-telling. I mentioned “spread truth about AI risk” earlier, but more generally, one of our main goals is to get our map to match the territory as a collaborative community project. Lying makes that harder.
Besides sabotaging the community’s map, lying is dangerous to your own map too. As OP notes, to really lie effectively, you have to believe the lie. Well is it said, “If you once tell a lie, the truth is ever after your enemy.”
But to answer your question: no, it’s not wrong to do consequentialist analysis of lying. Again, I’m not a Kantian — tell the guy who’s here to randomly murder you whatever lie you need to survive. But in less thought-experimenty cases, I think lying has a lot of long-term consequences that would be tough to measure.
I’m not sure that a TAS counts as “AI,” since they’re usually compiled by humans, but the “PokeBotBad” you linked is interesting — I hadn’t heard of it before. It’s an Any% Glitchless speedrun bot that ran until ~2017 and managed a solid 1:48:27 time on 2/25/17, better than the human world record until 2/12/18. Still, I’d say this is more a programmed “bot” than an AI in the sense we care about.
Anyway, you’re right that the whole reason the Pokémon benchmark exists is that it’s interesting to see how well an untrained LLM can do at playing it.