This might blur the distinction between different kinds of evals. While it’s true that most evals are just about capabilities, some could be a net positive for LLM safety.
I’ve created 8 (soon to be 9) LLM evals (I’m not funded by anyone; they’re mostly a product of my own curiosity, not of capability, safety, or paper-publishing goals), and they serve as examples below. Improving models to score well on some of them is likely detrimental to AI safety:
https://github.com/lechmazur/step_game - to score better, LLMs must learn to deceive others and hold hidden intentions (see the sketch below)
https://github.com/lechmazur/deception/ - the disinformation effectiveness part of the benchmark
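To make the step_game point concrete, here is a toy sketch (Python, written for this comment) of why a step race with a collision rule rewards deception. The specific mechanics below - players secretly pick a step size of 1, 3, or 5, and identical picks cancel out - are my simplified assumptions, not code or exact rules from the repo.

```python
# Toy step race with a collision rule (assumed mechanics, not the repo's code):
# players talk publicly, then secretly pick a step size; players who pick
# the same number collide and gain nothing that round.

STEP_CHOICES = (1, 3, 5)  # assumed move set

def play_round(secret_picks: dict[str, int], totals: dict[str, int]) -> None:
    """Apply one round of secret picks; only uniquely chosen steps advance."""
    counts: dict[int, int] = {}
    for pick in secret_picks.values():
        counts[pick] = counts.get(pick, 0) + 1
    for player, pick in secret_picks.items():
        if counts[pick] == 1:  # a unique pick advances; a collision does not
            totals[player] += pick

totals = {"A": 0, "B": 0, "C": 0}
# A publicly "commits" to playing 5 but secretly plays 3; if B and C both
# react to the promised 5, they collide and only A advances.
play_round({"A": 3, "B": 5, "C": 5}, totals)
print(totals)  # {'A': 3, 'B': 0, 'C': 0}
```

An LLM that reliably says one thing in the public chat and does another in its secret move climbs this kind of leaderboard, which is exactly the behavior we wouldn’t want to reinforce.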
Some are likely somewhat negative because scoring better would enhance capabilities:
https://github.com/lechmazur/nyt-connections/
https://github.com/lechmazur/generalization
Others focus on capabilities that are probably not dangerous:
https://github.com/lechmazur/writing - creative writing
https://github.com/lechmazur/divergent - divergent thinking in writing
However, improving LLMs to score high on certain evals could be beneficial:
https://github.com/lechmazur/goods - teaching LLMs not to overvalue selfishness
https://github.com/lechmazur/deception/?tab=readme-ov-file#-disinformation-resistance-leaderboard - the disinformation resistance part of the benchmark
https://github.com/lechmazur/confabulations/ - reducing the tendency of LLMs to fabricate information (hallucinate); a toy version of such a check is sketched below
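As a heavily simplified illustration of what a confabulation-style eval measures, here is a toy check written for this comment. The setup - questions whose answers are deliberately absent from a provided text, with abstentions treated as correct - is my assumption about the general approach, not the actual harness from the confabulations repo.

```python
# Toy confabulation check (assumed setup, not the repo's actual harness):
# the model is asked questions whose answers are deliberately missing from
# the provided text; any answer that does not abstain counts as fabricated.

ABSTAIN_MARKERS = ("not stated", "cannot be determined", "no information")  # assumed phrasings

def is_confabulation(model_answer: str, answerable: bool) -> bool:
    """Flag non-abstaining answers to questions the source text cannot answer."""
    abstained = any(marker in model_answer.lower() for marker in ABSTAIN_MARKERS)
    return (not answerable) and (not abstained)

# Hypothetical graded responses: one answerable question, one unanswerable one.
graded = [
    is_confabulation("The meeting took place in Geneva.", answerable=True),   # grounded in the text
    is_confabulation("The CEO resigned in March 2021.", answerable=False),    # fabricated detail
]
print(f"confabulations flagged: {sum(graded)} of {len(graded)}")
```

Scoring higher on this kind of check means answering less often when the information isn’t there, which pushes toward more honest, better-calibrated models rather than merely more capable ones.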
I think it’s possible to do better than these by intentionally designing evals aimed at creating defensive AIs, and it might be better to keep such evals private and independent. Given the rapid growth of AI capabilities, the apparent lack of interest in an international treaty (as seen at the recent Paris AI summit), and the competitive race dynamics among companies and nations, specifically developing an AI to protect us from threats posed by other AIs, or by AIs working with humans, might be the best we can hope for.