Maybe check this paper: DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning, https://arxiv.org/abs/2402.17453
Very cool, thanks! This paper focuses on building a DS agent, but I'd be interested to see a companion paper that focuses on building a benchmark instead: one that evaluates several existing agent architectures, benchmarks them against human performance, and leaves significant headroom for future models to improve on.