Thank you for this detailed feedback. I’ll go through the rest of your comments/questions in additional comment replies. To start:
What kinds of work do you want to see? Common legal tasks include contract review, legal judgment prediction, and passing questions on the bar exam, but those aren’t necessarily the most important tasks. Could you propose a benchmark for the field of Legal AI that would help align AGI?
Progress in AI capabilities research is driven, in large part, by shared benchmarks that thousands of researchers around the world use to guide their experiments, to understand as a community whether particular model and data advances are improving AI capabilities, and to compare results across research groups. We should aim for the same dynamic in Legal AI understanding. Optimizing against benchmarks is one of the primary “objective functions” of the overall global AI capabilities research apparatus.
But, as quantitative lodestars, benchmarks also create perverse incentives to build AI systems that optimize for benchmark performance at the expense of true generalization and intelligence (Goodhart’s Law). Many AI benchmark datasets contain a significant number of labeling errors, which suggests that, in some cases, machine learning models have failed to learn generalizable skills and abstract concepts to a greater degree than is widely recognized. Benchmark data also contain spurious cues that, once removed, significantly reduce model performance, demonstrating that models are often learning patterns that do not generalize outside the closed world of the benchmark data. Many benchmarks, especially in natural language processing, have become saturated not because the models are super-human but because the benchmarks do not truly assess the models’ ability to operate in real-world scenarios. None of this is to deny that AI capabilities have advanced enormously over the past 10 years (and especially since 2017). The point is just that benchmarking AI capabilities is difficult.
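To make the spurious-cue point concrete, below is a minimal, hypothetical sketch of one diagnostic researchers use to expose the problem: train a “cue-only” baseline that sees only part of each input (for example, only the hypothesis in an entailment-style task) and check whether it still beats chance. The dataset format and function name here are my own illustrative assumptions, not taken from any particular benchmark or paper cited above.

```python
# Minimal sketch (hypothetical data format): estimate how much of a benchmark's
# signal is carried by surface cues by training a classifier that never sees
# the part of the input the task is nominally about.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def cue_only_baseline(examples):
    """examples: list of (context, hypothesis, label) triples.

    Trains a classifier on the hypothesis alone. If accuracy is far above
    chance, labels leak through surface cues, and a high model score on the
    full benchmark may not reflect genuine reasoning.
    """
    hypotheses = [h for _, h, _ in examples]
    labels = [y for _, _, y in examples]
    X_train, X_test, y_train, y_test = train_test_split(
        hypotheses, labels, test_size=0.2, random_state=0)

    vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vectorizer.fit_transform(X_train), y_train)

    preds = clf.predict(vectorizer.transform(X_test))
    return accuracy_score(y_test, preds)
```

If a baseline like this scores well above chance, strong model performance on the full benchmark tells you less than it appears to about generalizable skill.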
Benchmarking AI alignment likely has the same issues, compounded by significantly vaguer problem definitions, and there is far less research on AI alignment benchmarks. Performing well on societal alignment is more difficult than performing well on task capabilities, and because alignment is so fundamentally hard, the sky should be the limit on the difficulty of alignment benchmarks. Legal-informatics-based benchmarks could serve as AI alignment benchmarks for the research community.

Current machine learning models perform poorly on legal understanding tasks such as statutory reasoning (Nils Holzenberger, Andrew Blair-Stanek & Benjamin Van Durme, A Dataset for Statutory Reasoning in Tax Law Entailment and Question Answering (2020); Nils Holzenberger & Benjamin Van Durme, Factoring Statutory Reasoning as Language Understanding Challenges (2021)), professional law (Dan Hendrycks et al., Measuring Massive Multitask Language Understanding, arXiv:2009.03300 (2020)), and legal discovery (Eugene Yang et al., Goldilocks: Just-Right Tuning of BERT for Technology-Assisted Review, in Advances in Information Retrieval: 44th European Conference on IR Research 502–517 (2022)), and there is significant room for improvement on legal language processing tasks more broadly (Ilias Chalkidis et al., LexGLUE: A Benchmark Dataset for Legal Language Understanding in English, in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (2022); D. Jain, M.D. Borah & A. Biswas, Summarization of legal documents: where are we now and the way forward, Comput. Sci. Rev. 40, 100388 (2021)). One example of a task that could form part of such an alignment benchmark suite is law search (Faraz Dadgostari et al., Modeling Law Search as Prediction, A.I. & L. 29.1, 3–34 (2021) at 3 (“In any given matter, before legal reasoning can take place, the reasoning agent must first engage in a task of ‘law search’ to identify the legal knowledge—cases, statutes, or regulations—that bear on the questions being addressed.”); Michael A. Livermore & Daniel N. Rockmore, The Law Search Turing Test, in Law as Data: Computation, Text, and the Future of Legal Analysis 443–452 (2019); Michael A. Livermore et al., Law Search in the Age of the Algorithm, Mich. St. L. Rev. 1183 (2020)).
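As a rough illustration of how a law-search benchmark could be scored, here is a sketch of a recall-at-k evaluation over a hypothetical gold-standard mapping from queries (e.g., fact patterns) to the authorities a human lawyer judged relevant. The gold data, the `search_fn` interface, and the metric choice are my own assumptions for illustration; they are not the evaluation protocol of the papers cited above.

```python
# Minimal sketch of scoring a law-search system against a hypothetical gold
# standard: query (fact pattern) -> set of relevant authorities (cases,
# statutes, regulations). `search_fn` is any candidate system that maps a
# query string to a ranked list of citation identifiers.
from typing import Callable, Dict, List, Set

def recall_at_k(search_fn: Callable[[str], List[str]],
                gold: Dict[str, Set[str]],
                k: int = 10) -> float:
    """Average fraction of the relevant authorities retrieved in the top k."""
    scores = []
    for query, relevant in gold.items():
        if not relevant:
            continue
        retrieved = set(search_fn(query)[:k])
        scores.append(len(retrieved & relevant) / len(relevant))
    return sum(scores) / len(scores) if scores else 0.0

# Toy usage (illustrative identifiers only):
gold = {
    "May a landlord withhold a security deposit for normal wear?":
        {"state_code_250_512", "smith_v_jones_1984"},
}
toy_search = lambda q: ["smith_v_jones_1984", "unrelated_case_2001"]
print(recall_at_k(toy_search, gold, k=10))  # -> 0.5
```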
We have just received a couple of small grants specifically to begin building additional legal understanding benchmarks for LLMs, starting with legal standards. I will share more on this shortly and would invite anyone interested in partnering on this to reach out!