One additional maxim to consider is that the AI community in general can only barely conceptualize and operationalize difficult concepts, such as safety. Historically, the AI community has been good at maximizing some measure of performance, usually fairly straightforward test-set metrics such as classification accuracy. Culturally, this is how the community approaches most problems: by collapsing a complex phenomenon into a single number. Note that few fields outside of AI and math work this way, because the aggregation always requires lossy simplifications.
We can observe this malpractice in AI safety as well. There is a cottage industry of datasets and papers collecting “safety” samples, which are then used to compute some safety metric. We can compare the numbers across models, and this makes AI folks happy. But there is barely any discussion of how representative these datasets really are of real-life risks, how comprehensive the data collection process is, or how sound it is to have random crowd-sourced workers or LLMs generate such samples. The threats and risks themselves are also rarely described in any detail; often it is just a lot of hand-waving.
Based on my fairly deep experience with one aspect of AI safety (societal biases), I have very little confidence in our ability to understand AI behavior. Compared to measuring performance on well-defined NLP tasks, once societal context is involved, the intricacies of what we are trying to measure go far beyond what simple benchmarks can capture. Note that entire fields are devoted to understanding some of these problems in human societies, yet we are asked to believe that collecting a test set of a few thousand samples is enough to understand how AI behaves.