Related: It’s disheartening to recognize, but it seems the ML community might not even get past the first crucial step in reducing risks, which is understanding them. We appear to live in a world where most people, including key decision-makers, still don’t grasp the gravity of the situation. For instance, in France, we still hear influential figures like Arthur Mensch, CEO of Mistral, saying things like, “When you write this kind of software, you always control what’s going to happen, all the outputs the software can have.” As long as such individuals are leading AGI labs, the situation will remain quite dire.
+1 for the conflationary alliances point. It is especially frustrating when I hear junior people interchange “AI Safety” and “AI Alignment.” These are two completely different concepts, and one can exist without the other. (The fact that the main forum for AI Safety is the “Alignment Forum” does not help with this confusion). I’m not convinced the goal of the AI Safety community should be to align AIs at this point.
However, I want to make a small amendment to Myth 1: I believe that technical work which enhances safety culture is generally very positive. Examples of such work include scary demos like “BadLlama,” which I cite at least once a week, and benchmarks such as “Evaluating Frontier Models for Dangerous Capabilities,” which tracks particularly concerning capabilities. More “technical” works like these seem overwhelmingly positive, and I think we need more competent people doing this kind of work.
It is especially frustrating when I hear junior people interchange “AI Safety” and “AI Alignment.” These are two completely different concepts, and one can exist without the other. (The fact that the main forum for AI Safety is the “Alignment Forum” does not help with this confusion)
One issue is that there’s also a difference between “AI X-Safety” and “AI Safety”. It’s very natural for people working on all kinds of safety from and with AI systems to call their field “AI safety”, so it seems a bit doomed to try to have that term refer to x-safety.
I believe that technical work which enhances safety culture is generally very positive.
All of the examples you mentioned share one critical non-technical aspect, though: their results are publicly available (and, I would guess, they were funded by the general public; in the case of “BadLlama”, by donations and grants to Palisade Research, a foundation, and IIIT, an Indian national institute). If the very same “technical” research were available only to a potentially shady private company, that technical information could indeed help them circumvent Llama’s safeguards. At that point, I’m not sure one could still confidently call it “overwhelmingly positive”.
I agree that the works you mentioned are very positive, but I think this non-technical aspect needs to be taken into consideration.
Strongly agree.