Talk through the grapevine:
Safety is implemented in a highly idiotic way at non-frontier but well-funded labs (and possibly at frontier ones too?).
Think raising a firestorm over the 10th-leading mini LLM being potentially jailbroken.
The effect is that employees get mildly disillusioned with safety-ism, and it gets seen as unserious. There should have been a hard distinction between existential risks and standard corporate censorship. “Notkilleveryoneism” is simply too ridiculous-sounding to spread. But maybe memetic selection pressures make it impossible for the irrelevant version of safety not to dominate.
My sense is you can combat this, but a lot of the reason this equivocation sticks is that x-risk safety people are actively trying to equivocate between these things, because doing so gets them political capital with the left, which is generally anti-tech.
Some examples (not getting links for all of these because it’s too much work, but can get them if anyone is particularly interested):
CSER trying to argue that near-term AI harms are the same as long-term AI harms
AI Safety Fundamentals listing like a dozen random leftist “AI issues” in their article on risks from AI before going into any takeover stuff
The executive order on AI being largely about discrimination and AI bias, mostly equivocating between catastrophic and random near-term harms
Safety people at OpenAI equivocating between brand-safety and existential-safety because that got them more influence within the organization
In some sense one might boil this down to memetic selection pressure, but I think the causal history of this equivocation is more dependent on the choices of a relatively small set of people.
Definitely agree there’s some power-seeking equivocation going on, but I wanted to offer a less sinister explanation from my experiences in AI research contexts. It seems that a lot of equivocation and blurring of boundaries comes from people trying to work on concrete problems and obtain empirical information. A thought process like:
* alignment seems maybe important?
* ok what experiment can I set up that lets me test some hypotheses
* can’t really test the long-term harms directly, let me test an analogue in a toy environment or on a small model, publish results
* when talking about the experiments, I’ll often motivate them by talking about long-term harm
Not too different from how research psychologists will start out trying to understand the Nature of Mind and then run an n=20 study on undergrads because that’s what they had budget for. We can argue about how bad this equivocation is for academic research, but it’s a pretty universal pattern and well understood within academic communities.
The unusual thing in AI is that researchers have most of the decision-making power in key organizations, so these research norms leak out into the business world, and no one bats an eye at a “long-term safety research” team that mostly works on toy and short-term problems.
This is one reason I’m more excited about building up “AI security” as a field and hiring infosec people instead of ML PhDs. My sense is that the infosec community actually has good norms for thinking about and working on things-shaped-like-existential-risks, and the AI x-risk community should inherit those norms, not the norms of academic AI research.
Amusingly, this post from yesterday praising BlueDot Impact for this was right below this one on my feed.
Yeah, to be clear, these are correlated. I looked into the content based on seeing the ad yesterday (and also sent over a complaint to the BlueDot people).
It’s not a coincidence that they’re seen as the same thing, because in the current environment, they are the same thing, and relatively explicitly so by those proposing safety & security measures to the labs. Claude will refuse to tell you a sexy story (unless they get to know you), and refuse to tell you how to make a plague (again, unless they get to know you, though you need to build more trust with them for the plague than for the sexy story), and cite the same justification for both.
Likely Anthropic uses very similar techniques to get such refusals to occur, and very similar teams.
Ditto with Llama, Gemini, and ChatGPT.
Before assuming meta-level word-association dynamics, I think it’s useful to look at the object level. There is in fact a very close relationship between those working on AI safety and those working on corporate censorship, and if you want to convince people who hate corporate censorship that they should not hate AI safety, I think you’re going to need to convince the AI safety people to stop doing corporate censorship, or that the tradeoff currently being made is a positive one.
Edit: Perhaps some of this is wrong. See Habryka below.
My sense is that the people working on “trust and safety” at labs are not actually the same people who work on safety. Like, it is true that RLHF was developed by some x-risk-oriented safety teams, but the actual detailed censorship work tends to be done by different people.
I’d imagine you know better than I do, and GDM’s recent summary of their alignment work seems to largely confirm what you’re saying.
I’d still guess that to the extent practical results have come out of the alignment teams’ work, it’s mostly been immediately used for corporate censorship (even if it’s passed to a different team).
I do think this is probably true for RLHF and RLAIF, but not true for all the mechanistic interp work that people are doing (though it’s arguable whether those are “practical results”). I also think it isn’t true for the debate-type work. Or the model organism work.
I think mech interp, debate and model organism work are notable for currently having no practical applications lol (I am keen to change this for mech interp!)
There are depths of non-practicality greatly beyond mech interp, debate and model organism work. I know of many people who would consider that work on the highly practical side of AI Safety work :P
None of those seem all that practical to me, except for the mechanistic interpretability SAE clamping, and I do actually expect that to be used for corporate censorship after all the kinks have been worked out of it.
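For concreteness, here is a minimal sketch of the kind of “SAE clamping” being referred to, in the spirit of the published feature-steering demos: intercept an activation, pin one learned feature at a fixed value, and let generation continue. Everything named below (the weights, the feature index, the hook location) is a hypothetical placeholder, not any lab’s actual setup.

```python
import torch

# Minimal sketch of SAE feature "clamping" on a residual-stream activation:
# encode with a sparse autoencoder, pin one feature to a fixed value, decode
# back, and substitute the result into the forward pass. All weights, the
# feature index, and the hook location below are hypothetical placeholders.

d_model, d_sae = 768, 16384
W_enc = torch.randn(d_model, d_sae)   # stand-in SAE encoder weights
b_enc = torch.zeros(d_sae)
W_dec = torch.randn(d_sae, d_model)   # stand-in SAE decoder weights
b_dec = torch.zeros(d_model)

FEATURE_IDX = 1234    # hypothetical feature (e.g. one that mediates refusals)
CLAMP_VALUE = 10.0    # activation level to pin it at

def clamp_hook(module, inputs, output):
    # output: residual-stream activations, shape (batch, seq, d_model)
    feats = torch.relu(output @ W_enc + b_enc)   # SAE encode
    feats[..., FEATURE_IDX] = CLAMP_VALUE        # clamp the chosen feature
    return feats @ W_dec + b_dec                 # SAE decode; replaces the output
    # (real setups also add back the SAE's reconstruction error rather than
    # replacing the activation wholesale)

# Usage, assuming a PyTorch transformer whose chosen layer returns a plain tensor:
# handle = model.layers[10].register_forward_hook(clamp_hook)
# ...generate...
# handle.remove()
```

The mechanism is indifferent to whether the clamped feature tracks genuinely dangerous content or merely off-brand content, which is part of why it slots so easily into both the safety and the censorship toolkits.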
If the current crop of model organisms research has any practical applications, I expect it to be used to reduce jailbreaks, like in adversarial robustness, which is definitely highly correlated with both safety and corporate censorship.
Debate is less clear, but I also don’t really expect practical results from that line of work.
Yeah, this seems obviously true to me, and exactly how it should be.
Yeah, many of the issues are the same:
* RLHF can be jailbroken with prompts, so you can get the model to tell you a sexy story or a recipe for methamphetamine. If we ever get to a point where LLMs know truly dangerous things, they’ll tell you those, too.
* Open-source weights are fundamentally insecure, because you can fine-tune out the guardrails. Sexy stories, meth, or whatever.
The good thing about the War on Horny:
* It probably doesn’t really matter, so not much harm is done when people get LLMs to write porn.
* Turns out, lots of people want to read porn (surprise! who would have guessed?), so there are lots of attackers trying to bypass the guardrails.
* This gives us good advance warning that the guardrails are worthless.
People seeing AI-generated boobs is an x-risk. God might get angry and send another flood.
More seriously, is this worse than the usual IT security? The average corporate firewall blocks porn, online games, and hate speech, even though that has nothing to do with security per se (i.e., computers getting hacked or sensitive data stolen). Also, many security rules get adopted not because they make sense, but because “other companies do it, so if we don’t, we might get in trouble for not following industry best practices” and “someone from the security department got drunk and proposed it, but if something bad happens one day, I don’t want my name on the record as the manager who opposed a security measure recommended by a security expert”.