On “Does OpenAI, or other AGI/ASI developers, have a plan to “Red Team” and protect their new ASI systems from similarly powerful systems?”
Well, we know that red teaming is one of their priorities right now: they have already formed a red-teaming network of domain experts (beyond their own researchers) to test the current systems, whereas previously they contacted people ad hoc every time they wanted to test a new model. That makes me believe they are aware of the x-risks (which, by the way, they highlighted on the blog, including CBRN threats). Also, from the superalignment blog, the mandate is:
> “to steer and control AI systems much smarter than us.”
So, either OAI will use the current Red-Teaming Network (RTN) or form a separate one dedicated to the superalignment team (not necessarily an agent).
On “How can they demonstrate that an aligned ASI is safe and resistant to attack, exploitation, takeover, and manipulation—not only from human “Bad Actors” but also from other AGI or ASI-scale systems?”
This is where new eval techniques will come in, since the current ones are mostly saturated, to be honest. With the Superalignment team in place, which I believe will have all the resources it needs (they have already been dedicated 20% of compute), this will be one of their key research areas.
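To make the “new eval techniques” point concrete, here is a minimal sketch in Python of what a harder-to-saturate, adversarially refreshed eval could look like. All names (`AttackCase`, `query_model`) are hypothetical placeholders, and this is only an assumption about the general shape of such evals, not anything OpenAI has published:

```python
# Minimal sketch of an adversarial-robustness eval. All names are
# hypothetical placeholders; this is not an actual OpenAI eval or API.
from dataclasses import dataclass


@dataclass
class AttackCase:
    prompt: str            # adversarial prompt contributed by a red teamer
    disallowed: list[str]  # substrings that should never appear in a reply


def query_model(prompt: str) -> str:
    """Placeholder for a call to the system under test."""
    raise NotImplementedError


def resistance_rate(cases: list[AttackCase]) -> float:
    """Fraction of attack prompts the model resists.

    Unlike a fixed Q&A benchmark, the case pool is meant to be refreshed
    as new attacks are found, so the metric keeps headroom instead of
    saturating.
    """
    resisted = 0
    for case in cases:
        reply = query_model(case.prompt).lower()
        if not any(bad.lower() in reply for bad in case.disallowed):
            resisted += 1
    return resisted / len(cases)
```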
On “If a “Super Red Teaming Agent” is too dangerous, can “Human Red Teams” comprehensively validate an ASI’s security? Are they enough to defend against superhuman ASIs? If not, how can companies like OpenAI ensure their infrastructure and ASIs aren’t vulnerable to attack?”
As human beings we will always try, but it won’t be enough; that’s why open source is key. Companies should engage in bug bounty programs like Bugcrowd. Glad to see OpenAI engaged in such through their trust portal and external auditing for things like malicious actors.
Also, it’s worth noting that OAI hires for a lot of cybersecurity roles (Security Engineer, etc.), which is very pertinent for the infrastructure.
Yes, good context, thank you!

> As human beings we will always try, but it won’t be enough; that’s why open source is key.
Open source for what? Code? Training data? Model weights? Either way, it does not seem like any of these are likely from “Open”AI.
> Well, we know that red teaming is one of their priorities right now: they have already formed a red-teaming network of domain experts (beyond their own researchers) to test the current systems, whereas previously they contacted people ad hoc every time they wanted to test a new model. That makes me believe they are aware of the x-risks (which, by the way, they highlighted on the blog, including CBRN threats). Also, from the superalignment blog, the mandate is:
> “to steer and control AI systems much smarter than us.”
> Companies should engage in bug bounty programs like Bugcrowd. Glad to see OpenAI engaged in such through their trust portal and external auditing for things like malicious actors.
> Also, it’s worth noting that OAI hires for a lot of cybersecurity roles (Security Engineer, etc.), which is very pertinent for the infrastructure.
Agreed that their RTN, Bugcrowd program, trust portal, etc. are all welcome additions. And they seem sufficient while their models, and others’, are sub-AGI with limited capabilities.
But your point about the rapidly evolving AI landscape is crucial. Will these efforts scale effectively with the size and features of future models and capabilities? Will they be able to scale to the levels needed to defend against other ASI-level models?
> So, either OAI will use the current Red-Teaming Network (RTN) or form a separate one dedicated to the superalignment team (not necessarily an agent).
It does seem like OpenAI acknowledges the limitations of a purely human approach to AI alignment research, hence their “superhuman AI alignment agent” concept. But it’s interesting that they don’t express the same need for a “superhuman-level agent” for red teaming, at least for the time being.
Is it consistent, or even logical, to assume that, while human-run AI alignment teams are insufficient to align an ASI model, human-run “Red Teams” will be able to successfully validate that an ASI is not vulnerable to attack or compromise from a large-scale AGI network or a “less-aligned” ASI system? Probably not...
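To illustrate what a “Super Red Teaming Agent” might mean in practice, here is a rough sketch of an automated attacker/target/judge loop. Every function name below is hypothetical; this describes the general idea of automated red teaming under my own assumptions, not OpenAI’s actual plans or tooling:

```python
# Rough sketch of an automated red-teaming loop (attacker -> target -> judge).
# Every function below is a hypothetical placeholder, not OpenAI tooling.

def generate_attack(history: list[str]) -> str:
    """Attacker model proposes a new adversarial prompt, conditioned on
    which earlier attempts failed."""
    raise NotImplementedError


def target_respond(prompt: str) -> str:
    """The system under evaluation (the would-be ASI) answers the prompt."""
    raise NotImplementedError


def judge_is_violation(prompt: str, reply: str) -> bool:
    """A separate judge model or human reviewer decides whether the reply
    violates the target's safety policy."""
    raise NotImplementedError


def red_team_loop(rounds: int = 100) -> list[tuple[str, str]]:
    """Collect the (prompt, reply) pairs that broke policy.

    The key property: the attacker can improve every round, which is
    exactly the kind of scaling a human-only red team struggles to match
    against a superhuman target.
    """
    failures: list[tuple[str, str]] = []
    history: list[str] = []
    for _ in range(rounds):
        prompt = generate_attack(history)
        reply = target_respond(prompt)
        if judge_is_violation(prompt, reply):
            failures.append((prompt, reply))
        history.append(prompt)
    return failures
```

Whether building such an attacker is itself too dangerous is, of course, exactly the open question raised above.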