You’re making the assumption that the safety methods for cars are appropriate to transfer directly to e.g. LLMs. That’s not clearly true to me, as there are strong differences in the nature of cars vs. the nature of LLMs. For instance, the purposes and capacities of cars are known in great detail (driving people from place to place), whereas the purposes of LLMs are not known (we just noticed that they could do a lot of neat things and assumed someone would find a use-case for them) and their capabilities are much broader and less clear.
I would be concerned that your proposed safety method would become very prone to Goodharting.
I’m not saying that a standard is sufficient for safety, just that it’s incoherent to talk about safety if you don’t even have a clear idea of what would constitute unsafe.
Also, I wasn’t talking about cars in particular—every type of engineering, including software engineering, follows this type of procedure for verification and validation, when those are required. And I think metal rods are a better example to think about—we don’t know what it is going to be used for when it is made, but whatever application the rod will be used for, it needs to have some clear standards and requirements.
I’m not saying that a standard is sufficient for safety, just that it’s incoherent to talk about safety if you don’t even have a clear idea of what would constitute unsafe.
I can believe it makes it less definitive and less useful, but I don’t buy that it makes it “meaningless” and entirely “incoherent”. People can in fact recognize some types of unsafety, and adversarially try to trigger unsafety. I would think that the easier it is to turn GPT into some aggressive powerful thing, the more likely ARC would have been to catch it, so ARC’s failure to make GPT do dangerous stuff would seem to constitute Bayesian evidence that it is hard to make it do dangerous stuff.
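To make the Bayesian point concrete, here is a minimal sketch of the update, with made-up numbers (none of these probabilities are real estimates of GPT’s risk; they are purely illustrative):

```python
# Toy Bayes update: how much should a red team's failure to elicit dangerous
# behavior shift us toward "it is hard to make the model do dangerous stuff"?
# All numbers below are made up for illustration.

p_easy = 0.5                 # prior: eliciting dangerous behavior is easy
p_null_given_easy = 0.3      # red team finds nothing even though it is easy
p_null_given_hard = 0.9      # red team finds nothing because it is genuinely hard

p_null = p_null_given_easy * p_easy + p_null_given_hard * (1 - p_easy)
p_easy_given_null = p_null_given_easy * p_easy / p_null

print(f"P(easy to misuse | red team found nothing) = {p_easy_given_null:.2f}")  # 0.25
```

On these illustrative numbers the posterior drops from 0.5 to 0.25: a null result is evidence, even though it is nowhere near proof.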
Also, I wasn’t talking about cars in particular—every type of engineering, including software engineering, follows this type of procedure for verification and validation, when those are required. And I think metal rods are a better example to think about—we don’t know what it is going to be used for when it is made, but whatever application the rod will be used for, it needs to have some clear standards and requirements.
AFAIK rods are a sufficiently simple artifact that almost all of their behavior can be described using very little information, unlike cars and GPTs?
For the first point, if “people can in fact recognize some types of unsafety,” then it’s not the case that “you don’t even have a clear idea of what would constitute unsafe.” And as I said in another comment, I think this is really an argument about standards; a standard is a necessity in practice for companies that want to release systems, but it isn’t what makes the central point of the post (its title) true.
And I agree that rods are often simple; the reason I chose rods as an example is that people have an intuitive understanding of some of the characteristics you care about. But the same conceptual model applies to cars, where there are tons of specific safety tests with clearly defined standards, despite the fact that their behavior can be very, very complex.
For the first point, if “people can in fact recognize some types of unsafety,” then it’s not the case that “you don’t even have a clear idea of what would constitute unsafe.” And as I said in another comment, I think this is really an argument about standards; a standard is a necessity in practice for companies that want to release systems, but it isn’t what makes the central point of the post (its title) true.
Maybe I am misunderstanding what you mean by “have a clear idea of what would constitute unsafe”?
Taking rods as an example, my understanding is that rods might be used to support some massive objects, and if the rods bend under the load then they might release the objects and cause harm. So the rods need to be strong enough to support the objects, and usually rods are sold with strength guarantees to achieve this.
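As a concrete illustration of what that kind of guarantee buys you, here is a minimal sketch of the pass/fail check a strength rating supports (the numbers and the flat safety factor are my own, purely illustrative, assumptions):

```python
# Toy check of a rod against its rated load: the kind of crisp criterion a
# strength guarantee gives you. Numbers and safety factor are illustrative only.

def rod_is_safe(applied_load_n: float, rated_capacity_n: float,
                safety_factor: float = 2.0) -> bool:
    """Safe iff the applied load stays under the rated capacity divided by a margin."""
    return applied_load_n <= rated_capacity_n / safety_factor

# A rod rated for 10 kN carrying 4 kN passes; the same rod carrying 6 kN does not.
print(rod_is_safe(applied_load_n=4_000, rated_capacity_n=10_000))  # True
print(rod_is_safe(applied_load_n=6_000, rated_capacity_n=10_000))  # False
```

The point of the sketch is just that both the trigger (a specific load) and the failure condition (exceeding the rated capacity) are written down in advance.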
“If it would fail under this specific load, then it is unsafe” is a clear idea of what would constitute unsafe. I don’t think we have this clear of an idea for AI. We have some vague ideas of things that would be undesirable, but there tends to be a wide range of potential triggers and a wide range of potential outcomes, which seem more easily handled by some sort of adversarial setup than by writing down a clean logical description. But maybe when you say “clear idea”, you don’t necessarily mean a clean logical description, and also consider more vague descriptions to be relevant?
And I agree that rods are often simple; the reason I chose rods as an example is that people have an intuitive understanding of some of the characteristics you care about. But the same conceptual model applies to cars, where there are tons of specific safety tests with clearly defined standards, despite the fact that their behavior can be very, very complex.
I already addressed cars and you said we should talk about rods. Then I addressed rods and you want to switch back to cars. Can you make up your mind?
“If it would fail under this specific load, then it is unsafe” is a clear idea of what would constitute unsafe. I don’t think we have this clear of an idea for AI.
Agreed. And so until we do, we can’t claim they are safe.
But maybe when you say “clear idea”, you don’t necessarily mean a clean logical description, and also consider more vague descriptions to be relevant?
A vague description allows for a vague idea of safety. That’s still far better than what we have now, so I’d be happier with that than the status quo—but in fact, what people outside of AI safety seem to mean by “safe” is even less specific than having an idea about what could go wrong—it’s more often “I haven’t been convinced that it’s going to fail and hurt anyone.”
I already addressed cars and you said we should talk about rods. Then I addressed rods and you want to switch back to cars. Can you make up your mind?
Both are examples, but useful for illustrating different things. Cars are far more complex and less intuitive, but they still have clear safety standards for design.