For what it’s worth, I really dislike this terminology. Of course saying “I want X” is normative, and of course it’s based on empirical beliefs.
Why is the distinction between training and deployment important? Most methods of training involve the AI system acting. Are you hoping that the training process (e.g. gradient descent) leads to safety?
I’m imagining that during training, your ML system doesn’t control actuators which would allow it to pose an existential risk or other catastrophe (e.g. a computer screen watched by a human, the ability to send things over the internet). Basically, I want the zero-shot analysis to be done before the AI system can cause catastrophe, which in this piece I’m conflating with the training phase, although I guess they’re not identical.
I certainly hope that the training process of an advanced AI system leads to safety, but I’m not assuming that in this piece, as per the background beliefs.
Presumably many forms of interpretability techniques involve computing specific outputs of the neural net in order to understand them. Why doesn’t this count as “running” the neural net?
It counts if the neural network’s outputs are related to actuators that can plausibly cause existential risk or other catastrophe. As such, I think these forms of interpretability techniques are suspect, but could be fine (e.g. if you could somehow construct a sandbox environment to test your neural network where the network’s sandboxed behaviour was informative about whether the network would cause catastrophe in the real world). That being said, I’m confused by this question, because I don’t think I claimed in the piece that typical interpretability techniques were useful.
What about training schemes in which the agent gradually becomes more and more exposed to the real world? Where is “deployment” then?
I am basically abstracting away from the problem of figuring out when your neural network has access to actuators that can pose existential risk or other catastrophe, and hope somebody else solves this. I’d hope that in the training schemes you describe, you can determine that the agent won’t cause catastrophe before its first exposure to the real world; otherwise, such a scheme seems irresponsible for systems that could cause catastrophes.
For what it’s worth, I really dislike this terminology. Of course saying “I want X” is normative, and of course it’s based on empirical beliefs.
Here are two claims:
“If I were in charge of the world, I would ensure that no powerful AI system were deployed unless we had mechanistic transparency into that system, because anything short of that is an unacceptable level of risk”
“I think that we should push for mechanistic transparency, because by doing so we will cause developers not to deploy dangerous AI systems, because they will use mechanistic transparency techniques to identify when the AI system is dangerous”
There is an axis on which these two claims differ, where I would say the first one is normative and the second one is empirical. The phrase “perfect is the enemy of good” is also talking about this axis. What would you name that axis?
In any case, probably at this point you know what I mean. I would like to see more argumentation for the second kind of claim, and am trying to say that arguments for the first kind of claim are not likely to sway me.
Re: clarification of desideratum, that makes sense.
Re: the two claims, that’s different from what I thought you meant by the distinction. I would describe both dot points as being normative claims buttressed by empirical claims. To the extent that I see a difference, it’s that the first dot point is perhaps addressing low-probability risks, while the second is addressing medium-to-high-probability risks. I think that pushing for mechanistic transparency would address medium-to-high-probability risks, but don’t argue for that here, since I think the arguments for medium-to-high-probability risk from AI are better made elsewhere.
Hmm, I was more pointing at the distinction where the first claim doesn’t need to argue for the subclaim “we will be able to get people to use mechanistic transparency” (it’s assumed away by “if I were in charge of the world”), while the second claim does have to argue for it.
I am mostly interested in allowing the developers of AI systems to determine whether their system has the cognitive ability to cause human extinction, and whether their system might try to cause human extinction.
The way I read this, if the research community enables the developers to determine these things at prohibitive cost, then we mostly haven’t “allowed” them to do it, but if the cost is manageable then we have. So I’d say my desiderata here (and also in my head) include the cost being manageable. If the cost of any such approach were necessarily prohibitive, I wouldn’t be very excited about it.
Background desiderata