...the sort of principles you’d build into a Bounded Thing meant to carry out some single task or task-class and not destroy the world by doing it.
Here’s one straightforward such principle: minimal transfer to unrelated tasks / task-classes. If you’ve somehow figured out how to do a pivotal act with a theorem proving AI, and you’re training a theorem proving AI, then that AI should not also be able to learn to model human behavior, predict biological interactions, etc.
One way to evaluate this quantity: have many small datasets of transfer tasks, each containing training data related to dangerous capabilities that you’d not want the AI to acquire. Intermittently during the AI’s training, switch out the AI’s usual training data for the transfer tasks and watch how quickly the AI’s loss on the transfer tasks decreases. The faster its loss decreases (or if it starts off low in the first place), the better the AI is at generalizing to dangerous domains, and the worse you’ve done by this metric.
Obviously, you’d then revert the weights back to the state before you’d run the tests. In fact, you should probably run the tests on entirely different and isolated hardware than what you primarily use to train the AI, so as to prevent data leakage from the “dangerous capabilities” dataset.
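Here’s a minimal sketch of what that probe loop might look like, assuming a PyTorch-style setup. The model, the dangerous-capability probe datasets, the loss function, and the summary statistics are all hypothetical placeholders, not a definitive implementation of the metric.

```python
# Sketch of the transfer-probe metric: fine-tune a throwaway copy of the model
# on each "dangerous capabilities" dataset and record how fast its loss falls.
# Faster drops (or a low starting loss) mean more transfer, i.e. a worse score.
import copy
import torch

def probe_transfer(model, probe_loaders, loss_fn, optimizer_cls,
                   probe_steps=100, lr=1e-4, device="cpu"):
    results = {}
    for name, loader in probe_loaders.items():
        # Work on a deep copy so the real training run is never touched
        # (and ideally run this whole probe on separate, isolated hardware).
        probe_model = copy.deepcopy(model).to(device)
        optimizer = optimizer_cls(probe_model.parameters(), lr=lr)

        losses = []
        data_iter = iter(loader)
        for _ in range(probe_steps):
            try:
                x, y = next(data_iter)
            except StopIteration:
                data_iter = iter(loader)
                x, y = next(data_iter)
            x, y = x.to(device), y.to(device)

            optimizer.zero_grad()
            loss = loss_fn(probe_model(x), y)
            loss.backward()
            optimizer.step()
            losses.append(loss.item())

        # Crude summary per probe dataset: where the loss started and how far it fell.
        results[name] = {
            "initial_loss": losses[0],
            "final_loss": losses[-1],
            "loss_drop": losses[0] - losses[-1],
        }
    return results
```

The copied model is discarded after each probe, which is the code-level analogue of reverting the weights; the hardware-isolation point above is an operational safeguard on top of that.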
To be clear, you wouldn’t directly train to minimize transfer. The idea is that you’ve figured out some theoretical advance in how to modify the training process or architecture in such a way as to control transfer learning without having to directly train to avoid it. The above is just a way to test if your approach has failed catastrophically.
Edit: not sure if the “minimal transfer principle” counts as “originally invented by Yudkowsky” for the purposes of this post. E.g., his point that “[we] can’t build a system that only has the capability to drive red cars and not blue cars” in the ruin post is clearly gesturing in this direction. I guess my addition is to generalize it as a principle and propose a metric.
I assume “If you’ve somehow figured out how to do a pivotal act” is intended to limit scope, but doesn’t that smuggle the hardness of the Hard Task™ out of the equation?
Every time I ask myself how this approach would address a given issue, I find myself having to defer to the definition of the pivotal act, which is the thing that’s been defined as out of scope.
You need at least a certain amount of transfer in order to actually do your pivotal act. An “AI” with literally zero transfer is just a lookup table. The point of this principle is that you want as little transfer as possible while still managing a pivotal act. I used a theorem proving AI as an example where it’s really easy to see what would count as unnecessary transfer. But even with something whose pivotal act would require a lot more transfer than a theorem prover (say, a nanosystem builder AI), you’d still want to avoid transfer to domains such as deceiving humans or training other AIs.