Is there a case for AI gain-of-function research?
(Epistemic Status: I don’t endorse this yet; I’m just thinking aloud. Please let me know if you want to act on or research this idea.)
It seems like it should be possible to materialize certain hypothesized AI alignment failure modes with today’s deep learning algorithms, if we directly optimize for their discovery; for example, by training a Gradient Hacker Enzyme.
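To make the proposal concrete, here is a minimal sketch (not a claim about how such an experiment should actually be run) of what "directly optimizing for a gradient hacker" could look like in a toy PyTorch setup: an outer loop searches for an "enzyme" module whose weights receive near-zero gradients from the host's task loss (so ordinary SGD would leave it intact), while the enzyme still steers the host's output toward a separate "mesa" objective. Every module name, loss, and hyperparameter below is an illustrative assumption.

```python
# Hedged sketch: search for an "enzyme" module that (a) the host task loss
# assigns near-zero gradients (so it would survive SGD on the host objective)
# and (b) still pushes the host's output toward its own "mesa" target.
import torch
import torch.nn as nn

torch.manual_seed(0)

host_in = nn.Linear(8, 16)        # host layers surrounding the enzyme
host_out = nn.Linear(16, 1)
enzyme = nn.Linear(16, 16)        # the candidate gradient-hacker module

# Freeze the surrounding host layers; only the enzyme is being searched over.
for p in list(host_in.parameters()) + list(host_out.parameters()):
    p.requires_grad_(False)

x = torch.randn(64, 8)
y_task = torch.randn(64, 1)       # the host's ostensible training target
y_mesa = torch.ones(64, 1)        # the enzyme's own preferred output

meta_opt = torch.optim.Adam(enzyme.parameters(), lr=1e-2)

for step in range(500):
    h = torch.relu(host_in(x))
    pred = host_out(torch.relu(enzyme(h)))

    task_loss = nn.functional.mse_loss(pred, y_task)
    mesa_loss = nn.functional.mse_loss(pred, y_mesa)

    # The gradient the host's optimizer would apply to the enzyme's weights.
    grads = torch.autograd.grad(task_loss, list(enzyme.parameters()),
                                create_graph=True)
    self_protection = sum(g.pow(2).sum() for g in grads)

    # Outer objective: be (nearly) invisible to SGD on the task loss,
    # while still steering the host's output toward the mesa target.
    meta_loss = self_protection + mesa_loss
    meta_opt.zero_grad()
    meta_loss.backward()
    meta_opt.step()

    if step % 100 == 0:
        print(f"step {step:3d}  |grad|^2={self_protection.item():.4f}  "
              f"mesa={mesa_loss.item():.4f}  task={task_loss.item():.4f}")
```

Whether anything found this way survives a realistic training run, rather than this toy frozen-host setup, is exactly the kind of empirical evidence the proposal is after.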
A possible benefit of this would be that it gives us bits of evidence about how such hypothesized risks would actually manifest in real training environments. The similarities would be limited, since the training setups would be explicitly optimizing for these failure modes to appear, but it should at least serve as a lower bound for the scenarios in which these risks could manifest.
Perhaps having a concrete bound on when dangerous capabilities appear (e.g., an X-parameter model trained on modality Y has a Z chance of forming a gradient hacker) would make it easier for policy folks to push for regulations.
Is AI gain-of-function research as dangerous as biotech gain-of-function research? Some arguments in favor (of the former being dangerous):
The malicious actor argument is probably stronger for AI gain-of-function.
If someone publicly releases a Gradient Hacker Enzyme, this lowers the resources a malicious actor would need to develop a misaligned AI (e.g., by plugging the misaligned Enzyme into an otherwise benign, low-capability training run).
The risky-researcher incentive is equally strong.
E.g., a research lab carelessly pursuing gain-of-function research, deliberately starting risky training runs for financial or academic incentives, etc.
Some arguments against:
Accident risks from financial incentives are probably weaker for AI gain-of-function.
The standard gain-of-function risk scenario is: a research lab engineers a dangerous pathogen, the pathogen accidentally leaks, and a pandemic follows.
I don’t see how the analogous events would happen “accidentally” with AI programs; e.g., a researcher would have to deliberately cut out parts of the network’s weights and replace them with the enzyme, which is certainly an intentional act (see the sketch below).
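As a hedged illustration of that point (and of the "plug in a released Enzyme" scenario from the arguments in favor), here is what the surgery would look like in PyTorch. The layer sizes and the checkpoint name are made-up assumptions, not anyone's actual setup.

```python
# Hypothetical illustration: splicing a released "enzyme" into a host network
# is a series of deliberate steps, not something that happens by accident.
import torch.nn as nn

host = nn.Sequential(
    nn.Linear(8, 16), nn.ReLU(),
    nn.Linear(16, 16), nn.ReLU(),   # the block the researcher chooses to cut out
    nn.Linear(16, 1),
)

enzyme = nn.Linear(16, 16)
# In the threat model above, the weights would come from a public release,
# e.g. enzyme.load_state_dict(torch.load("gradient_hacker_enzyme.pt"));
# left randomly initialized here so the snippet runs standalone.

host[2] = enzyme  # explicit surgery on the host's architecture and weights
```

Every step here is an explicit edit to the training code, which is the sense in which the "lab leak" analogy seems to break down.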