aog

Karma: 1,517

Adversarial Robustness Could Help Prevent Catastrophic Misuse

aogDec 11, 2023, 7:12 PM

30 points

18 comments9 min readLW link

aog Dec 5, 2023, 9:05 PM
22 points
9
in reply to: Bogdan Ionut Cirstea’s comment on: Deep Forgetting & Unlearning for Safely-Scoped LLMs
Unfortunately I don’t think academia will handle this by default. The current field of machine unlearning focuses on a narrow threat model where the goal is to eliminate the impact of individual training datapoints on the trained model. Here’s the NeurIPS 2023 Machine Unlearning Challenge task:
The challenge centers on the scenario in which an age predictor is built from face image data and, after training, a certain number of images must be forgotten to protect the privacy or rights of the individuals concerned.
But if hazardous knowledge can be pinpointed to individual training datapoints, perhaps you could simply remove those points from the dataset before training. The more difficult threat model involves removing hazardous knowledge that can be synthesized from many datapoints which are individually innocuous. For example, a model might learn to conduct cyberattacks or advise in the acquisition of biological weapons after being trained on textbooks about computer science and biology. It’s unclear the extent to which this kind of hazardous knowledge can be removed without harming standard capabilities, but most of the current field of machine unlearning is not even working on this more ambitious problem.

aog Nov 19, 2023, 12:21 AM
LW: 2 AF: 1
0
AF
on: Coup probes: Catching catastrophes with probes trained off-policy
This is a comment from Andy Zou, who led the RepE paper but doesn’t have a LW account:

“Yea I think it’s fair to say probes is a technique under rep reading which is under RepE (https://www.ai-transparency.org/). Though I did want to mention, in many settings, LAT is performing unsupervised learning with PCA and does not use any labels. And we find regular linear probing often does not generalize well and is ineffective for (causal) model control (e.g., details in section 5). So equating LAT to regular probing might be an oversimplification. How to best elicit the desired neural activity patterns requires careful design of 1) the experimental task and 2) locations to read the neural activity, which contribute to the success of LAT over regular probing (section 3.1.1).

In general, we have shown some promise of monitoring high-level representations for harmful (or catastrophic) intents/behaviors. It’s exciting to see follow-ups in this direction which demonstrate more fine-grained monitoring/control.”

aog Nov 18, 2023, 4:39 PM
2 points
0
in reply to: Fabien Roger’s comment on: Coup probes: Catching catastrophes with probes trained off-policy
Nice, that makes sense. I agree that RepE / LAT might not be helpful as terminology. “Unsupervised probing” is more straightforward and descriptive.

aog Nov 17, 2023, 7:24 PM
LW: 9 AF: 6
0
AF
on: Coup probes: Catching catastrophes with probes trained off-policy
What’s the relationship between this method and representation engineering? They seem quite similar, though maybe I’m missing something. You train a linear probe on a model’s activations at a particular layer in order to distinguish between normal forward passes and catastrophic ones where the model provides advice for theft.
Representation engineering asks models to generate both positive and negative examples of a particular kind of behavior. For example, the model would generate outputs with and without theft, or with and without general power-seeking. You’d collect the model’s activations from a particular layer during those forward passes, and then construct a linear model to distinguish between positives and negatives.
Both methods construct a linear probe to distinguish between positive and negative examples of catastrophic behavior. One difference is that your negatives are generic instruction following examples from the Alpaca dataset, while RepE uses negatives generated by the model. There may also be differences in whether you’re examining activations in every token vs. in the last token of the generation.

aog Nov 15, 2023, 7:05 PM
2 points
−1
on: New report: “Scheming AIs: Will AIs fake alignment during training in order to get power?”
When considering whether deceptive alignment would lead to catastrophe, I think it’s also important to note that deceptively aligned AIs could pursue misaligned goals in sub-catastrophic ways.
Suppose GPT-5 terminally values paperclips. It might try to topple humanity, but there’s a reasonable chance it would fail. Instead, it could pursue the simpler strategies of suggesting users purchase more paperclips, or escaping the lab and lending its abilities to human-run companies that build paperclips. These strategies would offer a higher probability of a smaller payoff, even if they’re likely to be detected by a human at some point.
Which strategy would the model choose? That depends on a large number of speculative considerations, such as how difficult it is to take over the world, whether the model’s goals are time-discounted or “longtermist,” and whether the model places any terminal value on human flourishing. But in the space of all possible goals, it’s not obvious to me that the best way to pursue most of them is pursuit of world domination. For a more thorough discussion of this argument, I’d strongly recommend the first two sections of Instrumental Convergence?.

aog Nov 15, 2023, 7:04 PM
20 points
−4
on: New report: “Scheming AIs: Will AIs fake alignment during training in order to get power?”
Very nice, these arguments seem reasonable. I’d like to make a related point about how we might address deceptive alignment which makes me substantially more optimistic about the problem. (I’ve been meaning to write a full post on this, but this was a good impetus to make the case concisely.)
Conceptual interpretability in the vein of Collin Burns, Alex Turner, and Representation Engineering seems surprisingly close to allowing us to understand a model’s internal beliefs and detect deceptive alignment. Collin Burns’s work was very exciting to at least some people because it provided an unsupervised method for detecting a model’s beliefs. Collin’s explanation of his theory of impact is really helpful here. Broadly, because it allows us to understand a model’s beliefs without using any labels provided by humans, it should be able to detect deception in superhuman models where humans cannot provide accurate feedback.
Over the last year, there’s been a lot of research that meaningfully extends Collin’s work. I think this could be used to detect deceptively aligned models if they arise within the next few years, and I’d be really excited about more people working on it. Let me highlight just a few contributions:
- Scott Emmons and Fabien Roger showed that the most important part of Collin’s method was contrast pairs. The original paper focused on “logical consistency properties of truth” such as P(X) + P(!X) = 1. While this is an interesting idea, its performance is hardly better than a much simpler strategy relegated to Appendix G.3: taking the average difference between a model’s activations at a particular layer for many contrast pairs of the form X and !X. Collin shows this direction empirically coincides with truthfulness.
- Alex Turner and his SERI MATS stream took seriously the idea that contrast pairs could reveal directions in a model’s latent space which correspond to concepts. They calculated a “cheese vector” in a maze-solving agent as the difference in the model’s activations between when the cheese was present and when it was missing. By adding and subtracting this vector to future forward passes of the model, its behavior could be controlled in surprising ways. GPT-2 can also be subtly controlled with this technique.
- Representation Engineering starts with this idea and runs a large number of empirical experiments. It finds a concept of truthfulness that dramatically improves performance on TruthfulQA (36% to 54%), as well as concepts of power-seeking, honesty, and morality that can control the behavior of language agents.
These are all unsupervised methods for detecting model beliefs. They’ve empirically improved performance on many real world tasks today, and it seems possible that they’d soon be able to detect deceptive alignment.
(Without providing a full explanation, these two related papers (1, 2) are also interesting.)
Future work on this topic could attempt to disambiguate between directions in latent space corresponding to “human simulators” versus “direct translation” of the model’s beliefs. It could also examine whether these directions are robust to optimization pressure. For example, if you train a model to beat a lie detector test based on these methods, will the lie detector still work after many rounds of optimization? I’d also be excited about straightforward empirical extensions of these unsupervised techniques applied to standard ML benchmarks, as there are many ways to vary the methods and it’s unclear which variants would be the most effective.
What links here?
- What’s the theory of impact for activation vectors? by Chris_Leong (Feb 11, 2024, 7:34 AM; 57 points)

aog Nov 9, 2023, 6:32 AM
2 points
0
on: Five projects from AI Safety Hub Labs 2023
#5 is appears to be evidence for the hypothesis that, because pretrained foundation models understand human values before they become goal-directed, they’re more likely to optimize for human values and less likely to be deceptively aligned.

Conceptual argument for the hypothesis here: https://forum.effectivealtruism.org/posts/4MTwLjzPeaNyXomnx/deceptive-alignment-is-less-than-1-likely-by-default

aog Nov 3, 2023, 4:34 PM
21 points
9
in reply to: elifland’s comment on: Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk
Kevin Esvelt explicitly calls for not releasing future model weights.
Would sharing future model weights give everyone an amoral biotech-expert tutor? Yes.
Therefore, let’s not.

aog Nov 3, 2023, 7:40 AM
3 points
0
in reply to: aog’s comment on: Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk
Nuclear Threat Initiative has a wonderfully detailed report on AI biorisk, in which they more or less recommend that AI models which pose biorisks should not be open sourced:
Access controls for AI models. A promising approach for many types of models is the use of APIs that allow users to provide inputs and receive outputs without access to the underlying model. Maintaining control of a model ensures that built-in technical safeguards are not removed and provides opportunities for ensuring user legitimacy and detecting any potentially malicious or accidental misuse by users.

aog Nov 3, 2023, 7:18 AM
2 points
0
in reply to: jefftk’s comment on: Public Weights?
More from the NTI report:
A few experts believe that LLMs could already or soon will be able to generate ideas for simple variants of existing pathogens that could be more harmful than those that occur naturally, drawing on published research and other sources. Some experts also believe that LLMs will soon be able to access more specialized, open-source AI biodesign tools and successfully use them to generate a wide range of potential biological designs. In this way, the biosecurity implications of LLMs are linked with the capabilities of AI biodesign tools.

aog Nov 3, 2023, 7:05 AM
2 points
0
in reply to: jefftk’s comment on: Public Weights?
5% was one of several different estimates he’d heard from virologists.
Thanks, this is helpful. And I agree there’s a disanalogy between the 1918 hypothetical and the source.
it’s not clear we want a bunch of effort going into getting a really good estimate, since (a) if it turns out the probability is high then publicizing that fact likely means increasing the chance we get one and (b) building general knowledge on how to estimate the pandemic potential of viruses seems also likely net negative.
This seems like it might be overly cautious. Bioterrorism is already quite salient, especially with Rishi Sunak, the White House, and many mainstream media outlets speaking publicly about it. Even SecureBio is writing headline-grabbing papers about how AI can be used to cause pandemics. In that environment, I don’t think biologists and policymakers should refrain from gathering evidence about biorisks and how to combat them. The contribution to public awareness would be relatively small, and the benefits of a better understanding of the risks could lead to a net improvement in biosecurity.
For example, estimating the probability that known pathogens would cause 100M+ deaths if released is an extremely important question for deciding whether open source LLMs should be banned. If the answer is demonstrably yes, I’d expect the White House to significantly restrict open source LLMs within a year or two. This benefit would be far greater than the cost of raising the issue’s salience.

aog Nov 3, 2023, 6:22 AM
2 points
0
in reply to: jefftk’s comment on: Public Weights?
It sounds like it was a hypothetical estimate, not a best guess. From the transcript:
if we suppose that the 1918 strain has only a 5% chance of actually causing a pandemic if it were to infect a few people today. And let’s assume...
Here’s another source which calculates that the annual probability of more than 100M influenza deaths is 0.01%, or that we should expect one such pandemic every 10,000 years. This seems to be fitted on historical data which does not include deliberate bioterrorism, so we should revise that estimate upwards, but I’m not sure the extent to which the estimate is driven by low probability of a dangerous strain being reintroduced vs. an expectation of low death count even with bioterrorism.
From my inside view, it would surprise me if no known pathogens are capable of causing pandemics! But it’s stated as fact in the executive summary of Delay, Detect, Defend and in the NTI report, so currently I’m inclined to trust it. I’m trying to build a better nuts and bolts understanding of biorisks so I’d be interested in any other data points here.

aog Nov 2, 2023, 7:56 PM
11 points
2
on: Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk
The most recent SecureBio paper provides another policy option which I find more reasonable. AI developers would be held strictly liable for any catastrophes involving their AI systems, where catastrophes could be e.g. hundreds of lives lost or $100M+ in economic damages. They’d also be required to obtain insurance for that risk.
If the risks are genuinely high, then insurance will be expensive, and companies may choose to take precautions such as keeping models closed source in order to lower their insurance costs. On the other hand, if risks are demonstrably low, then insurance will be cheap even if companies choose to open source their models.
What links here?
- DanielFilan's comment on Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk by 1a3orn (Nov 2, 2023, 9:02 PM; 8 points)

aog Nov 2, 2023, 7:52 PM
12 points
9
in reply to: elifland’s comment on: Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk
I think it’s quite possible that open source LLMs above the capability of GPT-4 will be banned within the next two years on the grounds of biorisk.
The White House Executive Order requests a government report on the costs and benefits of open source frontier models and recommended policy actions. It also requires companies to report on the steps they take to secure model weights. These are the kinds of actions the government would take if they were concerned about open source models and thinking about banning them.
This seems like a foreseeable consequence of many of the papers above, and perhaps the explicit goal.

aog Nov 2, 2023, 7:42 PM
12 points
4
on: Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk
Thank you for writing this up. I agree that there’s little evidence that today’s language models are more useful than the internet in helping someone build a bioweapon. On the other hand, future language models are quite likely to be more useful than existing resources in providing instructions for building a bioweapon.
As an example of why LLMs are more helpful than the internet, look at coding. If you want to build a custom webapp, you could spend hours learning about it online. But it’s much easier to ask ChatGPT to do it for you.
Therefore, if you want to argue against the conclusion that we should eventually ban open source LLMs on the grounds of biorisk, you should not rely on the poor capabilities of current models as your key premise.
The stronger argument is that catastrophic bioterrorism would likely require inventing new kinds of viruses that are not publicly known today. From the most recent SecureBio paper:
Fortunately, the scientific literature does not yet feature viruses that are particularly likely to cause a new pandemic if deliberately released (with the notable exception of smallpox, which is largely inaccessible to non-state actors due to its large genome and complex assembly requirements). Threats from historical pandemic viruses are mitigated by population immunity to modern-day descendants and by medical countermeasures, and while some research agencies actively support efforts to find or create new potential pandemic viruses and share their genome sequences in hopes of developing better defenses, their efforts have not yet succeeded in identifying credible examples.
This is confirmed by Esvelt’s earlier paper “Delay, Detect, Defend,” which says:
We don’t yet know of any credible viruses that could cause new pandemics, but ongoing research projects aim to publicly identify them. Identifying a sequenced virus as pandemic-capable will allow >1,000 individuals to assemble it.
As well as by a recent report from the Nuclear Threat Initiative:
Furthermore, current LLMs are unlikely to generate toxin or pathogen designs that are not already described in the public literature, and it is likely they will only be able to do this in the future by incorporating more specialized AI biodesign tools.
This would indicate that LLMs alone will never be sufficient to create pathogens which lead to catastrophic pandemics. The real risk would come from biological design tools (BDTs), which are AI systems capable of designing new pathogens that are more lethal and transmissible than existing ones. I’m not aware of any existing BDTs that would allow you to design more capable pathogens, but if they exist or emerge, we could place specific restrictions on those models. This would be far less intrusive than banning all open source LLMs.

aog Nov 2, 2023, 5:08 PM
2 points
0
in reply to: aog’s comment on: Public Weights?
And from a new NTI report: “Furthermore, current LLMs are unlikely to generate toxin or pathogen designs that are not already described in the public literature, and it is likely they will only be able to do this in the future by incorporating more specialized AI biodesign tools.”

https://www.nti.org/wp-content/uploads/2023/10/NTIBIO_AI_FINAL.pdf

aog Nov 2, 2023, 3:14 PM
2 points
0
in reply to: jefftk’s comment on: Public Weights?
I’m more pessimistic about being able to restrict BDTs than general LLMs, but I also think this would be very good.
Why do you think so? LLMs seem far more useful to a far wider group of people than BDTs, so I would it to be easier to ban an application specific technology rather than a general one. The White House Executive Order requires mandatory reporting of AI trained on biological data of a lower FLOP count than for any other kind of data, meaning they’re concerned that AI + Bio models are particularly dangerous.
Restricting something that biologists are already doing would create a natural constituency of biologists opposed to your policy. But the same could be said of restricting open source LLMs—there are probably many more people using open source LLMs than using biological AI models.
Maybe bio policies will be harder to change because they’re more established, whereas open source LLMs are new and therefore a more viable target for policy progress?

aog Nov 2, 2023, 3:07 PM
2 points
0
in reply to: jefftk’s comment on: Public Weights?
I take the following quote from the paper as evidence that virologists today are incapable of identifying pandemic potential pathogens, even with funding and support from government agencies:
some research agencies actively support efforts to find or create new potential pandemic viruses and share their genome sequences in hopes of developing better defenses, their efforts have not yet succeeded in identifying credible examples.
Corroborating this is Kevin Esvelt’s paper Delay, Detect, Defend, which says:
We don’t yet know of any credible viruses that could cause new pandemics, but ongoing research projects aim to publicly identify them. Identifying a sequenced virus as pandemic-capable will allow >1,000 individuals to assemble it.
Perhaps these quotes are focusing on global catastrophic biorisks, which would be more destructive than typical pandemics. I think this is an important distinction: we might accept extreme sacrifices (e.g. state-mandated vaccination) to prevent a pandemic from killing billions, without being willing to accept those sacrifices to avoid COVID-19.
I’d be interested to read any other relevant sources here.

aog Nov 2, 2023, 4:40 AM
13 points
2
on: Public Weights?
Could a virologist actually tell you how to start a pandemic? The paper you’re discussing says they couldn’t:
Fortunately, the scientific literature does not yet feature viruses that are particularly likely to cause a new pandemic if deliberately released (with the notable exception of smallpox, which is largely inaccessible to non-state actors due to its large genome and complex assembly requirements). Threats from historical pandemic viruses are mitigated by population immunity to modern-day descendants and by medical countermeasures, and while some research agencies actively support efforts to find or create new potential pandemic viruses and share their genome sequences in hopes of developing better defenses, their efforts have not yet succeeded in identifying credible examples.
The real risk would come from biological design tools (BDTs), or other AI systems capable of designing new pathogens that are more lethal and transmissible than existing ones. I’m not aware of any existing BDTs that would allow you to design more capable pathogens, but if they exist or emerge, we could place specific restrictions on those models. This would be far less costly than banning all open source LLMs.

aog

Ad­ver­sar­ial Ro­bust­ness Could Help Prevent Catas­trophic Misuse

Adversarial Robustness Could Help Prevent Catastrophic Misuse