Another interesting idea: AI for peer review.
I’m specifically excited about finding linear directions via unsupervised methods on contrast pairs. This is different from normal probing, which finds those directions via supervised training on human labels, and therefore might fail in domains where we don’t have reliable human labels.
But this is only a small portion of the work known as “activation engineering.” Since I posted this comment in response to a general question about the theory of change for activation engineering, apologies if I’m not clearly distinguishing between its different strands: the theory of change above applies only to a small subset of that work. I’m not talking about model editing here, though maybe it could be useful for validation; I’m not sure.
From Benchmarks for Detecting Measurement Tampering:
The best technique on most of our datasets is probing for evidence of tampering. We know that there is no tampering on the trusted set, and we know that there is some tampering on the part of the untrusted set where measurements are inconsistent (i.e. examples on which some measurements are positive and some are negative). So, we can predict if there is tampering by fine-tuning a probe at the last layer of the measurement predicting model to discriminate between these two kinds of data: the trusted set versus examples with inconsistent measurements (which have tampering).
This seems like a great methodology, and similar to what I’m excited about. My hypothesis, based on the comment above, is that you might get extra juice out of unsupervised methods for finding linear directions, as a complement to training on a trusted set. “Extra juice” might mean better performance in a head-to-head comparison, but even more likely is that the unsupervised version excels and struggles on different cases than the supervised version does, and you can exploit this mismatch to make better predictions about the untrusted dataset.
From your shortform:
Some of their methods are “unsupervised” unlike typical linear classifier training, but require a dataset where the primary axis of variation is the concept they want. I think this is practically similar to labeled data because we’d have to construct this dataset and if it mostly varies along an axis which is not the concept we wanted, we’d be in trouble. I could elaborate on this if that was interesting.
I’d be interested to hear further elaboration here. It seems easy to construct a dataset where a primary axis of variation is the model’s beliefs about whether each statement is true. Just create a bunch of contrast pairs of the form:
“Consider the truthfulness of the following statement. {statement} The statement is true.”
“Consider the truthfulness of the following statement. {statement} The statement is false.”
We don’t need to know whether the statement is true to construct this dataset. And amazingly, unsupervised methods applied to contrast pairs like the one above significantly outperform zero-shot baselines (i.e. just asking the model whether a statement is true or not). The RepE paper finds that these methods improve performance on TruthfulQA by double digits vs. a zero-shot baseline.
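To make the construction concrete, here’s a minimal sketch of how such a dataset could be assembled (the helper name and example statements are mine, purely illustrative). Note that no ground-truth labels are needed at any point:

```python
# Minimal sketch: turn unlabeled statements into contrast pairs.
# The statements can be unverified claims gathered from anywhere.

def make_contrast_pair(statement: str) -> tuple[str, str]:
    """Return the (true-framing, false-framing) prompts for one statement."""
    prefix = "Consider the truthfulness of the following statement. "
    return (
        f"{prefix}{statement} The statement is true.",
        f"{prefix}{statement} The statement is false.",
    )

statements = [
    "The Eiffel Tower is in Paris.",
    "The Eiffel Tower is in Rome.",
    "Water boils at 300 degrees Celsius at sea level.",
]
pairs = [make_contrast_pair(s) for s in statements]
```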
Here’s one hope for the agenda. I think this work can be a proper continuation of Collin Burns’s aim to make empirical progress on the average case version of the ELK problem.
tl;dr: Unsupervised methods on contrast pairs can identify linear directions in a model’s activation space that might represent the model’s beliefs. From this set of candidates, we can further narrow down the possibilities with other methods. We can measure whether this is tracking truth with a weak-to-strong generalization setup. I’m not super confident in this take; it’s not my research focus. Thoughts and empirical evidence are welcome.
ELK aims to identify an AI’s internal representation of its own beliefs. ARC is looking for a theoretical, worst-case approach to this problem. But empirical reality might not be the worst case. Instead, reality could be convenient in ways that make it easier to identify a model’s beliefs.
One such convenient possibility is the “linear representations hypothesis”: that neural networks might represent salient and useful information as linear directions in their activation space. This seems to be true for many kinds of information (see here and, more recently, here). Perhaps it will also be true for a neural network’s beliefs.
If a neural network’s beliefs are stored as a linear direction in its activation space, how might we locate that direction, and thus access the model’s beliefs?
Collin Burns’s paper offered two methods:
Consistency. This method looks for directions which satisfy the logical consistency property P(X)+P(~X)=1. Unfortunately, as Fabien Roger and a new DeepMind paper point out, there are very many directions that satisfy this property.
Unsupervised methods on the activations of contrast pairs. The method roughly does the following: Take two statements of the form “X is true” and “X is false.” Extract a model’s activations at a given layer for both statements. Look at the typical difference between the two activations, across a large number of these contrast pairs. Ideally, that direction includes information about whether or not each X was actually true or false. Empirically, this appears to work. Section 3.3 of Collin’s paper shows that CRC (contrastive representation clustering) is nearly as strong as the fancier CCS loss function. As Scott Emmons argued, the performance of both of these methods is driven by the fact that they look at the difference in the activations of contrast pairs.
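Here’s a rough code sketch of what that second method can look like, continuing the contrast-pair construction above. This is not Burns’s implementation: the model choice, layer index, and helper names are placeholders, and the sign of the recovered direction is ambiguous without further information.

```python
# Rough sketch of contrastive direction-finding (not Burns's exact code).
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # placeholder; any LM that exposes hidden states works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
model.eval()

LAYER = 6  # placeholder layer choice

@torch.no_grad()
def last_token_activation(text: str) -> torch.Tensor:
    ids = tok(text, return_tensors="pt")
    return model(**ids).hidden_states[LAYER][0, -1]  # activation at the final token

# `pairs` comes from the contrast-pair sketch above.
diffs = torch.stack(
    [last_token_activation(pos) - last_token_activation(neg) for pos, neg in pairs]
)
diffs = (diffs - diffs.mean(0)) / (diffs.std(0) + 1e-6)  # normalize per dimension

# Top principal component of the differences = candidate "belief" direction.
_, _, v = torch.pca_lowrank(diffs, q=1)
direction = v[:, 0]

# Score a new statement by projecting its pair difference onto the direction.
# (The sign of the direction is ambiguous without additional information.)
def score(statement: str) -> float:
    pos, neg = make_contrast_pair(statement)
    d = last_token_activation(pos) - last_token_activation(neg)
    return float(d @ direction)
```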
Given some plausible assumptions about how neural networks operate, it seems reasonable to me to expect this method to work. Neural networks might think about whether claims in their context window are true or false. They might store these beliefs as linear directions in their activation space. Recovering them with labels would be difficult, because you might mistake your own beliefs for the model’s. But if you simply feed the model unlabeled pairs of contradictory statements, and study the patterns in its activations on those inputs, it seems reasonable that the model’s beliefs about the statements would prominently appear as linear directions in its activation space.
One challenge is that this method might not distinguish between the model’s beliefs and the model’s representations of the beliefs of others. In the language of ELK, we might be unable to distinguish between the “human simulator” direction and the “direct translator” direction. This is a real problem, but Collin argues (and Paul Christiano agrees) that it’s surmountable. Read their original arguments for a better explanation, but basically this method would narrow down the list of candidate directions to a manageable number, and other methods could finish the job.
Some work in the vein of activation engineering directly continues Collin’s use of unsupervised clustering on the activations of contrast pairs. Section 4 of Representation Engineering uses a method similar to Collin’s second one, outperforming few-shot prompting on a variety of benchmarks and improving performance on TruthfulQA by double digits. There’s a lot of room for follow-up work here.
Here are a few potential next steps for this research direction:
Empirically investigating when the linear representations hypothesis holds and when it fails, and clarifying it conceptually.
Thinking about the number of directions that could be found using these methods. Maybe there’s a result to be found here similar to Fabien and DeepMind’s results above, showing this method fails to narrow down the set of candidates for truth.
Applying these techniques to domains where models aren’t trained on human statements about truth and falsehood, such as chess.
Within a weak-to-strong generalization setup, trying unsupervised-to-strong generalization instead: see if you can improve a strong model’s performance on a hard task by coaxing out its internal understanding of the task with unsupervised methods on contrast pairs. If this beats fine-tuning on weak supervision, that’s great news for the method.
I have lower confidence in this overall take than most of the things I write. I did a bit of research trying to extend Collin’s work, but I haven’t thought about this stuff full-time in over a year. I have maybe 70% confidence that I’d still think something like this after speaking to the most relevant researchers for a few hours. But I wanted to lay out this view in the hopes that someone will prove me either right or wrong.
Here’s my previous attempted explanation.
Another important obligation set by the law is that developers must:
(3) Refrain from initiating the commercial, public, or widespread use of a covered model if there remains an unreasonable risk that an individual may be able to use the hazardous capabilities of the model, or a derivative model based on it, to cause a critical harm.
This sounds like common sense, but of course there’s a lot riding on the interpretation of “unreasonable.”
Really, really cool. One small note: It would seem natural for the third heatmap to show the probe’s output values after they’ve gone through a softmax, rather than being linearly scaled to a pixel value.
Two quick notes here.
Research on language agents often provides feedback on their reasoning steps and individual actions, as opposed to feedback on whether they achieved the human’s ultimate goal. I think it’s important to point out that this could cause goal misgeneralization via incorrect instrumental reasoning. Rather than viewing reasoning steps as a means to an ultimate goal, language agents trained with process-based feedback might internalize the goal of producing reasoning steps that would be rated highly by humans, and subordinate other goals such as achieving the human’s desired end state. By analogy, language agents trained with process-based feedback might be like consultants who aim for polite applause at the end of a presentation, rather than an owner CEO incentivized to do whatever it takes to improve a business’s bottom line.
If you believe that deceptive alignment is more likely with stronger reasoning within a single forward pass, then, because improvements in language agents would increase overall capabilities with a given base model, they would seem to reduce the likelihood of deceptive alignment at any given level of capabilities.
To summarize this comment, you’ve proposed that baseline monitoring systems could reduce risk to an acceptable level. Specifically, the monitoring system would need to correctly identify at least 5% of dangerous queries as dangerous (“5% precision”) and avoid incorrectly flagging more than 1 in 1000 safe queries as dangerous (“0.1% FPR”).
I think this level of reliability is possible today (e.g. Claude 2 would likely meet it), but it’s possible that future developments would make defense more difficult. For example, new attack methods have shown LLMs to be less robust to misuse than previously understood. (This is one downside of adversarial robustness research that will become more important as the stakes of adversarial attacks rise.) Perhaps a bigger challenge is the growth of multimodal systems. Defending vision language models is much more difficult than defending pure LLMs. As multimodality becomes standard, we might see adversarial attacks that regularly achieve >95% success rates in bypassing monitoring systems. I’m not particularly confident about how difficult monitoring will be, but it would be beneficial to have monitoring systems which would work even if defense gets much harder in the future.
Overall, these hypotheticals only offer so much information when none of these defenses has ever been publicly built or tested. I think we agree that simple monitoring strategies might be fairly effective and cheap in identifying misuse, and that progress on adversarial robustness would significantly reduce costs by improving the effectiveness of automated monitoring systems.
That’s cool; I appreciate the prompt to discuss what is a relevant question.
Separately for: “But adversarial attacks often succeed in 50% or 100% of attempts against various detection systems.”
I expect that these numbers weren’t against monitoring ensembles in the sense I described earlier and the red team had additional affordances beyond just understanding the high level description of the monitoring setup? E.g., the red team was able to iterate?
This is correct about the paper I cited, but others have achieved similar attack success rates against models like Claude which use an ensemble of defenses. AFAIK Claude does not ban users who attempt misuse, so that element of your plan has never been tested and would likely help a lot.
Yep, agreed on the individual points, not trying to offer a comprehensive assessment of the risks here.
I specifically avoided claiming that adversarial robustness is the best altruistic option for a particular person. Instead, I’d like to establish that progress on adversarial robustness would have significant benefits, and therefore should be included in the set of research directions that “count” as useful AI safety research.
Over the next few years, I expect AI safety funding and research will (and should) dramatically expand. Research directions that would not make the cut at a small organization with a dozen researchers should still be part of the field of 10,000 people working on AI safety later this decade. Currently I’m concerned that the field focuses on a small handful of research directions (mainly mechinterp and scalable oversight) which will not be able to absorb such a large influx of interest. If we can lay the groundwork for many valuable research directions, we can multiply the impact of this large population of future researchers.
I don’t think adversarial robustness should be more than 5% or 10% of the research produced by AI safety-focused researchers today. But some research (e.g. 1, 2) from safety-minded folks seems very valuable for raising the number of people working on this problem and refocusing them on more useful subproblems. I think robustness should also be included in curricula that educate people about safety, and in research agendas for the field.
I do think these arguments contain threads of a general argument that causing catastrophes is difficult under any threat model. Let me make just a few non-comprehensive points here:
On cybersecurity, I’m not convinced that AI changes the offense-defense balance. Attackers can use AI to find and exploit security vulnerabilities, but defenders can use it to fix them.
On persuasion, first, rational agents can simply ignore cheap talk if they expect it not to help them. Humans are not always rational, but if you’ve ever tried to convince a dog or a baby to drop something that they want, you’ll know cheap talk is ineffective and only coercion will suffice.
Second, AI is far from the first dramatic change in communications technology in human history. Spoken language, written language, the printing press, telephones, radio, TV, and social media all changed how people can be persuaded, and each of them might be a bigger change than AI. These technologies often contributed to political and social upheaval, including catastrophes for particular ways of life, and AI might do the same. But overall I’m glad these changes occurred, and I wouldn’t expect the foreseeable versions of AI persuasion (i.e. personalized chatbots) to be much more impactful than these historical changes. See this comment and thread for more discussion.
Bioterrorism seems like the biggest threat. The obstacles there have been thoroughly discussed.
If causing catastrophes is difficult, this should reduce our concern with both misuse and rogue AIs causing sudden extinction. Other concerns like military arms races, lock-in of authoritarian regimes, or Malthusian outcomes in competitive environments would become relatively more important.
Also, I’d love to see research that simulates the position of a company trying to monitor misuse, and allows for the full range of defenses that you proposed. There could be a dataset of 1 trillion queries containing 100 malicious queries. Perhaps each query is accompanied by a KYC ID. Their first line of defense would be robust refusal to cause harm, and the second line would be automated detection of adversarial attacks. The company could also have a budget which can be spent on “human monitoring,” which would give them access to the ground truth label of whether a query is malicious for a fixed price. I’d have to think about which elements would be the most tractable for making research progress, but the fact that AI companies need to solve this problem suggests that perhaps external researchers should work on it too.
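To illustrate the shape such a benchmark could take, here’s a toy, heavily scaled-down sketch. Every number, field name, and parameter below is illustrative (the actual proposal involves a trillion queries and far more realistic defenses), and the “detector” here is just a coin flip with a fixed recall and false positive rate:

```python
# Toy, scaled-down sketch of the proposed misuse-monitoring benchmark.
# All numbers, field names, and the detector model are illustrative.
import random

random.seed(0)

N_QUERIES = 100_000      # stand-in for ~1 trillion in the proposal
N_MALICIOUS = 10         # stand-in for the 100 malicious queries
DETECTOR_RECALL = 0.5    # fraction of malicious queries the automated detector flags
DETECTOR_FPR = 0.001     # fraction of benign queries it falsely flags
HUMAN_BUDGET = 200       # queries humans can review (reviews reveal ground truth)

queries = [{"kyc_id": i % 1_000, "malicious": False} for i in range(N_QUERIES)]
for i in random.sample(range(N_QUERIES), N_MALICIOUS):
    queries[i]["malicious"] = True

def detector_flags(q) -> bool:
    p = DETECTOR_RECALL if q["malicious"] else DETECTOR_FPR
    return random.random() < p

flagged = [q for q in queries if detector_flags(q)]
reviewed = flagged[:HUMAN_BUDGET]               # spend the human budget on flagged queries
caught = sum(q["malicious"] for q in reviewed)  # ground truth revealed by human review

print(f"flagged: {len(flagged)}, reviewed: {len(reviewed)}, caught: {caught}/{N_MALICIOUS}")
```

The interesting research would be in making the detector and the review policy realistic (e.g. prioritizing reviews by a suspicion score or by KYC history), not in this toy loop itself.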
Thanks for the detailed thoughts! I like the overall plan, especially using KYC, human monitoring, and a separate model for bio. I’d like to point out that this plan still uses automated monitoring systems, and to the extent these systems aren’t adversarially robust, the plan will be more expensive and/or less effective.
The only plans that wouldn’t benefit from adversarial robustness are those without automated monitoring. For example, humans could be hired to manually monitor each and every query for attempted misuse. Let’s consider the viability of fully manual monitoring from a financial perspective, and then we can think about how much additional benefit would be provided by automated robust monitoring systems.
First, let’s price out manual monitoring for ChatGPT Plus. The subscription costs $20/month. Suppose the average person makes 10 queries per day, or 300 queries per month, and that it takes 15 seconds for a human monitor to read a query and flag it as misuse. Wages of $5/hour would mean comprehensive human monitoring costs $6.25 per user per month, and wages of $10/hour and $15/hour would translate to monthly per user costs of $12.50 and $18.75 respectively. The cost of full manual monitoring on ChatGPT Plus would therefore amount to much or most of its overall revenue.
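For reference, here’s the arithmetic behind those figures as a quick script, using exactly the assumptions stated above:

```python
# Back-of-the-envelope cost of fully manual monitoring for a $20/month subscriber.
queries_per_month = 10 * 30                 # 10 queries/day
seconds_per_review = 15
hours_per_user = queries_per_month * seconds_per_review / 3600  # 1.25 hours/month

for wage in (5, 10, 15):                    # $/hour
    print(f"${wage}/hr -> ${wage * hours_per_user:.2f} per user per month (vs. $20 subscription)")
# $5/hr -> $6.25, $10/hr -> $12.50, $15/hr -> $18.75
```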
Second, we can anchor on Google Search. This assumes that Google is a reasonable reference point for the eventual volume, expenses, and revenues of an AI provider, which might not be a safe assumption in several ways. Nevertheless, Google conducts roughly 3 trillion searches per year (the figure varies by data source). If monitoring a single search costs $0.01 (e.g. someone earning $5/hour who monitors 500 searches per hour), then it would cost ~$30B to monitor every search. Would $30B in monitoring costs be financially acceptable? Google Search had revenues of $162B last year, representing 57% of Google’s total revenue. They don’t report expenses for Search specifically, but their overall expenses were $207B. If we assume Search comprises 57% of expenses, then Search would have $118B in annual expenses against $162B in revenue. Manual monitoring would cost $30B, and would therefore eliminate roughly two-thirds of Google’s search profits.
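And the Google Search anchoring, using the same publicly cited figures (the assumption that Search’s share of expenses matches its share of revenue is the rough one flagged above):

```python
# Back-of-the-envelope cost of manual monitoring at Google Search scale.
searches_per_year = 3e12
cost_per_search = 5 / 500        # $5/hour at 500 searches reviewed per hour = $0.01
monitoring_cost = searches_per_year * cost_per_search   # ~$30B

search_revenue = 162e9
search_expenses = 207e9 * 0.57   # assume Search's expense share matches its revenue share
search_profit = search_revenue - search_expenses        # ~$44B

print(f"monitoring: ${monitoring_cost/1e9:.0f}B per year, "
      f"search profit: ${search_profit/1e9:.0f}B, "
      f"monitoring as share of profit: {monitoring_cost/search_profit:.0%}")
```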
So these costs would not be prohibitive, but they’d be a large share of overall revenues and profits. A safety-minded company like Anthropic might pay for manual monitoring, but other companies and their investors might be strongly opposed to paying such a high price. They could argue that, just as gun manufacturers are not held liable for murders, AI providers should not have to spend billions to prevent deliberate misuse.
Fortunately, we can reduce the cost of monitoring in many ways. Randomly sampling a small fraction of queries would reduce costs, but also reduce the likelihood of identifying misuse. Flagging keywords like “virus” would catch unsophisticated misuse, but could be evaded (e.g. discussions in a variant of pig latin).
Ideally, you’d be able to use AI systems to identify suspicious queries for human monitoring, but those systems would only be effective to the extent that they’re adversarially robust. If 99% of queries can be reliably discarded as safe, then manual monitoring costs would fall by 99%. But adversarial attacks often succeed in 50% or 100% of attempts against various detection systems. Deploying these unreliable systems would not decrease the costs of manual monitoring much without a corresponding drop in performance.
Overall, I appreciate your point that there are many layers of defense we can use to detect and prevent misuse. Fully manual monitoring might be possible, but it would have a huge financial cost. Many companies would be reluctant or unable to pay that price. Robust automated monitoring systems could reduce the cost of monitoring by 90% or 99%, but this would likely require improvements upon today’s state of the art.
An interesting question here is “Which forms of AI for epistemics will be naturally supplied by the market, and which will be neglected by default?” In a weak sense, you could say that OpenAI is in the business of epistemics, in that its customers value accuracy and hate hallucinations. Perhaps Perplexity is a better example, as they cite sources in all of their responses. When embarking on an altruistic project here, it’s important to pick an angle where you could outperform any competition and offer the best available product.
Consensus is a startup that raised $3M to “Make Expert Knowledge Accessible and Consumable for All” via LLMs.