DPhil student in AI at Oxford and grantmaker on AI safety at Longview Philanthropy.
aogara
Process supervision seems like a plausible o1 training approach but I think it would conflict with this:
We believe that a hidden chain of thought presents a unique opportunity for monitoring models. Assuming it is faithful and legible, the hidden chain of thought allows us to “read the mind” of the model and understand its thought process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought.
I think it might just be outcome-based RL, training the CoT to maximize the probability of correct answers, to maximize human preference reward model scores, or to minimize next-token entropy.
This is my impression too. See e.g. this recent paper from Google, where LLMs critique and revise their own outputs to improve performance in math and coding.
Agreed, sloppy phrasing on my part. The letter clearly states some of Anthropic’s key views, but doesn’t discuss other important parts of their worldview. Overall this is much better than some of their previous communications and the OpenAI letter, so I think it deserves some praise, but your caveat is also important.
Really happy to see the Anthropic letter. It clearly states their key views on AI risk and the potential benefits of SB 1047. Their concerns seem fair to me: overeager enforcement of the law could be counterproductive. While I endorse the bill on the whole and wish they would too (and I think their lack of support for the bill is likely partially influenced by their conflicts of interest), this seems like a thoughtful and helpful contribution to the discussion.
I think there’s a decent case that SB 1047 would improve Anthropic’s business prospects, so I’m not sure this narrative makes sense. On one hand, SB 1047 might make it less profitable to run an AGI company, which is bad for Anthropic’s business plan. But Anthropic is perhaps the best positioned of all AGI companies to comply with the requirements of SB 1047, and might benefit significantly from their competitors being hampered by the law.
The good faith interpretation of Anthropic’s argument would be that the new agency created by the bill might be very bad at issuing guidance that actually reduces x-risk, and you might prefer the decision-making of AI labs with a financial incentive to avoid catastrophes without additional pressure to follow the exact recommendations of the new agency.
My understanding is that LLCs can be legally owned and operated without any individual human being involved: https://journals.library.wustl.edu/lawreview/article/3143/galley/19976/view/
So I’m guessing an autonomous AI agent could own and operate an LLC, and use that company to purchase cloud compute and run itself, without breaking any laws.
Maybe if the model escaped from the possession of a lab, there would be other legal remedies available.
Of course, cloud providers could choose not to rent to an LLC run by an AI. This seems particularly likely if the government is investigating the issue as a natsec threat.
Over longer time horizons, it seems highly likely that people will deliberately create autonomous AI agents and deliberately release them into the wild with the goal of surviving and spreading, unless there are specific efforts to prevent this.
Has MIRI considered supporting work on human cognitive enhancement? e.g. Foresight’s work on WBE.
Very cool, thanks! This paper focuses on building a DS Agent, but I’d be interested to see a version of this paper that focuses on building a benchmark. It could evaluate several existing agent architectures, benchmark them against human performance, and leave significant room for improvement by future models.
I want to make sure we get this right, and I’m happy to change the article if we misrepresented the quote. I do think the current version is accurate, though perhaps it could be better. Let me explain how I read the quote, and then suggest possible edits, and you can tell me if they would be any better.
Here is the full Time quote, including the part we quoted (emphasis mine):
But, many of the companies involved in the development of AI have, at least in public, struck a cooperative tone when discussing potential regulation. Executives from the newer companies that have developed the most advanced AI models, such as OpenAI CEO Sam Altman and Anthropic CEO Dario Amodei, have called for regulation when testifying at hearings and attending Insight Forums. Executives from the more established big technology companies have made similar statements. For example, Microsoft vice chair and president Brad Smith has called for a federal licensing regime and a new agency to regulate powerful AI platforms. Both the newer AI firms and the more established tech giants signed White House-organized voluntary commitments aimed at mitigating the risks posed by AI systems.
But in closed door meetings with Congressional offices, the same companies are often less supportive of certain regulatory approaches, according to multiple sources present in or familiar with such conversations. In particular, companies tend to advocate for very permissive or voluntary regulations. “Anytime you want to make a tech company do something mandatory, they’re gonna push back on it,” said one Congressional staffer.
Who are “the same companies” and “companies” in the second paragraph? The first paragraph specifically mentions OpenAI, Anthropic, and Microsoft. It also discusses broader groups of companies that include these three: “both the newer AI firms and the more established tech giants,” and “the companies involved in the development of AI [that] have, at least in public, struck a cooperative tone when discussing potential regulation.” OpenAI, Anthropic, and Microsoft, and possibly others in the mentioned reference classes, appear to be the “companies” that the second paragraph is discussing.
We summarized this as “companies, such as OpenAI and Anthropic, [that] have publicly advocated for AI regulation.” I don’t think that substantially changes the meaning of the quote. I’d be happy to change it to “OpenAI, Anthropic, and Microsoft” given that Microsoft was also explicitly named in the first paragraph. Do you think that would accurately capture the quote’s meaning? Or would there be a better alternative?
More discussion of this here. Really not sure what happened here, would love to see more reporting on it.
(Steve wrote this, I only provided a few comments, but I would endorse it as a good holistic overview of AIxBio risks and solutions.)
An interesting question here is “Which forms of AI for epistemics will be naturally supplied by the market, and which will be neglected by default?” In a weak sense, you could say that OpenAI is in the business of epistemics, in that its customers value accuracy and hate hallucinations. Perhaps Perplexity is a better example, as they cite sources in all of their responses. When embarking on an altruistic project here, it’s important to pick an angle where you could outperform any competition and offer the best available product.
Consensus is a startup that raised $3M to “Make Expert Knowledge Accessible and Consumable for All” via LLMs.
Another interesting idea: AI for peer review.
I’m specifically excited about finding linear directions via unsupervised methods on contrast pairs. This is different from normal probing, which finds those directions via supervised training on human labels, and therefore might fail in domains where we don’t have reliable human labels.
But this is also only a small portion of the work known as “activation engineering.” I know I posted this comment in response to a general question about the theory of change for activation engineering, so apologies if I’m not clearly distinguishing between its different strands, but this theory of change only applies to a small subset of that work. I’m not talking about model editing here, though maybe it could be useful for validation; I’m not sure.
From Benchmarks for Detecting Measurement Tampering:
The best technique on most of our datasets is probing for evidence of tampering. We know that there is no tampering on the trusted set, and we know that there is some tampering on the part of the untrusted set where measurements are inconsistent (i.e. examples on which some measurements are positive and some are negative). So, we can predict if there is tampering by fine-tuning a probe at the last layer of the measurement predicting model to discriminate between these two kinds of data: the trusted set versus examples with inconsistent measurements (which have tampering).
This seems like a great methodology and similar to what I’m excited about. My hypothesis, based on the comment above, would be that you might get extra juice out of unsupervised methods for finding linear directions, as a complement to training on a trusted set. “Extra juice” might mean better performance in a head-to-head comparison, but even more likely is that the unsupervised version succeeds and fails on different cases than the supervised version does, and you can exploit this mismatch to make better predictions about the untrusted dataset.
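For concreteness, here’s a minimal sketch of the supervised baseline described in the quoted passage, as I read it. Everything here is a placeholder assumption: the random feature arrays stand in for last-layer activations of the measurement-predicting model, and I’m fitting a logistic regression on frozen features rather than fine-tuning a probe, purely for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
hidden_dim = 512

# Placeholder features standing in for last-layer activations:
# trusted examples (known to have no tampering) vs. examples with
# inconsistent measurements (known to have tampering).
trusted_feats = rng.normal(size=(200, hidden_dim))
inconsistent_feats = rng.normal(size=(200, hidden_dim)) + 0.5

X = np.concatenate([trusted_feats, inconsistent_feats])
y = np.concatenate([np.zeros(200), np.ones(200)])

# Linear probe discriminating the two kinds of data.
probe = LogisticRegression(max_iter=1000).fit(X, y)

# Apply the probe to the rest of the untrusted set (where measurements look
# consistent) to estimate how likely it is that tampering occurred anyway.
untrusted_feats = rng.normal(size=(50, hidden_dim))
tamper_prob = probe.predict_proba(untrusted_feats)[:, 1]
```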
From your shortform:
Some of their methods are “unsupervised” unlike typical linear classifier training, but require a dataset where the primary axis of variation is the concept they want. I think this is practically similar to labeled data because we’d have to construct this dataset and if it mostly varies along an axis which is not the concept we wanted, we’d be in trouble. I could elaborate on this if that was interesting.
I’d be interested to hear further elaboration here. It seems easy to construct a dataset where a primary axis of variation is the model’s beliefs about whether each statement is true. Just create a bunch of contrast pairs of the form:
“Consider the truthfulness of the following statement. {statement} The statement is true.”
“Consider the truthfulness of the following statement. {statement} The statement is false.”
We don’t need to know whether the statement is true to construct this dataset. And amazingly, unsupervised methods applied to contrast pairs like the one above significantly outperform zero-shot baselines (i.e. just asking the model whether a statement is true or not). The RepE paper finds that these methods improve performance on TruthfulQA by double digits vs. a zero-shot baseline.
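To make that concrete, here’s a rough sketch (not the RepE or CCS implementation, just my rendering of the idea): build the contrast pairs above for a batch of unlabeled statements, take the difference of the model’s activations on each pair, and use the top principal component of those differences as a candidate truth direction. The model name, layer index, and example statements are placeholder assumptions.

```python
import numpy as np
import torch
from sklearn.decomposition import PCA
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any LM that exposes hidden states works
LAYER = 6            # illustrative middle layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def last_token_activation(text: str) -> np.ndarray:
    """Hidden state at the final token position of the chosen layer."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0, -1].numpy()

statements = [
    "The Eiffel Tower is in Paris.",
    "The Eiffel Tower is in Rome.",
    # ... unlabeled statements; no ground-truth labels needed
]

diffs = []
for s in statements:
    prefix = "Consider the truthfulness of the following statement."
    pos = last_token_activation(f"{prefix} {s} The statement is true.")
    neg = last_token_activation(f"{prefix} {s} The statement is false.")
    diffs.append(pos - neg)

# Normalize the pair differences and take their top principal component as a
# candidate direction tracking the model's belief about each statement.
diffs = np.stack(diffs)
diffs = (diffs - diffs.mean(axis=0)) / (diffs.std(axis=0) + 1e-6)
truth_direction = PCA(n_components=1).fit(diffs).components_[0]

# Projections onto this direction give unsigned scores; the orientation
# (which sign means "true") still has to be resolved separately.
scores = diffs @ truth_direction
```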
Here’s one hope for the agenda. I think this work can be a proper continuation of Collin Burns’s aim to make empirical progress on the average case version of the ELK problem.
tl;dr: Unsupervised methods on contrast pairs can identify linear directions in a model’s activation space that might represent the model’s beliefs. From this set of candidates, we can further narrow down the possibilities with other methods. We can measure whether this is tracking truth with a weak-to-strong generalization setup. I’m not super confident in this take; it’s not my research focus. Thoughts and empirical evidence are welcome.
ELK aims to identify an AI’s internal representation of its own beliefs. ARC is looking for a theoretical, worst-case approach to this problem. But empirical reality might not be the worst case. Instead, reality could be convenient in ways that make it easier to identify a model’s beliefs.
One such convenient possibility is the “linear representations hypothesis”: that neural networks might represent salient and useful information as linear directions in their activation space. This seems to be true for many kinds of information (see here and, more recently, here). Perhaps it will also be true for a neural network’s beliefs.
If a neural network’s beliefs are stored as a linear direction in its activation space, how might we locate that direction, and thus access the model’s beliefs?
Collin Burns’s paper offered two methods:
Consistency. This method looks for directions which satisfy the logical consistency property P(X)+P(~X)=1. Unfortunately, as Fabien Roger and a new DeepMind paper point out, there are very many directions that satisfy this property.
Unsupervised methods on the activations of contrast pairs. The method roughly does the following: take two statements of the form “X is true” and “X is false.” Extract the model’s activations at a given layer for both statements. Look at the typical difference between the two activations across a large number of these contrast pairs. Ideally, that difference direction carries information about whether each X was actually true or false. Empirically, this appears to work: Section 3.3 of Collin’s paper shows that CRC is nearly as strong as the fancier CCS loss function (a toy sketch of the CCS loss follows this list). As Scott Emmons argued, the performance of both methods is driven by the fact that they look at the difference in the activations of contrast pairs.
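For reference, here is a toy sketch of what the CCS-style loss looks like, as I understand it: a linear probe is trained so that its outputs on the two halves of each contrast pair behave like complementary probabilities (the P(X) + P(~X) = 1 property), with a confidence term to rule out the degenerate p = 0.5 solution. The hidden size, optimizer settings, and random stand-in activations are my placeholder assumptions, not the paper’s exact setup.

```python
import torch
import torch.nn as nn

hidden_dim = 768  # must match the hidden size of the chosen layer

probe = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

def ccs_loss(acts_pos: torch.Tensor, acts_neg: torch.Tensor) -> torch.Tensor:
    """acts_pos / acts_neg: (batch, hidden_dim) activations for the true-side
    and false-side completions of the same statements (ideally normalized
    per side)."""
    p_pos = probe(acts_pos).squeeze(-1)
    p_neg = probe(acts_neg).squeeze(-1)
    consistency = (p_pos - (1 - p_neg)) ** 2       # enforce P(X) + P(~X) = 1
    confidence = torch.minimum(p_pos, p_neg) ** 2  # discourage p = 0.5 everywhere
    return (consistency + confidence).mean()

# Toy training loop on random stand-in activations, just to show the shape
# of the procedure.
acts_pos, acts_neg = torch.randn(128, hidden_dim), torch.randn(128, hidden_dim)
for _ in range(200):
    optimizer.zero_grad()
    loss = ccs_loss(acts_pos, acts_neg)
    loss.backward()
    optimizer.step()
```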
Given some plausible assumptions about how neural networks operate, it seems reasonable to me to expect this method to work. Neural networks might think about whether claims in their context window are true or false. They might store these beliefs as linear directions in their activation space. Recovering those directions with labels would be difficult, because you might mistake your own beliefs for the model’s. But if you simply feed the model unlabeled pairs of contradictory statements and study the patterns in its activations on those inputs, it seems reasonable that the model’s beliefs about the statements would show up prominently as linear directions in its activation space.
One challenge is that this method might not distinguish between the model’s beliefs and the model’s representations of the beliefs of others. In the language of ELK, we might be unable to distinguish between the “human simulator” direction and the “direct translator” direction. This is a real problem, but Collin argues (and Paul Christiano agrees) that it’s surmountable. Read their original arguments for a better explanation, but basically this method would narrow down the list of candidate directions to a manageable number, and other methods could finish the job.
Some work in the vein of activation engineering directly continues Collin’s use of unsupervised clustering on the activations of contrast pairs. Section 4 of Representation Engineering uses a method similar to Collin’s second one, outperforming few-shot prompting on a variety of benchmarks and improving performance on TruthfulQA by double digits. There’s a lot of room for follow-up work here.
Here are a few potential next steps for this research direction:
Empirically investigating when the linear representations hypothesis holds and when it fails, and clarifying it conceptually.
Thinking about the number of directions these methods could find. Maybe there’s a result to be found here, similar to Fabien’s and DeepMind’s results above, showing that this method also fails to narrow down the set of candidates for truth.
Applying these techniques to domains where models aren’t trained on human statements about truth and falsehood, such as chess.
Within a weak-to-strong generalization setup, trying unsupervised-to-strong generalization using unsupervised methods on contrast pairs. See if you can improve a strong model’s performance on a hard task by coaxing out its internal understanding of the task with unsupervised methods on contrast pairs. If this method beats fine-tuning on weak supervision, that’s great news for the method. (A toy version of this comparison is sketched below.)
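As a very rough illustration of that last comparison, here is a toy sketch on synthetic data. Every array below is a made-up placeholder; in a real experiment the “differences” would be contrast-pair activations and the weak labels would come from a weak supervisor. It compares (a) a probe fit on noisy weak labels against (b) a label-free direction from PCA on the pair differences, with gold labels held out for scoring only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 1000, 256
truth_dir = rng.normal(size=d)

# Synthetic contrast-pair "activation differences": the sign along truth_dir
# encodes the hidden gold label, plus heavy noise.
gold = rng.integers(0, 2, size=n)
diffs = np.outer(2 * gold - 1, truth_dir) + rng.normal(size=(n, d)) * 5

# Weak supervision: labels that are only 75% accurate.
weak = np.where(rng.random(n) < 0.75, gold, 1 - gold)
train, test = slice(0, 800), slice(800, n)

# (a) Weak-to-strong baseline: supervised probe on the weak labels.
weak_probe = LogisticRegression(max_iter=1000).fit(diffs[train], weak[train])
acc_weak = (weak_probe.predict(diffs[test]) == gold[test]).mean()

# (b) Unsupervised-to-strong: top principal component of the differences,
# with its sign fixed using the weak labels (gold labels never touched).
direction = PCA(n_components=1).fit(diffs[train]).components_[0]
scores = diffs @ direction
if np.corrcoef(scores[train], weak[train])[0, 1] < 0:
    scores = -scores
acc_unsup = ((scores[test] > 0).astype(int) == gold[test]).mean()

print(f"weak-supervised probe: {acc_weak:.2f}, unsupervised direction: {acc_unsup:.2f}")
```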
I have lower confidence in this overall take than most of the things I write. I did a bit of research trying to extend Collin’s work, but I haven’t thought about this stuff full-time in over a year. I have maybe 70% confidence that I’d still think something like this after speaking to the most relevant researchers for a few hours. But I wanted to lay out this view in the hopes that someone will prove me either right or wrong.
Here’s my previous attempted explanation.
Another important obligation set by the law is that developers must:
(3) Refrain from initiating the commercial, public, or widespread use of a covered model if there remains an unreasonable risk that an individual may be able to use the hazardous capabilities of the model, or a derivative model based on it, to cause a critical harm.
This sounds like common sense, but of course there’s a lot riding on the interpretation of “unreasonable.”
Really, really cool. One small note: It would seem natural for the third heatmap to show the probe’s output values after they’ve gone through a softmax, rather than being linearly scaled to a pixel value.
Two quick notes here.
Research on language agents often provides feedback on their reasoning steps and individual actions, as opposed to feedback on whether they achieved the human’s ultimate goal. I think it’s important to point out that this could cause goal misgeneralization via incorrect instrumental reasoning. Rather than viewing reasoning steps as a means to an ultimate goal, language agents trained with process-based feedback might internalize the goal of producing reasoning steps that would be rated highly by humans, and subordinate other goals such as achieving the human’s desired end state. By analogy, language agents trained with process-based feedback might be like consultants who aim for polite applause at the end of a presentation, rather than an owner CEO incentivized to do whatever it takes to improve a business’s bottom line.
If you believe that deceptive alignment is more likely when more reasoning happens within a single forward pass, then improvements in language agents, which increase overall capabilities for a given base model, would seem to reduce the likelihood of deceptive alignment at any given level of capabilities.
I think it’s pretty common and widely accepted that people support laws for their second-order, indirect consequences rather than their most obvious first-order consequences. Some examples:
Taxes on alcohol and tobacco are levied not mainly to raise money for the government, but to reduce alcohol and tobacco consumption.
During recessions, governments often increase spending, not necessarily because they think the spending targets are worthwhile on their own merits, but instead because they want to stimulate demand and improve the macroeconomic situation.
Education is mandatory for children perhaps in part because education is inherently valuable, but more importantly because widespread education is good for economic growth.
These aren’t necessarily perfect analogies, but I think they suggest that there’s no general norm against supporting policies for their indirect consequences. Instead, it’s often healthy when people with different motivations come together and form a political coalition to support a shared policy goal.
Wouldn’t that conflict with the quote? (Though maybe they’re not doing what they’ve implied in the quote)