Computer science master’s student interested in AI and AI safety.
Stephen McAleese
I strong upvoted as well. This post is thorough and unbiased and seems like one of the best resources for learning about representation engineering.
I’ll use the definition of optimization from Wikipedia: “Mathematical optimization is the selection of a best element, with regard to some criteria, from some set of available alternatives”.
Best-of-n or rejection sampling is an alternative to RLHF that involves generating multiple responses from an LLM and returning the one with the highest reward model score. I think it’s reasonable to describe this process as optimizing for reward because it’s searching for LLM outputs that achieve the highest reward from the reward model.
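As a rough sketch of what I mean (generate and reward_model here are hypothetical placeholder functions, not any particular library’s API):

```python
# Minimal best-of-n / rejection sampling sketch.
# `generate` samples one response from the LLM, `reward_model` scores it.
def best_of_n(prompt, generate, reward_model, n=16):
    """Sample n candidate responses and return the one with the highest reward."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward_model(prompt, response) for response in candidates]
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index]
```

The search is over the LLM’s own samples, so increasing n is a crude way of applying more optimization pressure toward high-reward outputs.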
I’d also argue that AlphaGo/AlphaZero is optimizing for reward. The AlphaGo paper says, “At each time step $t$ of each simulation, an action $a_t$ is selected from state $s_t$ so as to maximize action value plus a bonus”, and the formula is $a_t = \operatorname{argmax}_a \left( Q(s_t, a) + u(s_t, a) \right)$, where $u(s_t, a)$ is an exploration bonus.
Action values Q are calculated as the mean value (estimated probability of winning) of all board states in the subtree below an action. The value of each possible future board state is calculated using a combination of a value function estimation for that state and the mean outcome of dozens of random rollouts until the end of the game (return +1 or −1 depending on who wins).
The value function predicts the return (expected sum of future reward) from a position, whereas the random rollouts calculate the actual average reward by simulating future moves until the end of the game, when the reward function returns +1 or −1.
So I think AlphaZero is optimizing for a combination of predicted reward (from the value function) and actual reward which is calculated using multiple rollouts until the end of the game.
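If I’m reading the AlphaGo paper correctly, the leaf evaluation combines these two estimates with a mixing parameter $\lambda$:

$$V(s_L) = (1 - \lambda)\, v_\theta(s_L) + \lambda z_L$$

where $v_\theta(s_L)$ is the value network’s prediction for the leaf state $s_L$ and $z_L$ is the outcome of a fast rollout played out from that state.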
SummaryBot summary from the EA Forum:
Executive summary: Geoffrey Hinton, a pioneer in AI, discusses the history and current state of neural networks, and warns about potential existential risks from superintelligent AI while suggesting ways to mitigate these risks.
Key points:
Neural networks, initially unpopular, became dominant in AI due to increased computational power and data availability.
Hinton argues that large language models (LLMs) truly understand language, similar to how the human brain processes information.
Digital neural networks have advantages over biological ones, including easier information sharing and potentially superior learning algorithms.
Hinton believes there’s a 50% chance AI will surpass human intelligence within 20 years, with a 10-20% risk of causing human extinction.
To mitigate risks, Hinton suggests government-mandated AI safety research and international cooperation.
Two possible future scenarios: AI takeover leading to human extinction, or humans successfully coexisting with superintelligent AI assistants.
Maybe. The analogy he gives is that the AI could be like a very intelligent personal assistant to a relatively dumb CEO. The CEO is still in charge but it makes sense to delegate a lot of tasks to the more competent assistant.
The parent and child outcome seems a bit worse than that because usually a small child is completely dependent on their parent and all their resources are controlled by the parent unless they have pocket money or something like that.
It’s an original LessWrong post by me. Though all the quotes and references are from external sources.
There’s a rule of thumb on the internet called the “1% rule”: roughly 1% of users contribute content to a forum while the other 99% only read it.
Geoffrey Hinton on the Past, Present, and Future of AI
Thank you for the insightful comment.
On the graph of alignment difficulty and cost, I think the shape depends on the inherent increase in alignment cost and the degree of automation we can expect, which is similar to the idea of the offence-defence balance.
In the worst case, the cost of implementing alignment solutions increases exponentially with alignment difficulty and then maybe automation would lower it to a linear increase.
In the best case, automation covers all of the costs associated with increasing alignment difficulty and the graph is flat in terms of human effort and more advanced alignment solutions aren’t any harder to implement than earlier, simpler ones.
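As a purely illustrative toy sketch of these shapes (the numbers and functional forms are made up, not a claim about the real relationship):

```python
# Toy illustration of three possible alignment-cost curves (made-up shapes).
import numpy as np
import matplotlib.pyplot as plt

difficulty = np.linspace(1, 10, 100)
worst_case = np.exp(0.5 * difficulty)        # no automation: cost grows exponentially
partial_automation = 5 * difficulty          # automation flattens it to roughly linear
best_case = np.full_like(difficulty, 10.0)   # automation absorbs the extra cost entirely

plt.plot(difficulty, worst_case, label="No automation (exponential)")
plt.plot(difficulty, partial_automation, label="Partial automation (linear)")
plt.plot(difficulty, best_case, label="Full automation (flat)")
plt.xlabel("Alignment difficulty level")
plt.ylabel("Human effort required")
plt.legend()
plt.show()
```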
The rate of progress on the MATH dataset is incredible and faster than I expected.
The MATH dataset consists of competition math problems for high school students and was introduced in 2021. According to a blog post by Jacob Steinhardt (one of the dataset’s authors), 2021 models such as GPT-3 solved ~7% of questions, a Berkeley PhD student solved ~75%, and an IMO gold medalist solved ~90%.
The blog post predicted that ML models would achieve ~50% accuracy on the MATH dataset by June 30, 2025, and ~80% accuracy by 2028.
But recently (September 2024), OpenAI released their new o1 model which achieved ~95% on the MATH dataset.
So it seems like we’re getting 2028 performance on the MATH dataset already in 2024.
Quote from the blog post:
“If I imagine an ML system getting more than half of these questions right, I would be pretty impressed. If they got 80% right, I would be super-impressed. The forecasts themselves predict accelerating progress through 2025 (21% in 2023, then 31% in 2024 and 52% in 2025), so 80% by 2028 or so is consistent with the predicted trend. This still just seems wild to me and I’m really curious how the forecasters are reasoning about this.”
Thank you for writing this insightful and thorough post on different AI alignment difficulties and possible probability distributions over alignment difficulty levels.
The cost of advancing alignment research rises faster at higher difficulty levels: much more effort and investment is required to produce the same amount of progress towards adequacy at level 7 than at level 3. This cost increases for several reasons. Most obviously, more resources, time, and effort are required to develop and implement these more sophisticated alignment techniques. But there are other reasons, such as that higher level failures cannot yet be experimentally demonstrated, so developing mitigations for them has to rely on (possibly unrepresentative) toy models instead of reacting to the failures of current systems.
Note that although implementing better alignment solutions would probably be more costly, advancements in AI capabilities could flatten the cost curve by automating some of the work. For example, constitutional AI seems significantly more complex than regular RLHF, but it might not be much harder for organizations to implement due to partial automation (e.g. RLAIF). So even if future alignment techniques are much more complex than today’s, they might not be significantly harder to implement (in terms of human effort) due to increased automation and AI involvement.
Nice paper! I found reading it quite insightful. Here are some key extracts from the paper:
Improving adversarial robustness by classifying several down-sampled noisy images at once:
“Drawing inspiration from biology [eye saccades], we use multiple versions of the same image at once, downsampled to lower resolutions and augmented with stochastic jitter and noise. We train a model to classify this channel-wise stack of images simultaneously. We show that this by default yields gains in adversarial robustness without any explicit adversarial training.”

Improving adversarial robustness by using an ensemble of intermediate layer predictions:
“Using intermediate layer predictions. We show experimentally that a successful adversarial attack on a classifier does not fully confuse its intermediate layer features (see Figure 5). An image of a dog attacked to look like e.g. a car to the classifier still has predominantly dog-like intermediate layer features. We harness this de-correlation as an active defense by CrossMax ensembling the predictions of intermediate layers. This allows the network to dynamically respond to the attack, forcing it to produce consistent attacks over all layers, leading to robustness and interpretability.”
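Here’s a rough PyTorch sketch of how I picture the channel-wise multi-resolution stacking from the first quote (the resolutions, jitter range, and noise scale are my own guesses, not the paper’s settings):

```python
# Sketch: build a channel-wise stack of downsampled, jittered, noisy copies of an image.
import torch
import torch.nn.functional as F

def multi_resolution_stack(image, resolutions=(32, 16, 8), jitter=2, noise_std=0.1):
    """image: (B, C, H, W) tensor. Returns a (B, C * len(resolutions), H, W) stack."""
    _, _, height, width = image.shape
    versions = []
    for res in resolutions:
        # Random spatial jitter.
        shift_h = int(torch.randint(-jitter, jitter + 1, (1,)))
        shift_w = int(torch.randint(-jitter, jitter + 1, (1,)))
        jittered = torch.roll(image, shifts=(shift_h, shift_w), dims=(2, 3))
        # Downsample, then upsample back so every version has the same spatial size.
        low = F.interpolate(jittered, size=(res, res), mode="bilinear", align_corners=False)
        restored = F.interpolate(low, size=(height, width), mode="bilinear", align_corners=False)
        # Add stochastic noise.
        versions.append(restored + noise_std * torch.randn_like(restored))
    return torch.cat(versions, dim=1)  # stack along the channel dimension
```

A classifier trained on this stack sees the same image at several scales at once, which is the property the authors credit for the robustness gains.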
I suspect the desire for kids/lineage is really basic for a lot of people (almost everyone?)
This seems like an important point. One of the arguments for the inner alignment problem is that evolution intended to select humans for inclusive genetic fitness (IGF) but humans were instead motivated by other goals (e.g. seeking sex) that were strongly correlated with IGF in the ancestral environment.
Then when humans’ environment changed (e.g. the invention of birth control), the correlation between these proxy goals and IGF broke down resulting in low fitness and inner misalignment.
However, this statement seems to suggest that modern humans really have internalized IGF as one of their primary objectives and that they’re inner aligned with evolution’s outer objective.
I think the Zotero PDF reader has a lot of similar features that make the experience of reading papers much better:
It has a back button so that when you click on a reference link that takes you to the references section, you can easily click the button to go back to the text.
There is a highlight feature so that you can highlight parts of the text which is convenient when you want to come back and skim the paper later.
There is a “sticky note” feature allowing you to leave a note in part of the paper to explain something.
I was thinking of doing this, but the ChatGPT web app seems to have many features that are only available there and add a lot of value, such as Code Interpreter, PDF uploads, DALL-E, and custom GPTs, so I still use ChatGPT Plus.
Thank you for the blog post. I thought it was very informative regarding the risk of autonomous replication in AIs.
It seems like the Centre for AI Security is a new organization.
I’ve seen the announcement post on its website. Maybe it would be a good idea to cross-post it to LessWrong as well.
Is MIRI still doing technical alignment research as well?
This is a brilliant post, thanks. I appreciate the breakdown of different types of contributors and how orgs have expressed the need for some types of contributors over others.
Thanks for the table, it provides a good summary of the post’s findings. It might also be worthwhile to add it to the EA Forum post.
I think the table should include the $10 million in OpenAI Superalignment fast grants as well.
I think there are some great points in this comment but I think it’s overly negative about the LessWrong community. Sure, maybe there is a vocal and influential minority of individuals who are not receptive to or appreciative of your work and related work. But I think a better measure of the overall community’s culture than opinions or personal interactions is upvotes and downvotes which are much more frequent and cheap actions and therefore more representative. For example, your posts such as Reward is not the optimization target have received hundreds of upvotes, so apparently they are positively received.
LessWrong these days is huge, with probably over 100,000 monthly readers, so I think it’s challenging to summarize its culture in any particular way (e.g. probably most users on LessWrong live outside the Bay Area and maybe even outside the US). I personally find that LessWrong as a whole is fairly meritocratic and not that dogmatic, and that a wide variety of views are supported provided that they are sufficiently well-argued.
In addition to LessWrong, I use some other related sites such as Twitter, Reddit, and Hacker News, and although there may be problems with the discourse on LessWrong, I think it’s generally significantly worse on these other sites. Even today, I’m sure you can find people saying things on Twitter about how AIs can’t have goals or that wanting paperclips is stupid. These kinds of comments wouldn’t be tolerated on LessWrong because they’re ignorant and a waste of time. Human nature can be prone to ignorance, rigidity of opinion, and so on, but I think the LessWrong walled garden has been able to counteract these negative tendencies better than most other sites.
I agree. I don’t see a clear distinction between what’s in the model’s predictive model and what’s in the model’s preferences. Here is a line from the paper “Learning to summarize from human feedback”:
Since the reward model is initialized using the pretrained language model, it should contain everything the pretrained language model knows.
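As a minimal sketch of what that initialization looks like in practice (the backbone here stands in for a pretrained transformer that returns hidden states; this is an illustration, not the paper’s actual code):

```python
# Reward model built on top of a pretrained LM: the backbone's weights (and hence
# its "knowledge") are reused, and only the scalar reward head is newly initialized.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, pretrained_backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = pretrained_backbone           # pretrained language model
        self.reward_head = nn.Linear(hidden_size, 1)  # new scalar head

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Assumes the backbone returns hidden states of shape (batch, seq_len, hidden_size).
        hidden_states = self.backbone(input_ids)
        last_hidden = hidden_states[:, -1, :]             # representation of the final token
        return self.reward_head(last_hidden).squeeze(-1)  # one scalar reward per sequence
```

In this setup, the only new component separating “prediction” from “preference” is the final linear head, which is part of why the distinction feels blurry to me.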