Software Engineer interested in AI and AI safety.
Stephen McAleese
| Source | Estimated AI safety funding in 2024 | Comments |
|---|---|---|
| Open Philanthropy | $63.6M | |
| SFF | $13.2M | Total for all grants was $19.86M. |
| LTFF | $4M | Total for all grants was $5.4M. |
| NSF SLES | $10M | |
| AI Safety Fund | $3M | |
| Superalignment Fast Grants | $9.9M | |
| FLI | $5M | Estimated from the grant programs announced in 2024; they don’t yet have a 2024 grant summary like the one for 2023, so this figure is uncertain. |
| Manifund | $1.5M | |
| Other | $1M | |
| Total | $111.2M | |

Today I did some analysis of the grant data from 2024 and came up with the figures in the table above. I also updated the spreadsheet and the graph in the post to use the updated values.
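As a quick sanity check on the arithmetic, here’s a minimal sketch that reproduces the total from the per-funder estimates in the table (values in $M):

```python
# Rough check of the 2024 AI safety funding estimates from the table above (values in $M).
funding_estimates = {
    "Open Philanthropy": 63.6,
    "SFF": 13.2,
    "LTFF": 4.0,
    "NSF SLES": 10.0,
    "AI Safety Fund": 3.0,
    "Superalignment Fast Grants": 9.9,
    "FLI": 5.0,
    "Manifund": 1.5,
    "Other": 1.0,
}

total = sum(funding_estimates.values())
print(f"Estimated total 2024 AI safety funding: ${total:.1f}M")  # ~$111.2M
```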
The new book Introduction to AI Safety, Ethics and Society by Dan Hendrycks is on Spotify as an audiobook if you want to listen to it.
I’ve added a section called “Social-instinct AGI” under the “Control the thing” heading similar to last year.
This is brilliant work, thank you. It’s great that someone is working on these topics and they seem highly relevant to AGI alignment.
One intuition for why a neuroscience-inspired approach to AI alignment seems promising is that a similar strategy apparently worked for AI capabilities: the neural network researchers from the 1980s who tried to copy how the brain works using deep learning were ultimately the most successful at building highly intelligent AIs (e.g. GPT-4), while more synthetic approaches (e.g. pure logic) were less successful.
Similarly, we already know that the brain has the capacity to represent and be directed by human values, so arguably the shortest path to succeeding at AI alignment is to understand and replicate the brain’s circuitry underlying human motivations and values in AIs.
The only other AI alignment research agenda I can think of that follows a similar strategy is Shard Theory, though it seems more high-level and more related to RL than to neuroscience.
One prediction I’m interested in that’s related to o3 is how long it will take for an AI to achieve a superhuman Elo rating on Codeforces.
OpenAI claims that o3 achieved a Codeforces Elo rating of 2727, which is in the 99.9th percentile, but the best human competitor in the world currently has a rating of 3985. If an AI could reach a rating of 4000 or more, it would be the best entity in the world at competitive programming, and that would be the “AlphaGo moment” for the field.
Will an AI model achieve superhuman Codeforces ELO by the end of 2025?
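To get a feel for what that rating gap means, here’s a minimal sketch using the standard chess-style Elo expected-score formula (an assumption on my part; Codeforces’ rating system is only roughly Elo-like):

```python
# Expected score of the lower-rated player under the standard Elo model.
# Assumes Codeforces ratings behave roughly like chess Elo, which is an approximation.
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (roughly, win probability) of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

print(elo_expected_score(2727, 3985))  # ~0.0007: a 2727-rated player would almost always lose head-to-head
```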
Interesting argument. I think your main point is that AIs could achieve similar outcomes to current society, and therefore be aligned with humanity’s goals, by being perfect replacements for individual humans and then gradually replacing all humans in an organization or the world. This also seems like an argument in favor of current AI practices such as pre-training on a next-word prediction objective on internet text followed by supervised fine-tuning.
That said, I noticed a few limitations of this argument:
- Possibility of deception: As jimrandomh mentioned earlier, a misaligned AI might be incentivized to behave identically to a helpful human until it can safely pursue its true objective. Therefore this alignment plan seems to require AIs that are not too prone to deception.
- Generalization: An AI might behave exactly like a human in situations similar to its training data but not generalize sufficiently to out-of-distribution scenarios. For example, the AIs might behave similarly to humans in typical situations but diverge from human norms once they become superintelligent.
- Emergent properties: The AIs might be perfect human substitutes individually but, when acting as a group, produce unexpected emergent behavior that can’t easily be foreseen in advance. To use an analogy, adding grains of sand to a pile one by one seems stable until the pile collapses in a mini-avalanche.
Shallow review of technical AI safety, 2024
Excellent post, thank you. I appreciate your novel perspective on how AI might affect society.
I feel like a lot of LessWrong-style posts follow the theme of “AGI is created and then everyone dies”, which is an important possibility but one that might lead to other possibilities being neglected.
This post, in contrast, explores a range of scenarios and describes a mainline scenario that seems like a straightforward extrapolation of trends we’ve seen unfolding over the past several decades.
I think this post is really helpful and has clarified my thinking about the different levels of AI alignment difficulty. It seems like a unique post with no historical equivalent, making it a major contribution to the AI alignment literature.
As you point out in the introduction, many LessWrong posts provide detailed accounts of specific AI risk threat models or worldviews. However, since each post typically explores only one perspective, readers must piece together insights from different posts to understand the full spectrum of views.
The new alignment difficulty scale introduced in this post offers a novel framework for thinking about AI alignment difficulty. I believe it is an improvement over the traditional ‘P(doom)’ approach, which requires individuals to spontaneously think of several different possibilities, a mentally taxing exercise. Additionally, reducing one’s perspective to a single number may oversimplify the issue and discourage nuanced thinking.
In contrast, the ten-level taxonomy gives the reader concrete descriptions of ten scenarios, each involving alignment problems of varying difficulty. This framework encourages readers to consider a diverse range of scenarios and problems when thinking about the difficulty of the AI alignment problem. By assigning probabilities to each level, readers can construct a more comprehensive and thoughtful view of alignment difficulty, which encourages deeper engagement with the problem.
The new taxonomy may also foster common understanding within the AI alignment community and serve as a valuable tool for facilitating high-level discussions and resolving disagreements. Additionally, it proposes hypotheses about the relative effectiveness of different AI alignment techniques which could be empirically tested in future experiments.
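As an illustration of the probability-assignment idea above, here’s a minimal sketch; the probabilities are made up purely for demonstration and are not an actual forecast:

```python
# Illustrative only: assign a probability to each of the ten difficulty levels
# instead of collapsing one's view into a single P(doom) number.
difficulty_probs = dict(zip(range(1, 11),
    [0.05, 0.10, 0.15, 0.20, 0.15, 0.10, 0.10, 0.05, 0.05, 0.05]))

assert abs(sum(difficulty_probs.values()) - 1.0) < 1e-6  # probabilities should sum to 1

expected_level = sum(level * p for level, p in difficulty_probs.items())
prob_very_hard = sum(p for level, p in difficulty_probs.items() if level >= 8)

print(f"Expected difficulty level: {expected_level:.1f}")
print(f"P(difficulty level 8+):    {prob_very_hard:.2f}")
```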
Here’s a Facebook post by Yann LeCun from 2017 which has a similar message to this post and seems quite insightful:
My take on Ali Rahimi’s “Test of Time” award talk at NIPS.
Ali gave an entertaining and well-delivered talk. But I fundamentally disagree with the message.
The main message was, in essence, that the current practice in machine learning is akin to “alchemy” (his word).
It’s insulting, yes. But never mind that: It’s wrong!
Ali complained about the lack of (theoretical) understanding of many methods that are currently used in ML, particularly in deep learning.
Understanding (theoretical or otherwise) is a good thing. It’s the very purpose of many of us in the NIPS community.
But another important goal is inventing new methods, new techniques, and yes, new tricks.
In the history of science and technology, the engineering artifacts have almost always preceded the theoretical understanding: the lens and the telescope preceded optics theory, the steam engine preceded thermodynamics, the airplane preceded flight aerodynamics, radio and data communication preceded information theory, the computer preceded computer science.
Why? Because theorists will spontaneously study “simple” phenomena, and will not be enticed to study a complex one until there is a practical importance to it.
Criticizing an entire community (and an incredibly successful one at that) for practicing “alchemy”, simply because our current theoretical tools haven’t caught up with our practice is dangerous.
Why dangerous? It’s exactly this kind of attitude that led the ML community to abandon neural nets for over 10 years, *despite* ample empirical evidence that they worked very well in many situations.
Neural nets, with their non-convex loss functions, had no guarantees of convergence (though they did work in practice then, just as they do now). So people threw the baby out with the bath water and focused on “provable” convex methods or glorified template matching methods (or even 1957-style random feature methods).
Sticking to a set of methods just because you can do theory about it, while ignoring a set of methods that empirically work better just because you don’t (yet) understand them theoretically is akin to looking for your lost car keys under the street light knowing you lost them someplace else.
Yes, we need better understanding of our methods. But the correct attitude is to attempt to fix the situation, not to insult a whole community for not having succeeded in fixing it yet. This is like criticizing James Watt for not being Carnot or Helmholtz.
I have organized and participated in numerous workshops that bring together deep learners and theoreticians, many of them hosted at IPAM. As a member of the scientific advisory board of IPAM, I have seen it as one of my missions to bring deep learning to the attention of the mathematics community. In fact, I’m co-organizer of such a workshop at IPAM in February 2018 ( http://www.ipam.ucla.edu/.../new-deep-learning-techniques/ ).
Ali: if you are not happy with our understanding of the methods you use everyday, fix it: work on the theory of deep learning, instead of complaining that others don’t do it, and instead of suggesting that the Good Old NIPS world was a better place when it used only “theoretically correct” methods. It wasn’t.
He describes how engineering artifacts often precede theoretical understanding and that deep learning worked empirically for a long time before we began to understand it theoretically. He says that researchers ignored deep learning because it didn’t fit into their existing models of how learning should work.
I think the high-level lesson from the Facebook post is that street-lighting occurs when we try to force reality to be understood in terms of our existing models of how it should work (incorrect models like phlogiston are common in the history of science), whereas this LessWrong post argues that street-lighting occurs when researchers have a bias towards working on easier problems.
Instead, a better approach is to let reality and evidence dictate how we build our models of the world, even if those more correct models are more complex or require major departures from existing models (which creates a temptation to ‘flinch away’). I think a prime example of this is quantum mechanics: my understanding is that physicists noticed bizarre results from experiments like the double-slit experiment and developed new theories (e.g. wave-particle duality) that described reality well even if they were counterintuitive or novel.
I guess the modern equivalent that’s relevant to AI alignment would be Singular Learning Theory which proposes a novel theory to explain how deep learning generalizes.
Here is a recent blog post by Hugging Face explaining how to make an o1-like model using open-weight models like Llama 3.1.
Why? Because o1 is much more capable than GPT-4o at math, programming, and science.
Here’s an argument for why current alignment methods like RLHF are already much better than what evolution can do.
Evolution has to encode information about the human brain’s reward function using only about 1 GB of genetic information, which means it might rely on a lot of simple heuristics that don’t generalize well, like “sweet foods are good”.
In contrast, RLHF reward models are built from LLMs with around 25B[1] parameters, which is roughly 100 GB of information. The capacity of these reward models to encode complex human values may therefore already be much larger than that of the human genome (by about two orders of magnitude), and this advantage will probably grow as models get larger.
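Here’s the back-of-the-envelope arithmetic behind that comparison, as a rough sketch; the assumptions (about 3.1B base pairs at 2 bits each, and 4 bytes per reward-model parameter) are order-of-magnitude only:

```python
# Back-of-the-envelope comparison of information capacity (order-of-magnitude only).
# Assumptions: ~3.1B base pairs at 2 bits each for the genome, and 4 bytes (fp32)
# per reward-model parameter, which matches the ~100 GB figure above.
GB = 1e9  # bytes

genome_bytes = 3.1e9 * 2 / 8   # ~0.78 GB
reward_model_bytes = 25e9 * 4  # ~100 GB

print(f"Genome:       ~{genome_bytes / GB:.1f} GB")
print(f"Reward model: ~{reward_model_bytes / GB:.0f} GB")
print(f"Ratio:        ~{reward_model_bytes / genome_bytes:.0f}x")  # roughly two orders of magnitude
```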
One thing I’ve noticed is that current models like Claude 3.5 Sonnet can now generate non-trivial 100-line programs like small games that work in one shot and don’t have any syntax or logical errors. I don’t think that was possible with earlier models like GPT-3.5.
I donated $100, roughly equivalent to my yearly spending on Twitter/X Premium, because I believe LessWrong offers similar value. I would encourage most readers to do the same.
Update: I’ve now donated $1,000 in total for philanthropic reasons.
If you’re interested in doing a PhD in AI in the UK, I recommend applying for the Centres for Doctoral Training (CDTs) in AI such as:
CDT in Responsible and Trustworthy in-the-world NLP (University of Edinburgh)
CDT in Practice-Oriented Artificial Intelligence (University of Bristol)
CDT in Statistics and Machine Learning (University of Oxford)
Note that these programs are competitive so the acceptance rate is ~10%.
I agree. I don’t see a clear distinction between what’s in the model’s predictive model and what’s in the model’s preferences. Here is a line from the paper “Learning to summarize from human feedback”:
“To train our reward models, we start from a supervised baseline, as described above, then add a randomly initialized linear head that outputs a scalar value. We train this model to predict which summary y ∈ {y0, y1} is better as judged by a human, given a post x.”
Since the reward model is initialized using the pretrained language model, it should contain everything the pretrained language model knows.
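Here’s a minimal PyTorch-style sketch of the setup described in that quote; the class, argument names, and assumed backbone output shape are illustrative rather than the paper’s actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Illustrative sketch: a reward model built from a pretrained LM backbone
    plus a randomly initialized linear head that outputs a scalar."""

    def __init__(self, pretrained_lm: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = pretrained_lm                 # keeps everything the pretrained LM knows
        self.reward_head = nn.Linear(hidden_size, 1)  # randomly initialized scalar head

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Assumes the backbone returns hidden states of shape (batch, seq_len, hidden_size).
        hidden_states = self.backbone(input_ids)
        last_token = hidden_states[:, -1, :]
        return self.reward_head(last_token).squeeze(-1)  # one scalar reward per sequence

def preference_loss(reward_better: torch.Tensor, reward_worse: torch.Tensor) -> torch.Tensor:
    """Pairwise comparison loss: train the head to rank the human-preferred summary higher."""
    return -F.logsigmoid(reward_better - reward_worse).mean()
```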
I strong upvoted as well. This post is thorough and unbiased and seems like one of the best resources for learning about representation engineering.
I’ll use the definition of optimization from Wikipedia: “Mathematical optimization is the selection of a best element, with regard to some criteria, from some set of available alternatives”.
Best-of-n or rejection sampling is an alternative to RLHF which involves generating several responses from an LLM and returning the one with the highest reward model score. I think it’s reasonable to describe this process as optimizing for reward because it’s searching for LLM outputs that achieve the highest reward from the reward model.
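As a minimal sketch of that procedure (the `generate` and `reward` callables here are assumed interfaces, not from any particular library):

```python
from typing import Callable, List

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],       # samples one response from the LLM (assumed interface)
    reward: Callable[[str, str], float],  # reward model score for (prompt, response) (assumed interface)
    n: int = 16,
) -> str:
    """Sample n candidate responses and return the one the reward model scores highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: reward(prompt, response))
```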
I’d also argue that AlphaGo/AlphaZero is optimizing for reward. In the AlphaGo paper it says, “At each time step $t$ of each simulation, an action $a_t$ is selected from state $s_t$ so as to maximize action value plus a bonus”, and the formula is $a_t = \operatorname{argmax}_a \left( Q(s_t, a) + u(s_t, a) \right)$, where $u(s_t, a)$ is an exploration bonus.
Action values Q are calculated as the mean value (estimated probability of winning) of all board states in the subtree below an action. The value of each possible future board state is calculated using a combination of a value function estimation for that state and the mean outcome of dozens of random rollouts until the end of the game (return +1 or −1 depending on who wins).
The value function predicts the return (expected sum of future rewards) from a position, whereas the random rollouts calculate the actual average reward by simulating future moves until the end of the game, when the reward function returns +1 or −1.
So I think AlphaZero is optimizing for a combination of predicted reward (from the value function) and actual reward which is calculated using multiple rollouts until the end of the game.
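Here’s a minimal sketch of those two pieces as I understand them from the AlphaGo paper; the function signatures, names, and constants are simplified illustrations, not the paper’s actual code:

```python
import math
from typing import Dict

def select_action(Q: Dict[str, float], N: Dict[str, int], P: Dict[str, float],
                  c_puct: float = 1.0) -> str:
    """Pick the action maximizing Q(s, a) + u(s, a), where the exploration bonus u
    is proportional to the policy prior and shrinks as the action is visited more."""
    total_visits = sum(N.values())
    def u(a: str) -> float:
        return c_puct * P[a] * math.sqrt(total_visits) / (1 + N[a])
    return max(Q, key=lambda a: Q[a] + u(a))

def leaf_value(value_net_estimate: float, rollout_outcome: float, lam: float = 0.5) -> float:
    """AlphaGo-style leaf evaluation: mix the value network's prediction with the
    +1/-1 outcome of a random rollout played to the end of the game."""
    return (1 - lam) * value_net_estimate + lam * rollout_outcome
```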
Upvoted. I thought this was a really interesting and insightful post. I appreciate how it tackles multiple hard-to-define concepts all in the same post.