[edit: took out naming controversy stuff, as it was distracting from the point of the blog]
I am introducing a new rating system for each alignment breakthrough. The rating system will go from 1 star ⭐ to 5 stars ⭐⭐⭐⭐⭐.
A 1 star ⭐ “breakthrough” represents incremental progress. This means that, while it technically achieves a new milestone, the breakthrough was the result of known techniques and could easily have been predicted in advance by an expert in the field. An example of something I’ve posted in the past that should be considered 1 star ⭐ is Würstchen V3. An example of a hypothetical 1 star ⭐ breakthrough would be if RLHF on GPT-5 was found to work better than RLHF on GPT-4.
A 5 star ⭐⭐⭐⭐⭐ “breakthrough” represents a discovery that solves a significant unsolved problem that was considered to be a major obstacle on at least one AI Alignment path. An example of a 5 star ⭐⭐⭐⭐⭐ breakthrough that I’ve posted in the past would be neuron superposition. An example of a hypothetical 5 star ⭐⭐⭐⭐⭐ breakthrough would be if someone were to develop a system that could translate a human-language description of a math problem into a formal mathematical proof.
Now, without any further ado…
AI Alignment Breakthroughs this Week
This week there were breakthroughs in the areas of:
AI Evaluation
AI Agents
Mechanistic Interpretability
Explainable AI
Simulation
Making AI Do what we want
AI Art
AI Evaluation
What is it: a new benchmark for multi-modal decision making
What is new: evaluate multimodal models (like GPT-4V) by their ability to make decisions in different domains
What is it good for: Benchmarking is key for many AI safety strategies such as the Pause and RSPs
Rating: ⭐⭐
AI Agents
Adapting LLM Agents Through Communication
What is it: Improved AI agents
What is new: By fine-tuning the LLM, the agents can perform better
What is it good for: Factored Cognition, Bureaucracy of AIs
Rating: ⭐
Mechanistic Interpretability
Research on infinite-width neural networks
What is it: research showing the behavior of infinite (large number of) parameter LLMs
What’s new: a specific map showing when NNs will under/overtrain
What is it good for: determining the stability of neural networks as they scale up
Rating: ⭐⭐⭐
Reverse-engineering LLM components
What is it: research to understand LLM components
What’s new: discovery of “copy suppression” heads that prevent the LLM from repeating the input
What is it good for: understanding how LLMs work gives us better tools to trust/control them
Rating: ⭐⭐⭐
RLHF impacts on output diversity
What is it: Research showing how RLHF training on a model reduces output diversity
What’s new: seems to confirm anecdotal findings that RLHF reduces diversity
What is it good for: Understanding how alignment affects model outputs
Rating: ⭐
What it is: a simple method to improve long generations with LLMs
What’s new: they discover that the first 4 tokens act as “attention sinks” and that keeping them improves LLM outputs
What is it good for: This strongly reminds me of this research, which allowed us to get much better attention maps from ViTs
Rating: ⭐⭐⭐
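To make the “attention sink” idea concrete, here is a minimal sketch of the cache policy it implies: always keep the first few token positions, plus a sliding window of the most recent ones, and drop everything in between. The function name, the window size of 8, and the list-of-positions representation are all my own illustrative assumptions, not the paper's code; only the figure of 4 sink tokens comes from the summary above.

```python
def kept_positions(seq_len, num_sink=4, window=8):
    """Return the token positions retained in the cache.

    Hypothetical helper: keeps the first `num_sink` positions (the
    "attention sinks") plus the most recent `window` positions, and
    evicts everything in between.
    """
    if seq_len <= num_sink + window:
        # Nothing needs to be evicted yet.
        return list(range(seq_len))
    sinks = list(range(num_sink))
    recent = list(range(seq_len - window, seq_len))
    return sinks + recent


print(kept_positions(20))
```

With a 20-token sequence this keeps positions 0 to 3 and 12 to 19. A plain sliding window would keep only the recent positions; the difference here is that the initial tokens are never evicted.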
Explainable AI
What is it: A new multi-modal-LLM that “grounds” its explanations in the images it is shown.
What is new: By adding the ability to point to specific parts of the image, the human can ask more detailed questions and the LLM can better explain its answers.
What is it good for: Having AI explain why it did something is a way to avoid the infamous (and possibly apocryphal) “sunny tanks” problem.
Rating: ⭐⭐
What it is: Teach LLMs rules to reduce hallucinations
What’s new: two-stage approach where LLM first proposes rules and then applies them
What it is good for: In addition to being reliable, rules learned by LLMs should hopefully be more explainable as well.
Rating: ⭐⭐⭐ (new technique, much better results, promising research direction)
Simulation
What is it: Use Grand Theft Auto as an environment to test MLLM agents.
What is the breakthrough: extends the idea of using GPT-4 in Minecraft. The use of a vision-LLM is new, and GTA-V should have a richer set of actions.
What is it good for: Training AIs in simulated environments is a form of sandboxing. Although GTA-V wouldn’t be my first choice if you were trying to raise friendly AI.
Rating: ⭐
Making AI Do what we want
What it is: method for verifying truthfulness of LLM outputs
What’s new: They break each response into factors, which are individually verified
What is it good for: the ability to verify factual outputs is useful for most alignment plans
Rating: ⭐⭐⭐ (new technique, much better results, promising research direction)
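As a rough illustration of what “break each response into factors which are individually verified” could look like, here is a toy sketch. Everything in it is an assumption on my part: the function names are hypothetical, the sentence split stands in for whatever decomposition the method really uses, and `check_claim` is a placeholder for a real verifier (a model call, a retrieval lookup, etc.).

```python
def split_into_claims(response):
    # Naive sentence split, standing in for a model-driven decomposition.
    return [s.strip() for s in response.split(".") if s.strip()]


def verify_response(response, check_claim):
    """Check each claim on its own and report per-claim results.

    `check_claim` is a caller-supplied function (hypothetical here) that
    returns True when a single claim is supported.
    """
    results = {claim: check_claim(claim) for claim in split_into_claims(response)}
    return results, all(results.values())


# Toy verifier: a claim passes only if it appears in a known-facts set.
known_facts = {"Paris is in France"}
results, all_ok = verify_response(
    "Paris is in France. Paris is in Spain.",
    lambda claim: claim in known_facts,
)
print(results, all_ok)
```

The point of checking claims separately is that one false statement gets flagged on its own, rather than being averaged away inside an otherwise plausible response.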
What it is: improve learning with limited data
What’s new: by looking at data across RL episodes, they can improve policy training
What is it good for: making the best use of limited data reduces the chance of AI doing something wrong.
Rating: ⭐⭐
What is it: a modification to RLHF to prevent overfitting
What’s new: They weight the Reward Model to make sure it is only used in the region where it is effective
What is it good for: Prevent Goodharting
Rating: ⭐⭐⭐⭐
AI Art
What is it: A method for converting a normal video into a “4d” movie you can view from any angle
What is the breakthrough: By using Gaussian Splatting, they get much better quality and speed than previous methods for doing this
What is it good for: Cool matrix-style shots. Video games probably.
Rating: ⭐⭐ (big leap in quality, but mostly an application of a known method to a new problem)
What is it: Pretty movie generator
What is new: Deforum is one of the OG AI video methods; this just applies it to a new model
What is it good for: making cool movies
Rating: ⭐
What is it: a way to avoid reproducing training images in diffusion models
What’s new: they mask the training images to prevent reproduction
What is it good for: reducing copyright concerns when using diffusion models.
Rating: ⭐⭐
What is it: separate motion and subject in text-to-video models
What’s new: they train a dual-path LoRA on an individual video to extract motion
What is it good for: Transfer the motion from one video to another
Rating: ⭐⭐⭐⭐
This is Not AI Alignment
GPT-4v 🔃 Dall-E 3 (https://twitter.com/conradgodfrey/status/1712564282167300226)
What is it: A fun graphic showing what happens when we repeatedly iterate complex systems.
What does it mean: There was a fun back-and-forth where it was speculated “this is how we die”, which was quickly refuted. I think this perfectly demonstrates the need for empiricism in AI Alignment.
What is it: “secret” messages can be passed via image to deceive the user.
What does this mean: Everyone expected to find image jailbreaks. And we did. Good job everyone.
an ironic twist on calling it “wokism” is that “tone policing” originated as a term from social justice thinking as a critique against attempts to silence or trivialize marginalized voices based on the way they express their concerns. far be it from me to say you can’t, but it’s certainly an interesting choice of words.
if you aren’t interested in attempting to use words in a way that makes their meanings accurate from ambient context, then yes, sometimes definitions work, but if you won’t agree with the meaning of a word that is typically used to hype something as progress on a technical topic because you want to use it for less impactful things, the typical word for that is “clickbait”.
the things you list are interesting, and I do not intend to diss them at all. my critique being only about your word choice is because that’s all I want to critique. proceed according to taste.
Yes. I am 100% attempting to do tone policing. The tone-policing I am hoping to enforce is “stop arguing about definitions and please talk about technical aspects of AI alignment”. Given the comments on this post (currently 4 about hating the name and 0 about technical alignment issues) I have utterly failed. 🤷‍♂️
I’m not sure why this one is currently downvoted, but I found the last week’s overview pretty helpful. Whenever I find something interesting e.g. running A/B testing on infant audio data or deriving speech/language/verbal thoughts from magnetoencephalogram data alone, I still don’t know whether it will replicate in 2023 or 2026 or even not at all from that particular physiological data source, or whether it would probably replicate and the engineers in question were just incompetent or had insufficient data security.
The star ratings are an improvement; I had also felt that “breakthrough” was overselling many of the items last week.
However, stars are very generic and don’t capture the concept of a breakthrough very well. You could consider a lightbulb.
I also asked chatgpt to create an emoji of an AI breakthrough, and after some iteration it came up with this: https://photos.app.goo.gl/sW2TnqDEM5FzBLdPA
Use it if you like it!
Thanks for putting together this roundup, I learn things from it every time.
yea, as expected I don’t like the name, but the review is great, so I guess it’s net positive )