AI Alignment Breakthroughs this Week [new substack]
I am thinking of doing a weekly substack where I post the most interesting AI Alignment breakthroughs each week.
Because I don’t have a lot of time to devote to this, for now it’s mostly just going to be a list of links to things I find interesting.
If this is something you would find useful, please subscribe or otherwise let me know.
I will post 1x/week for the next month (hopefully every Sunday). If by the end of that time I haven’t gotten at least 10 “this is helpful” reactions or subscriptions, I will probably stop.
Here is the first week’s post [cross-posted from substack]
AI Alignment Breakthroughs this Week (10/01)
Each week, I am going to try to highlight some of the top breakthroughs in AI alignment.
Since this is the first week, I will briefly explain each section (and why it is related to AI alignment). In future weeks, if I add a section, I will make a note of it.
The sections in this week’s breakthroughs are:
Math. Teaching AI to do math is on the critical path for many AI alignment proposals, such as Max Tegmark and Steve Omohundro’s “Provably Safe Systems”.
Brain-computer interfaces. BCIs are one path (notably promoted by Elon Musk) by which humans might be able to maintain control over AI.
AI Agents. Using teams of AI agents that communicate in a human-readable way is one proposal for scaling AI safely.
Making AI Do What You Want. Teaching AI to follow instructions correctly is on the critical path for almost all AI alignment proposals.
Explainability. Making AI accurately explain what it is thinking is considered to be a central problem in many AI Alignment strategies.
Mechanistic Interpretability. The ability to peer inside the “black box” that is AI and understand what it is thinking is likely to be useful for many alignment strategies
AI Art. This section is just for fun, but I have found that many AI Art techniques are closely related to AI alignment. One reason is that getting the AI to understand human feelings and desires is at the heart of the AI Art movement. AI Art is also a relatively “harmless” way to explore cutting-edge AI capabilities.
Math
BoolFormer
https://twitter.com/IntuitMachine/status/1706269694645190775
MetaMath
https://twitter.com/jon_durbin/status/1706301840873115981
Brain Computer Interface
Thought to Text
https://twitter.com/WillettNeuro/status/1694386988236038324
AI Agents
RECONCILE (multi-agent AI framework)
https://twitter.com/IntuitMachine/status/1706408449100173572
Making AI Do What You Want
Fixing Improper Binding
https://twitter.com/RoyiRassin/status/1670112343110430721
Small-Scale Proxies for Large Transformers
https://twitter.com/_akhaliq/status/1706564947931521292
LongLoRA
https://twitter.com/ItakGol/status/1705885984741523821
Explainability
Autonomous Driving with Chain of Thought
https://twitter.com/DrJimFan/status/1702718067191824491
Mechanistic Interpretability
AI Lie Detector
https://twitter.com/OwainEvans_UK/status/1707451418339377361
ViTs Need Registers
https://twitter.com/TimDarcet/status/1707769575981424866
GPT-3 Can Play Chess (somewhat)
https://twitter.com/xlr8harder/status/1706713544350191909
Exploring Alignment in Diffusion Models
https://twitter.com/MokadyRon/status/1706618451664474148
Neural networks can be approximated by a 2-hidden-layer shallow network
https://twitter.com/ChombaBupe/status/1705975443541667992
Training GPT to win at Tic-tac-toe
https://twitter.com/PhillipHaeusler/status/1705919170154840438
Mechanistic Interpretation of Whisper
https://twitter.com/mayfer/status/1706188593579069753
FreeU
https://twitter.com/_akhaliq/status/1704721496122266035
Does A=B imply B=A?
https://twitter.com/OwainEvans_UK/status/1705285631520407821
AI Art
DALL-E 3 is now in Bing
https://twitter.com/generatorman_ai/status/1708163231389499827
Instant Lora
https://twitter.com/NerdyRodent/status/1708204716943921239
Automated line art tweening
https://twitter.com/thibaudz/status/1707733015663653167
Dream Gaussian
https://twitter.com/camenduru/status/1707571698961186964
VoiceLDM
https://twitter.com/nearcyan/status/1707524190167867833
Generative Repainting
https://twitter.com/_akhaliq/status/1706847413325996071
The Spiral
https://twitter.com/sergeykarayev/status/1708508857100861739
Camera Movement for AnimateDiff
https://twitter.com/CeyuanY/status/1706149343752048640
I really love this idea to combat echo-chamber stuff, but I think you’ve got to have more descriptions, even if brief. Right now you have to click through to see what most things are about, which really limits the utility imo.
Agreed, that would be better. I think just links are useful too, though, in case adding more context significantly increases the workload in practice.
In general, it seems interesting to track the state of AI-alignment techniques, and how different ideas develop!
I strongly suggest not using the term “Breakthrough” so casually, in order to avoid unnecessary hype. It’s unclear we had any alignment breakthrough so far, and talking about “weekly breakthroughs” seems absurd at best.
I don’t think the word “breakthrough” is reserved exclusively for “we have solved the AI alignment problem in its entirety”. I am using it here to mean “this is a significant new discovery that advances the state of the art”.
If you don’t think there are weekly breakthroughs in AI, you haven’t been paying attention to AI.
It sounds like what you call a breakthrough, I’d just call a “result”. In my understanding, it’d either have to open up an unexpected + promising new direction, or solve a longstanding problem in order to be considered a breakthrough.
Unfortunately, significant insights into alignment seem much rarer than “capabilities breakthroughs” (which are probably also more due to an accumulation of smaller insights, so even there one might simply say the field is moving fast)
This is great! Thank you for doing this! Might add some of these to ai-plans.com!
Cool site!
It doesn’t look like there’s a button for “Add a strength” on e.g. https://ai-plans.com/post/f180b51d7e6a (although it appears possible to do so if I click the “show post” button).
I also wish there were some way to understand the depth/breadth of plans. E.g. is this a “full alignment plan” (examples would be The Plan or Provably Safe Systems) or is this a narrow technical research direction (e.g. this post)?
Ideally, there would be some kind of prediction market style mechanism that assigned “dignity points” to plans that were most likely to contribute significantly to AI Alignment.
This is great! Thanks for sharing. I hope you continue to do these.