That link is broken for me, did you mean to link to this Lilian Weng tweet?
Rating my AI Predictions
It was covered in Axios, who also link to it as a separate pdf with all 505 signatories.
I’m noticing my confusion about the level of support here. Kara Swisher says that these are 505⁄700 employees, but the OpenAI publication I’m most familiar with is the autointerpretability paper, and none (!) of the core research contributors to that paper signed this letter. Why is a large fraction of the company anti-board/pro-Sam except for 0⁄6 of this team (discounting Henk Tillman because he seems to work for Apple instead of OpenAI)? The only authors on that paper that signed the letter are Gabriel Goh and Ilya Sutskever. So is the alignment team unusually pro-board/anti-Sam, or are the 505 just not that large a faction in the company?
[Editing to add a link to the pdf of the letter, which is how I checked for who signed https://s3.documentcloud.org/documents/24172246/letter-to-the-openai-board-google-docs.pdf ]
I appreciate the joke, but I think that Sam Altman is pretty clearly “the biggest name in AI” as far as the public is concerned. His firing/hiring was the leading story in the New York Times for days in a row (and still is at time of writing)!
I hope this doesn’t lead to everyone sorting into capabilities (Microsoft) vs safety (OpenAI). OpenAI’s ownership structure was designed to preserve safety commitments against race dynamics, but Microsoft has no such obligations, a bad track record (Sydney), and now the biggest name in AI. Those dynamics could lead to talent/funding/coverage going to capabilities unchecked by safety, which would increase my p(doom).
Two caveats:
We don’t know what the Altman/Brockman “advanced AI research team” will actually be doing at Microsoft, and how much independence they’ll have.
According to the new OpenAI CEO Emmett Shear, the split wasn’t due to “any specific disagreement on safety”, but I think that could be the end result.
This is something we’re planning to look into! From the paper:
Future efforts could also try to improve feature dictionary discovery by incorporating information about the weights of the model or dictionary features found in adjacent layers into the training process.
Exactly how to use them is something we’re still working on...
Good question! I started writing and when I looked up I had a half-dozen takes, so sorry if these are rambly. Also let me give the caveat that I wasn’t on the training side of the project so these are less informed than Hoagy, Logan, and Aidan’s views:
+1 to Aidan’s answer.
I wish we could resolve tied vs untied purely via “use whichever makes things more interpretable by metric X”, but right now I don’t think our interpretability metrics are fine-grained and reliable enough to make that decision for us yet.
I expect a lot of future work will ask these architectural questions about the autoencoder, and, as with transformers in general, the field will settle on some guidelines for what works best.
Tied weights are expressive enough to pass the test of “if you squint and ignore the nonlinearity, they should still work”. In particular, (ignoring bias terms) we’re trying to make x ≈ WᵀWx, so we need “WᵀW ≈ I”, and many matrices satisfy WᵀW = I. (There’s a small code sketch of this setup after this list.)
Tied weights certainly make it easier to explain the autoencoder—“this vector was very far in the X direction, so in its reconstruction we add back in a term along the X direction” vs adding back a vector in a (potentially different) Y direction.
Downstream of this, tied weights make ablations make more sense to me. Let’s say you have some input A that activates direction X at a score of 5, so the autoencoder’s reconstruction is A≈ 5X+[other stuff]. In the ablation, we replace A with A-5X, and if you feed A-5X into the sparse autoencoder, the X direction will activate 0 so the reconstruction will be A-5X≈0X+[different other stuff due to interference]. Therefore the only difference in the accuracy of your reconstruction will be how much the other feature activations are changed by interference. But if your reconstructions use the Y vector instead, then when you feed in A-5X, you’ll replace A≈5Y+[other stuff] with A-5X≈0Y+[different other stuff], so you’ve also changed things by 5X-5Y.
If we’re abandoning the tied weights and just want to decompose the layer into any sparse code, why not just make the sparse autoencoder deeper, throw in smooth activations instead of ReLU, etc? That’s not rhetorical, I honestly don’t know… probably you’d still want ReLU at the end to clamp your activations to be positive. Probably you don’t need too much nonlinearity because the model itself “reads out” of the residual stream via linear operations. I think the thing to try here is making the sparse autoencoder architecture as similar to the language model architecture as possible, so that you can find the “real” “computational factors”.
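To make the tied-weights picture and the ablation above concrete, here’s a minimal sketch (in PyTorch, with made-up dimensions; this isn’t the actual training code from the post):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedSparseAutoencoder(nn.Module):
    """Sketch of a sparse autoencoder with tied encoder/decoder weights."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        # One dictionary matrix W: the encoder uses W and the decoder uses W^T,
        # so (ignoring bias) the reconstruction is roughly W^T ReLU(W x).
        self.W = nn.Parameter(torch.randn(n_features, d_model) * 0.01)
        self.b = nn.Parameter(torch.zeros(n_features))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return F.relu(x @ self.W.T + self.b)

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        # Tied: the decoder is just the transpose of the encoder.
        # An untied version would learn a separate decoder matrix here.
        return f @ self.W

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decode(self.encode(x))

def ablate_feature(sae: TiedSparseAutoencoder, x: torch.Tensor, i: int) -> torch.Tensor:
    """Remove feature i from x, as in the A -> A - 5X example above."""
    activation = sae.encode(x)[..., i]   # the "5" in A ≈ 5X + [other stuff]
    direction = sae.W[i]                 # the "X" direction (also the decoder row)
    return x - activation.unsqueeze(-1) * direction
```

With tied weights, the direction you subtract in the ablation is exactly the direction the decoder would have added back; with untied weights the encoder direction X and decoder direction Y can differ, which is the 5X-5Y discrepancy described above.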
Comparing Anthropic’s Dictionary Learning to Ours
Sparse Autoencoders Find Highly Interpretable Directions in Language Models
Incidentally, maybe I missed this in the writeup, but this post is only providing an injective self-attention → MLP construction, right?
Either I’m misunderstanding you or you’re misunderstanding me, but I think I’ve shown the opposite: any MLP layer can be converted to a self-attention layer. (Well, in this post I actually show how to convert the MLP layer to 3 self-attention layers, but in my follow-up I show how you can get it in one.) I don’t claim that you can do a self-attention → MLP construction.
Converting an arbitrary MLP layer to a self-attention layer is presumably doable—at least with enough parameters—but remains unknown
This is what I think I show here! Let the unknown be known!
Unfortunate that the construction is so inefficient: 12 heads → 3,000 heads or 250x inflation is big enough to be practically irrelevant (maybe theoretically too).
Yes, this is definitely at an “interesting trivia” level of efficiency. Unfortunately, the construction is built around using 1 attention head per hidden dimension, so I don’t see any obvious way to improve the number of heads. The only angle I have for this to be useful at current scale is that Anthropic (paraphrased) said “oh we can do interpretability on attention heads but not MLPs”, so the conversion of the latter into the former might supplement their techniques.
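For concreteness, the head-count arithmetic looks roughly like this (assuming GPT-2-small-like sizes, which is where the ~250x figure above comes from):

```python
# Rough head-count arithmetic for the construction, assuming GPT-2-small-like sizes.
n_heads_original = 12   # attention heads per layer in the original model
d_mlp = 3072            # MLP hidden dimension (4 * d_model with d_model = 768)

# The construction uses one attention head per MLP hidden dimension,
# so converting one MLP layer costs d_mlp heads.
n_heads_converted = d_mlp
inflation = n_heads_converted / n_heads_original
print(f"{n_heads_converted} heads, {inflation:.0f}x the original {n_heads_original}")
# -> 3072 heads, 256x the original 12
```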
Calling fully-connected MLPs “feedforward networks” is common (e.g. in the original transformer paper https://arxiv.org/pdf/1706.03762.pdf), so I tried to use that language here for the sake of the transformer-background people. But yes, I think “Attention can implement fully-connected MLPs” is a correct and arguably more accurate way to describe this.
Unsafe AI as Dynamical Systems
AI teams will probably be more superintelligent than individual AIs
[Research Update] Sparse Autoencoder features are bimodal
Thanks for the link! My read is that they describe an architecture where each attention head has some fixed “persistent memory vectors”, and train a model under that architecture. In contrast, I’m showing how one can convert an existing attention+FFN model to an attention-only model (with only epsilon-scale differences in the output).
Explaining “Taking features out of superposition with sparse autoencoders”
I think Yair is saying that the people putting in money randomly is what allows “beat the market” to be profitable. Isn’t the return on beating the market proportional to the size of the market? In which case, if more people put money into the prediction markets suboptimally, this would be a moneymaking opportunity for professional forecasters, and you could get more/better information from the prediction markets.
This might not be the problem you’re trying to solve, but I think if predictions markets are going to break into normal society they need to solve “why should a normie who is somewhat risk-averse, doesn’t enjoy wagering for its own sake, and doesn’t care about the information externalities, engage with prediction markets”. That question for stock markets is solved via the stock market being overall positive-sum, because loaning money to a business is fundamentally capable of generating returns.
Now let me read your answer from that perspective:
users would not bet USD but instead something which appreciates over time or generates income (e.g. ETH, Gold, S&P 500 ETF, Treasury Notes, or liquid and safe USD-backed positions in some DeFi protocol)
Why not just hold Treasury Notes or my other favorite asset? What does the prediction market add?
use funds held in the market to invest in something profit-generating and distribute part of the income to users
Why wouldn’t I just put my funds directly into something profit-generating?
positions are used to receive loans, so you can free your liquidity from long (timewise) markets and use it to e.g. leverage
I appreciate that less than 100% of my funds will be tied up in the prediction market, but why tie up any?
The practical problem is that the zero-sum monetary nature of prediction markets disincentivises participation (especially in year+ long markets) because on average it’s more profitable to invest in something else (e.g. S&P 500). It can be solved by allowing people to bet other assets, so people would bet their S&P 500 shares and on average get the same expected value, so it would no longer be disincentivising.
But once I have an S&P 500 share, why would I want to put it in a prediction market (again, assuming I’m a normie who is somewhat risk-averse, etc)?
Surely, they would be more interested if they had free loans (of course they are not going to be actually free, but they can be much cheaper than ordinary uncollateralized loans).
So if I put $1000 into a prediction market, I can get a $1000 loan (or a larger loan using my $1000 EV wager as collateral)? But why wouldn’t I just get a loan using my $1000 cash as collateral?
Overall I feel you’ve listed several mechanisms that mitigate potential downsides of prediction markets, but they still pull in a negative direction, and there’s no solid upside for a regular person who doesn’t want to wager money for wagering’s sake, doesn’t think they can beat the market, and is somewhat risk-averse (which I think describes a huge portion of the public).
Also, there are many cases where positive externalities can be beneficial for some particular entity. For example, an investment company may want to know about the risk of a war in a particular country to decide if they want to invest in the country or not. In such cases, the company can provide rewards for market participants and make it a positive-sum game for them even from the monetary perspective.
This I see as workable, but runs into a scale issue and the tragedy of the commons. Let’s make up a number and say the market needs a 1% return on average to make it worthwhile after transaction fees, time investment, risk, etc. Then $X of incentive could motivate $100X of prediction market. But I think the issue of free-riders makes it very hard to scale X so that $100X ≈ [the stock market].
Overall, in order to make prediction markets sustainably large, I feel like you’d need some way to internalize the positive information externalities generated by them. I think most prediction markets are not succeeding at that right now (judging from them not exploding in popularity), but maybe there would be better monetization options if they weren’t basically regulated out of existence.
That’s all correct