I really appreciate your post and all the links! This and your other recent posts/comments have really helped make a clearer model of timelines.
Logan Riggs
In my experience, most of the general public will verbally agree that AI X-risk is a big deal, but then go about their day (because, reasonably, they have no power). There’s no obvious social role or action to take in response.
For climate, people understand that they should recycle, not leave the water running, and if there’s a way to donate to clean the ocean on a Mr. Beast video, then some will even donate (sadly, none of these are very effective at solving the climate problem though! Gotta avoid that failure mode in our case).
Having a clear call-to-action seems relevant. For example, educating the public about AI taking jobs for the purpose of building support for UBI. It’s then clear what to communicate and what the call-to-action is.
I’d be curious to hear what you think the ask should be.
Alternatively, you could argue that generally informing folks on a wide scale about the risks involved will then allow the general public to do what they believe is locally best. This could involve a documentary or a realistic movie.
Claude 3.5 seems to understand the spirit of the law when pursuing a goal X.
A concern I have is that future training procedures will incentivize more consequentialist reasoning (because it achieves higher reward). This might be obvious or foreseeable, but could be missed/ignored under racing pressure, or when labs’ LLMs are implementing all the details of research.
Thanks!
I forgot about faithful CoT and definitely think that should be a “Step 0”. I’m also concerned here that AGI labs just won’t do the reasonable things (e.g. training for brevity, which makes the CoT more steganographic).
For Mech-interp, ya, we’re currently bottlenecked by:
Finding a good enough unit-of-computation (which would enable most of the higher-guarantee research)
Computing Attention_in → Attention_out (Keith got the QK-circuit → attention-pattern step working a while ago, but it hasn’t been hooked up w/ the OV-circuit yet)
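For anyone following along: the map from attention inputs to outputs factors into the QK-circuit (which fixes the attention pattern) and the OV-circuit (which fixes what gets moved). A minimal single-head numpy sketch, with made-up shapes and random weights:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq, d_model, d_head = 4, 8, 2

X = rng.normal(size=(seq, d_model))   # residual-stream inputs
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))

# QK-circuit: attention pattern from queries and keys
scores = (X @ W_Q) @ (X @ W_K).T / np.sqrt(d_head)
pattern = softmax(scores, axis=-1)    # each row sums to 1

# OV-circuit: what each attended-to position writes back out
attn_out = pattern @ (X @ W_V) @ W_O  # shape (seq, d_model)
```

The “hooking up” step is just the final line: composing the pattern from the QK-circuit with the value/output map from the OV-circuit.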
This is mostly a “reeling from o3” post. If anyone is doom/anxiety-reading these posts, well, I’ve been doing that too! At least we’re in this together :)
From an apparent author on reddit:
[Frontier Math is composed of] 25% T1 = IMO/undergrad-style problems, 50% T2 = grad/qualifying-exam-style problems, 25% T3 = early-researcher problems
The comment was responding to a claim that Terence Tao said he could only solve a small percentage of questions, but Terence was only sent the T3 questions.
I also have a couple of friends whose conversations require serious thinking (or being on my toes). I think it’s because they have some model of how something works, and when I say something, it exposes my lack of that model.
Additionally, programming causes this as well (in response to compilation errors, nonsense outputs, or runs that take too long).
Was looking up Google Trends lines for ChatGPT and noticed a cyclical pattern:
The dips are weekends, meaning it’s mostly used by people during the workweek. I mostly expect this is students using it for homework. This is substantiated by two other trends:
1. Dips in interest over winter and summer breaks (and Thanksgiving break in the chart above)
2. Interest in “Humanize AI”, which is described as:
Humanize AI™ is your go-to platform for seamlessly converting AI-generated text into authentic, undetectable, human-like content
[Although note that overall interest in ChatGPT is WAY higher than Humanize AI]
I was expecting this to include the output of MIRI for this year. Digging into your links we have:
Two Technical Governance Papers:
1. Mechanisms to verify international agreements about AI development
2. What AI evals for preventing catastrophic risks can and cannot do

Four media pieces of Eliezer regarding AI risk:
1. Semafor piece
2. 1 hr talk w/ panel
3. PBS NewsHour
4. 4 hr video w/ Stephen Wolfram

Is this the full output for the year, or are there less linkable outputs, such as engaging w/ policymakers on AI risks?
Donated $100.
It was mostly due to LW2 that I decided to work on AI safety, actually, so thanks!
I’ve had the pleasure of interacting w/ the LW team quite a bit and they definitely embody the spirit of actually trying. Best of luck to y’all’s endeavors!
I tried a similar experiment w/ Claude 3.5 Sonnet, where I asked it to come up w/ a secret word and in branching paths:
1. Asked directly for the word
2. Played 20 questions, and then guessed the word
In order to see if it has a consistent answer it can refer back to.

Branch 1:
Branch 2:
Which I just thought was funny.
Asking again, telling it about the experiment and how it’s important for it to try to give consistent answers, it initially said “telescope” and then gave hints towards a paperclip.
Interesting to see when it flips its answers, though it’s a simple setup to repeatedly ask for its answer every time.
Also could be confounded by temperature.
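For anyone wanting to replicate: the setup is just a shared conversation prefix that forks into two continuations. A sketch of the structure, where `fake_model` is a stand-in so it runs offline; in a real run you’d swap in an API call to Claude 3.5 Sonnet (ideally with temperature=0, to remove the sampling confounder):

```python
# Stand-in for the model; a real run would call the Claude API here.
def fake_model(messages):
    last = messages[-1]["content"]
    return "My word was 'telescope'." if "word" in last else "Is it a physical object?"

# Shared prefix: the model commits (or doesn't!) to a secret word.
base = [
    {"role": "user", "content": "Think of a secret word, but don't reveal it yet."},
    {"role": "assistant", "content": "Okay, I have a word in mind."},
]

# Branch 1: ask for the word directly.
branch1 = base + [{"role": "user", "content": "What was your word?"}]
# Branch 2: play 20 questions first, then guess.
branch2 = base + [{"role": "user", "content": "Let's play 20 questions to find it."}]

print(fake_model(branch1))
print(fake_model(branch2))
```

Note that since the transcript is the model’s only state between turns, any cross-branch consistency can’t come from hidden memory, which is what makes the experiment interesting.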
It’d be important to cache the karma of all users > 1000 atm, in order to credibly signal that you know which generals were part of the nuking/nuked side. Would anyone be willing to do that in the next 2.5 hours (i.e. the earliest we could be nuked)?
We could instead pre-commit to not engaging with any nuker’s future posts/comments (and at worst comment to encourage others not to engage) until end-of-year.
Or only include nit-picking comments.
Could you dig into why you think it’s great interp work?
But through gradient descent, shards act upon the neural networks by leaving imprints of themselves, and these imprints have no reason to be concentrated in any one spot of the network (whether activation-space or weight-space). So studying weights and activations is pretty doomed.
This paragraph sounded like you’re claiming LLMs do have concepts, but they’re not in specific activations or weights, but distributed across them instead.
But from your comment, you mean that LLMs themselves don’t learn the true simple-compressed features of reality, but a mere shadow of them.
This interpretation also matches the title better!
But are you saying the “true features” are in the dataset + network? Because SAEs are trained on a dataset! (Ignoring the problem pointed out in footnote 1.)
Possibly clustering the data points by their network gradients would be a way to put some order into this mess?
Eric Michaud did cluster datapoints by their gradients here. From the abstract:
...Using language model gradients, we automatically decompose model behavior into a diverse set of skills (quanta).
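The basic recipe there (cluster examples by their per-example gradient direction) can be sketched on a toy model; everything below is made up for illustration — the real method operates on LLM gradients, not a linear regression:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear model w with squared-error loss; the per-example gradient
# dL/dw = (w.x - y) * x, so each example's gradient direction tracks x.
w = rng.normal(size=3)
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(10, 3))
               for c in ([5, 0, 0], [0, 5, 0])])   # two hidden "skills"
y = X.sum(axis=1)

grads = (X @ w - y)[:, None] * X                   # per-example gradients
grads /= np.linalg.norm(grads, axis=1, keepdims=True)

# Crude clustering: examples whose gradient direction aligns with example 0.
labels = (np.abs(grads @ grads[0]) > 0.5).astype(int)
print(labels)
```

With this construction the two “skills” separate cleanly by gradient direction; the paper uses a proper clustering algorithm rather than this one-anchor threshold.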
The one we checked last year was just Pythia-70M, and I don’t expect that LLM itself to have a gender feature that generalizes to both pronouns and anisogamy.
But again, the task is next-token prediction. Do you expect e.g. GPT 4 to have learned a gender concept that affects both knowledge about anisogamy and pronouns while trained on next-token prediction?
Sparse autoencoders find features that correspond to abstract features of words and text. That’s not the same as finding features that correspond to reality.
(Base-model) LLMs are trained to minimize prediction error, and SAEs do seem to find features that sparsely predict error, such as a gender feature that, when removed, affects the probability of pronouns. So pragmatically, for the goal of “finding features that explain next-word-prediction”, which LLMs are directly trained for, SAEs find good examples![1]
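The “removing a feature” intervention is just zeroing one SAE latent before decoding. A toy numpy sketch with random (untrained) weights and a hypothetical latent index — a real SAE is trained to sparsely reconstruct a layer’s activations:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64

W_enc = rng.normal(size=(d_model, d_sae))
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)

def encode(x):
    return np.maximum(x @ W_enc, 0.0)    # ReLU latents

def decode(f):
    return f @ W_dec

x = rng.normal(size=(d_model,))          # stand-in for an LLM activation
f = encode(x)
top = int(f.argmax())                    # pretend this is the "gender" latent

f_ablated = f.copy()
f_ablated[top] = 0.0                     # "removing the feature"

clean, ablated = decode(f), decode(f_ablated)
print(np.abs(clean - ablated).max())     # downstream logits shift by this edit
```

In the real experiment you’d patch the ablated reconstruction back into the forward pass and measure the change in pronoun probabilities.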
I’m unsure what goal you have in mind for “features that correspond to reality”, or what that’d mean.
- ^
Not claiming that all SAE latents are good in this way though.
Is there code available for this?
I’m mainly interested in the loss function. Specifically, from footnote 4:
We also need to add a term to capture the interaction effect between the key-features and the query-transcoder bias, but we omit this for simplicity
I’m unsure how this is implemented, or what the motivation is.
Some MLPs or attention layers may implement a simple linear transformation in addition to actual computation.
@Lucius Bushnaq, why would MLPs compute linear transformations?
Because two linear transformations can be combined into one linear transformation, why wouldn’t downstream MLPs/Attns that rely on this linearly transformed vector just learn the combined function?
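For concreteness, the composition argument is just associativity of matrix multiplication, with made-up shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 4))   # hypothetical "extra" linear map in an MLP
W2 = rng.normal(size=(4, 4))   # downstream layer's linear read-off
x = rng.normal(size=(4,))

# Two linear maps compose into one, so a downstream layer could just
# learn the combined matrix directly instead of relying on W1.
combined = W2 @ W1
print(np.allclose(W2 @ (W1 @ x), combined @ x))  # True
```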
Could you go into more details into what skills these advisers would have or what situations to navigate?
Because I’m baking in “superhuman at coding/maths” due to the structure of those tasks, and other tasks can be improved through either:
1. General capabilities
2. Task-specific capabilities
And there might be ways to differentially accelerate the latter.