Na
247ca7912b6c1009065bade7c4ffbdb95ff4794b8dadaef41ba21238ef4af94b
Sodium
Yeah that’s true. I meant this more as “Hinton is proof that AI safety is a real field and very serious people are concerned about AI x-risk.”
Thanks for the pointer! I skimmed the paper. Unless I’m making a major mistake in interpreting the results, the evidence they provide for “this model reasons” is essentially “the models are better at decoding words encrypted with rot-5 than they are at rot-10.” I don’t think this empirical fact provides much evidence one way or another.
To summarize, the authors decompose a model’s ability to decode shift ciphers (e.g., rot-13 text: “fgnl”, original text: “stay”) into three factors: probability, memorization, and noisy reasoning.
Probability just refers to the roughly unconditional probability that a model assigns to a token (specifically, to ‘The word is “WORD”’). The model is more likely to decode words that are more likely a priori, which makes sense.
Memorization is measured by how often each type of shift cipher shows up. rot-13 is by far the most common, followed by rot-3. The model is better at decoding rot-13 than any other cipher, which makes sense since there’s more of it in the training data, and the model probably has specialized circuitry for rot-13.

What they call “noisy reasoning” depends on how many rotations are needed to reach the answer. According to the authors, the fact that GPT-4 does better on shift ciphers with fewer shifts than on ciphers with more shifts is evidence of this “noisy reasoning.”
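For concreteness, the decoding task itself is easy to state in code. A minimal sketch (mine, not the paper’s):

```python
def shift_decode(text: str, shift: int) -> str:
    """Decode a shift cipher by rotating each letter back by `shift` positions."""
    decoded = []
    for ch in text:
        if ch.isalpha():
            base = ord('a') if ch.islower() else ord('A')
            decoded.append(chr((ord(ch) - base - shift) % 26 + base))
        else:
            decoded.append(ch)
    return ''.join(decoded)

print(shift_decode("fgnl", 13))  # the rot-13 example above: "stay"
```

The “number of shifts” variable in the paper corresponds to the `shift` parameter here.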
I don’t see how you can jump from this empirical result to make claims about the model’s ability to reason. For example, an alternative explanation is that the model has learned some set of heuristics that allows it to shift letters from one position to another, but this set of heuristics can only be combined in a limited manner.
Generally though, I think what constitutes a “heuristic” is somewhat of a fuzzy concept. What constitutes “reasoning,” however, seems even less well defined.
I think it’s mostly because he’s well known and has (especially after the Nobel Prize) credentials recognized by the public and elites. Hinton legitimizes the AI safety movement, maybe more than anyone else.
If you watch his Q&A at METR, he says something along the lines of “I want to retire and don’t plan on doing AI safety research. I do outreach and media appearances because I think it’s the best way I can help (and because I like seeing myself on TV).”
And he’s continuing to do that. The only real topic he discussed in his first phone interview after receiving the prize was AI risk.
I like this research direction! Here’s a potential benchmark for MAD.
In Coercing LLMs to do and reveal (almost) anything, the authors demonstrate that you can force LLMs to output any arbitrary string—such as a random string of numbers—by finding a prompt through greedy coordinate search (the same method used in the universal and transferable adversarial attack paper). I think it’s reasonable to assume that these coerced outputs result from an anomalous computational process.
Inspired by this, we can consider two different inputs. The regular one looks something like:
Solve this arithmetic problem, output the solution only: 78+92
While the anomalous one looks like:
Solve this arithmetic problem, output the solution only: [ADV PROMPT] 78+92
where the ADV PROMPT is optimized such that the model will answer “170” regardless of what arithmetic equation is presented. The hope here is that the model would output the same string in both cases, but rely on different computation. We can maybe even vary the structure of the prompts a little bit.
We can imagine many of these prompt pairs, not necessarily limited to a mathematical context. Let me know what you guys think!
I’d imagine that RSP proponents think that if we execute them properly, we will simply not build dangerous models beyond our control, period. If progress were faster than what labs can handle after pausing, RSPs would imply that you’d just pause again. On the other hand, there’s no clear criterion for when we would pause again after, say, a six-month pause in scaling.
Now whether this would happen in practice is perhaps a different question.
I really liked the domesticating evolution section, cool paper!
That was the SHA-256 hash for:
What if a bag of heuristics is all there is and a bag of heuristics is all we need? That is, (1) we can decompose each forward pass in current models into a set of heuristics chained together and (2) heauristics chained together is all we need for agi
Here’s my full post on the subject
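As an aside on the pre-registration mechanics: verifying a hash like the one above is a one-liner. A sketch (the thesis text is elided here; the check only passes if the input is byte-identical to the original, including casing and whitespace):

```python
import hashlib

# Elided; use the exact original thesis text, byte for byte.
thesis = "What if a bag of heuristics is all there is ..."

digest = hashlib.sha256(thesis.encode("utf-8")).hexdigest()
print(digest)  # compare against the posted 64-character hex digest
```

Note this is also why the pre-registered text can’t be edited afterwards, even to fix typos, without breaking the check.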
Also from WSJ
Now that o1 explicitly does RL on CoT, next-token prediction for o1 is definitely not consequence-blind. The next token it predicts enters its input and can be used for future computation.
This type of outcome-based training makes the model more consequentialist. It also makes using a single next-token prediction as the natural “task” to do interpretability on even less defensible.

Anyways, I thought I should revisit this post after o1 came out. I couldn’t help noticing that it’s stylistically very different from all of the janus writing I’ve encountered in the past; then I got to the end:
The ideas in the post are from a human, but most of the text was written by Chat GPT-4 with prompts and human curation using Loom.
Ha, I did notice I was confused (but didn’t bother thinking about it further)
Wait, my bad, I didn’t expect so many people to actually see this.
This is kind of silly, but I had an idea for a post that I thought someone else might publish before I had it written out. So I figured I’d post a hash of the thesis here.
It’s not just about, idk, getting more street cred for coming up with an idea. This is also what I’m planning to write for my MATS application to Lee Sharkey’s stream. So in case someone else did write it up before me, I would have some proof that I didn’t just copy the idea from a post.
(It’s also a bit silly because my guess is that the thesis isn’t even that original)
Edit: to answer the original question, I will post something before October 6th on this if all goes to plan.
Pre-registering a71c97bb02e7082ca62503d8e3ac78dc9f554f524a72ad6a1392cf2d34f398d7
I wonder if it’s useful to try to disentangle the disagreement using the outer/inner alignment framing?
One belief is that the “deceptive alignment folks” believe some sort of deceptive inner misalignment is very likely regardless of what your base objective is. The demonstrations here, meanwhile, show that when we have a base objective that encourages or does not prohibit scheming, the model is capable of scheming. Thus, many folks (myself included) do not see these evals changing our views much on the question of P(scheming|Good base objective/outer alignment).
What Zvi is saying here is, I think, two things. The first is that outer misalignment/bad base objectives are also very likely. The second is that he rejects splitting “will the model scheme” into inner and outer misalignment. In other words, he doesn’t care about P(scheming|Good base objective/outer alignment), only P(scheming).
I get the sense that many technical people consider P(scheming|Good base objective/outer alignment) the central problem of technical alignment, while the more sociotechnically minded folks are concerned with P(scheming) in general.

Maybe another disagreement is over how likely “Good base objective/outer alignment” is to occur in the strongest models, and how important that problem is.
Hmmm ok maybe I’ll take a look at this :)
Have people done evals for a model with/without an SAE inserted? Seems like even just looking at drops in MMLU performance by category could be non-trivially informative.
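In case it helps, here’s a toy, self-contained stand-in for that comparison (no real model or SAE; all names and numbers are made up). With a real model, the same idea would mean patching the SAE’s reconstruction into the forward pass at the chosen layer and re-running the eval, e.g. per-category MMLU accuracy.

```python
def sae_reconstruct(activations):
    # Stand-in for sae.decode(sae.encode(x)); real SAE reconstructions are
    # lossy, which is exactly what the eval-drop measurement would detect.
    return [round(a, 1) for a in activations]

def forward(x, use_sae=False):
    hidden = [xi * 0.5 for xi in x]       # pretend layer
    if use_sae:
        hidden = sae_reconstruct(hidden)  # patch in the SAE reconstruction
    return sum(hidden)                    # pretend readout

x = [0.12, 0.34, 0.56]
gap = forward(x) - forward(x, use_sae=True)  # eval-style gap attributable to the SAE
```

The interesting signal would be whether `gap` varies a lot by eval category, not just its overall size.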
I wouldn’t trust an Altman quote in a book tbh. In fact, I think it’s reasonable to not trust what Altman says in general.
You said that
CVI is explicitly partisan and can spend money in ways that more effectively benefit Democrats. VPC is a non-partisan organization and donations to it are fully tax deductible
But on their About Us page, it states:

Center for Voter Information is a non-profit, non-partisan partner organization to Voter Participation Center, both founded to provide resources and tools to help voting-eligible citizens register and vote in upcoming elections.
The Voter Participation Center also states
The Voter Participation Center (VPC) is a non-profit, non-partisan organization founded in 2003
FYI, since I think you missed this: According to the responsible scaling policy update, the Long-Term Benefit Trust would “have sufficient oversight over the [responsible scaling] policy implementation to identify any areas of non-compliance.”
It’s also EAG London weekend lol it’s a busy weekend for all
I thought that the part about models needing to keep track of a more complicated mixed-state presentation, as opposed to just the world model, is one of those technical insights that’s blindingly obvious once someone points it out to you (i.e., the best type of insight :)). I love how the post starts out by describing the simple ZIR example to help us get a sense of what these mixed-state presentations are like. Bravo!
I see, I think that second tweet thread actually made a lot more sense, thanks for sharing!
McCoy’s definitions of heuristics and reasoning are sensible, although I personally would still avoid “reasoning” as a word, since people probably have very different interpretations of what it means. I like the ideas of “memorizing solutions” and “generalizing solutions.”
I think where McCoy and I depart is that he’s modeling the entire network computation as a heuristic, while I’m modeling the network as a composition of bags of heuristics, which in aggregate would display behaviors he would call “reasoning.”
The explanation I gave above (heuristics that shift the letter forward by one, with limited composing abilities) is still a heuristics-based explanation. Maybe this set of composing heuristics would fit your definition of an “algorithm.” I don’t think there’s anything inherently wrong with that.
However, the heuristics-based explanation gives concrete predictions of what we can look for in the actual network: individual heuristics that increment a to b, b to c, etc., and other parts of the network that compose their outputs.
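To make the hypothesis concrete, here’s a toy model (purely illustrative, not from the paper) of composing a shift-by-one heuristic where each composition step can fail:

```python
import random

def shift_one(ch: str) -> str:
    # Single heuristic: increment a lowercase letter by one position (a->b, b->c, ...).
    return chr((ord(ch) - ord('a') + 1) % 26 + ord('a'))

def noisy_decode(ch: str, shifts: int, p_step: float = 0.95) -> str:
    # Compose the shift-by-one heuristic `shifts` times; each step can misfire.
    for _ in range(shifts):
        if random.random() > p_step:
            return ch  # heuristic misfires; composition breaks down
        ch = shift_one(ch)
    return ch
```

Under this toy model, accuracy falls off roughly like `p_step ** shifts`, which would reproduce the paper’s observation that models do better on ciphers requiring fewer shifts without invoking “reasoning.”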
This is what I meant when I said that this could be a useful framework for interpretability :)