Filip Sondej

Karma: 361

github.com/filyp

Filip Sondej Dec 21, 2024, 4:05 PM
4 points
0
in reply to: Artyom Karpov’s comment on: Simple Steganographic Computation Eval—gpt-4o and gemini-exp-1206 can’t solve it yet
I wonder if models can actually do this task when we allow them to use CoT for that.
Yes, claude-3.5-sonnet was able to figure this out with additional CoT.
Also, I think models might actually solve this task by using their own encoding scheme if they know it well
Yeah, could be that the 3 schemes I tested were just unnatural to them. Although I would guess they don’t have some default scheme of their own, because in pre-training they aren’t able to output information freely, and in fine-tuning I guess they don’t have that much pressure to learn it.
So perhaps if we ask a model first to think about an encoding scheme and then ask them to use it for the task, they might succeed
Yeah, that would be interesting. I asked claude. He generated 3 schemes which were pretty bad, and 2 which look promising:
```
State 0: End sentence with period (.)
State 1: End sentence with exclamation (!)
State 2: End sentence with question mark (?)
```
```
State 0: Present tense
State 1: Past tense
State 2: Future tense
```
The punctuation one seems especially promising. But they don’t feel much different from the 3 I tested. I don’t expect that they are somehow “special” for the model.

I reran the experiment with the punctuation scheme. The results look about the same overall.

Filip Sondej Dec 19, 2024, 3:48 PM
1 point
0
in reply to: Jozdien’s comment on: Testing which LLM architectures can do hidden serial reasoning
FYI, here is a follow-up to this post, inspired by what you wrote in your comment. https://www.lesswrong.com/posts/YrnffADbWZcDJtmWk/simple-steganographic-computation-eval-gpt-4o-and-gemini-exp-1

Simple Steganographic Computation Eval—gpt-4o and gemini-exp-1206 can’t solve it yet

Filip Sondej Dec 19, 2024, 3:47 PM

14 points

2 comments, 3 min read, LW link

Filip Sondej Dec 17, 2024, 7:21 PM
3 points
0
in reply to: Adam Shai’s comment on: Testing which LLM architectures can do hidden serial reasoning
Yup, here is such a plot, made after training the “switcher” architecture on 350k examples. I remember it was similar for the longer training: the few longest task lengths struggle, but the rest are near 100%.

Filip Sondej Dec 17, 2024, 3:06 PM
7 points
6
in reply to: Jozdien’s comment on: Testing which LLM architectures can do hidden serial reasoning
Also, now that I think of it, these are two different things for the model to do:
1. represent hidden subtle information which is natural for the task, like in that example you gave with “written by”
2. use that hidden subtext for computation which is unnatural in the given context, that is, to covertly do some other task than just the one it is asked for
While 1. definitely happens, I see no reason why 2. would be learned in pre-training. In principle 2. could be learned in RL fine-tuning, if there’s some incentive to use hidden reasoning. Although I imagine it must be hard.

Filip Sondej Dec 16, 2024, 9:06 PM
9 points
6
in reply to: Jozdien’s comment on: Testing which LLM architectures can do hidden serial reasoning
That’s a really good point.

To clarify what I did in these toy steganography experiments: First I tried to pass some more complex information and gave models more freedom in what to output. But as I kept failing at that, I kept simplifying the setup, until I tried to have a model pass just one bit to another model, assuming they both see one secret bit and an overseer tries to guess the message but doesn’t see the secret bit. So a perfect code exists: XORing the message with the secret bit, but the models kept failing to find it. I expect that more complex settings will have a similar problem—that it’s hard to create a new encoding for some information because it’s rewarded only once it is both successfully sent and successfully received.
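To make the one-bit setup concrete, here is a minimal sketch of the XOR code described above. The function names are made up for illustration, and it assumes the secret bit is uniformly random and shared only by sender and receiver.

```python
from itertools import product


def encode(message_bit: int, secret_bit: int) -> int:
    """Sender: hide the message by XORing it with the shared secret bit."""
    return message_bit ^ secret_bit


def decode(sent_bit: int, secret_bit: int) -> int:
    """Receiver: XOR with the same secret bit recovers the message."""
    return sent_bit ^ secret_bit


def overseer_view() -> dict:
    """For each bit the overseer can observe, collect the messages consistent with it."""
    view = {0: set(), 1: set()}
    for message_bit, secret_bit in product((0, 1), repeat=2):
        view[encode(message_bit, secret_bit)].add(message_bit)
    return view


# The receiver always recovers the message.
round_trips = all(
    decode(encode(m, s), s) == m for m, s in product((0, 1), repeat=2)
)
```

Whatever bit the overseer sees, both messages remain possible, so without the secret bit it can do no better than chance. The hard part in the experiments was not that the code is complicated, but that nothing rewards a half-learned version of it.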

But the settings you describe may differ in that they are natural, and there could be some unintentionally outputted subtext that accidentally correlates with the secret information. So the model would already by default “successfully send” the encoded information, and now it just needs to learn to decode it.

It’s not clear to me to what extent this happens. I would definitely want to see some concrete demo of this happening, but right now I don’t have any good ideas how to elicit it.

Also, I wonder how far we could get by using adversaries who try to decode the information encoded in the model’s outputs.

Filip Sondej Dec 16, 2024, 7:48 PM
2 points
0
in reply to: Adam Shai’s comment on: Testing which LLM architectures can do hidden serial reasoning
Ah. That’s the number of solved tasks, after it is shown a set {length 1 task, length 2 task, … , length (n+1) task}, where n is the longest task length it has solved so far. So you can think of it roughly as the maximum task length it is able to solve. So it doesn’t have an upper bound.

I clarified this in the post now. Thanks for catching it.

Filip Sondej Dec 16, 2024, 2:17 PM
3 points
0
in reply to: Daniel Kokotajlo’s comment on: Current AIs Provide Nearly No Data Relevant to AGI Alignment
FYI, I did the experiments I wrote about in my other comment and just posted them. (I procrastinated writing up the results for too long.) https://www.lesswrong.com/posts/ZB6guMhHH3NEyxA2k/testing-which-llm-architectures-can-do-hidden-serial-3

Testing which LLM architectures can do hidden serial reasoning

Filip Sondej Dec 16, 2024, 1:48 PM

81 points

9 comments, 4 min read, LW link

Filip Sondej Sep 4, 2024, 1:08 PM
5 points
0
in reply to: Martín Soto’s comment on: Decision Theory in Space
I liked it precisely because it threw theory out the window and showed that cheap talk is not a real commitment.
- Tarkin > I believe in CDT and I precommit to bla bla bla
- Leia > I believe in FDT and I totally precommit to bla bla bla
- Vader > Death Star goes brrrrr...

Filip Sondej Sep 4, 2024, 12:42 PM
4 points
0
in reply to: Ape in the coat’s comment on: Decision Theory in Space
For me the main thing in this story was that cheap talk =/= real commitment. You can talk all you want about how “totally precommitted” you are, but this lacks some concreteness.

Also, I saw Vader as much less galaxy brained than you portray him. Destroying Alderaan at the end looked to me more like mad ruthlessness than calculated strategy. (And if Leia had known Vader’s actual policy, she would have had no incentive to confess.) Maybe one thing that Vader did achieve is to signal for the future that he really does not care and will be ruthless (but he also signaled that it doesn’t matter if you give in to him, which is dumb).

Anyway, I liked the story, but for the action, not for some deep theoretic insight.

Filip Sondej Jul 16, 2024, 5:55 PM
1 point
0
in reply to: Neel Nanda’s comment on: The Parable of the King and the Random Process
Not sure if that’s what happened in that example, but you can bet that a price will rise above some threshold, or fall below some threshold, using options. You can even do both at the same time, essentially betting that the price won’t stay as it is now.

But whether you will make money that way depends on the price of options.
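As a rough illustration of betting in both directions at once, here is a sketch of the profit at expiry from buying one put below the current price and one call above it. The function name, strikes and premiums are made-up numbers, and it ignores fees, early exercise and everything else a real trade involves.

```python
def two_sided_bet_profit(
    final_price: float,
    put_strike: float,
    call_strike: float,
    put_premium: float,
    call_premium: float,
) -> float:
    """Profit at expiry from holding one put and one call, net of what they cost."""
    put_payoff = max(put_strike - final_price, 0.0)    # pays if price fell below put_strike
    call_payoff = max(final_price - call_strike, 0.0)  # pays if price rose above call_strike
    return put_payoff + call_payoff - (put_premium + call_premium)


# Hypothetical numbers: price is 100 today, thresholds at 90 and 110.
stayed_put = two_sided_bet_profit(100, 90, 110, put_premium=2, call_premium=3)
rose = two_sided_bet_profit(120, 90, 110, put_premium=2, call_premium=3)
fell = two_sided_bet_profit(80, 90, 110, put_premium=2, call_premium=3)
```

If the price stays between the thresholds you lose both premiums, so the bet only pays off when the move is larger than what the option prices already charge for.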

Filip Sondej May 7, 2024, 12:02 PM
LW: 1 AF: 1
0
AF
on: An Interpretability Illusion for Activation Patching of Arbitrary Subspaces
What if we constrain v to be in some subspace that is actually used by the MLP? (We can get it from PCA over activations on many inputs.)

This way v won’t have any dormant component, so the MLP output after patching also cannot use that dormant pathway.
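A minimal sketch of what I mean, using NumPy. The function names and the 99% variance cutoff are arbitrary choices for illustration, and it assumes `acts` holds the MLP input activations collected over many inputs, one row per input.

```python
import numpy as np


def active_subspace(acts: np.ndarray, var_kept: float = 0.99) -> np.ndarray:
    """Return an orthonormal basis (k, d) of the top principal directions of acts (n, d)."""
    centered = acts - acts.mean(axis=0)
    # Rows of vt are principal directions, ordered by singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    ratio = np.cumsum(s**2) / np.sum(s**2)
    k = min(int(np.searchsorted(ratio, var_kept)) + 1, len(s))
    return vt[:k]


def constrain(v: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Project v onto the span of basis, dropping any component outside it."""
    return basis.T @ (basis @ v)


# Toy check: activations only ever vary along the first two of four dimensions.
rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 2)) @ np.array([[1.0, 0, 0, 0], [0, 1.0, 0, 0]])
basis = active_subspace(acts)
v_constrained = constrain(np.ones(4), basis)
```

In the toy check the last two coordinates of `v_constrained` come out as (numerically) zero, since nothing in the data ever moves along them. Whether low variance is the right proxy for "dormant" in a real MLP is an open question here.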

Filip Sondej Apr 7, 2024, 9:11 AM
7 points
0
on: Conflict in Posthuman Literature
I wanna link to my favorite one: consciousness vs replicators. It doesn’t really fit into this grid, but I think it really is the ultimate conflict.

(You can definitely skip the first 14 min of this video, as it’s just ranking people’s stages of development. Maybe even first 33 min if you wanna go straight to the point.)

Filip Sondej Mar 18, 2024, 7:35 PM
13 points
2
on: Community Notes by X
I wonder what would happen if we ran the simple version of that algorithm on LW comments. Votes would have “polarity”, so each comment would have two vote counts, let’s say an orange count and a blue count. (Of course that would be only optionally enabled.)

Then we could sort the comments by the minimum of these counts, descending.

(I think it makes more sense to train it per post than globally. But then it would be useful only on very popular posts with lots of comments.)
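The sorting step itself is tiny. Here is a sketch, assuming the polarity of each vote has already been worked out somehow (that is the hard part, and it is not shown). The `Comment` class and the example numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    text: str
    orange: int  # votes from one side
    blue: int    # votes from the other side


def sort_by_cross_side_support(comments: list) -> list:
    """Sort by the smaller of the two counts, descending."""
    return sorted(comments, key=lambda c: min(c.orange, c.blue), reverse=True)


comments = [
    Comment("loved by one side only", orange=40, blue=1),
    Comment("liked by both sides", orange=12, blue=10),
    Comment("mostly ignored", orange=2, blue=0),
]
ranked = [c.text for c in sort_by_cross_side_support(comments)]
```

Taking the minimum means a comment can only rank high if both sides upvoted it, no matter how enthusiastic one side is on its own.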
What links here?
- Filip Sondej's comment on EA Forum feature suggestion thread by Aaron Gertler 🔸 (EA Forum; Apr 8, 2024, 11:30 AM; 11 points)

Filip Sondej Mar 8, 2024, 12:03 PM
6 points
1
on: Masterpiece
Thanks, that’s terrifying.

I hope we invent mindmelding before we invent all this. Maybe if people can feel those states themselves, they won’t let the worst of them happen.

Filip Sondej Jan 18, 2024, 12:25 PM
5 points
0
in reply to: Daniel Kokotajlo’s comment on: Current AIs Provide Nearly No Data Relevant to AGI Alignment
Unfortunately I didn’t have any particular tasks in mind when I wrote it. I was vaguely thinking about settings as in:
- https://arxiv.org/pdf/2305.04388.pdf
- https://arxiv.org/pdf/2307.13702.pdf
Now that I thought about it, for this particular transformers vs mamba experiment, I’d go with something even simpler. I want a task that is very easy sequentially, but hard to answer immediately. So for example a task like:
```
x = 5
x += 2
x *= 3
x **= 2
x -= 3
...
```
and then have a CoT:
```
after x = 5
5
after x += 2
7
...
```
And then we intervene on the CoT to introduce some error in one operation, but still ask the model to give the correct answer at the end. (Despite all steps after the error being irrelevant.) We can go even further and train the models to give the correct answer after an inadequate CoT. And have a curriculum where at first it only needs to do one hidden operation, later two, and so on.
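Here is a rough sketch of how such examples could be generated. All names are made up, I left out `**=` so the numbers stay small, and letting the error propagate through the later CoT lines is just one possible way to corrupt it.

```python
import random

# Supported operations (a subset of the ones in the example above).
OPS = {
    "+=": lambda x, k: x + k,
    "-=": lambda x, k: x - k,
    "*=": lambda x, k: x * k,
}


def make_task(length: int, rng: random.Random):
    """Sample a start value and a list of (operator, operand) steps."""
    start = rng.randint(1, 9)
    steps = [(rng.choice(list(OPS)), rng.randint(1, 9)) for _ in range(length)]
    return start, steps


def render_cot(start, steps, error_at=None, error_delta=1):
    """Return (cot_text, true_answer).

    If error_at is set, the value shown after that step is off by error_delta,
    and all later shown values are computed from the wrong value.
    """
    lines = [f"after x = {start}", str(start)]
    true_x = shown_x = start
    for i, (op, k) in enumerate(steps):
        true_x = OPS[op](true_x, k)
        shown_x = OPS[op](shown_x, k)
        if i == error_at:
            shown_x += error_delta
        lines += [f"after x {op} {k}", str(shown_x)]
    return "\n".join(lines), true_x


# Fixed example: 5 -> 7 -> 21 -> 18, with the error injected at the second operation.
cot, answer = render_cot(5, [("+=", 2), ("*=", 3), ("-=", 3)], error_at=1)
```

For the curriculum, one option would be to set `error_at = len(steps) - n_hidden`, so the model has to redo the last `n_hidden` operations on its own, starting with `n_hidden = 1` and increasing it over time.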

(It’s an unrealistic setting, but the point is rather to check if the model is able at all to learn hidden sequential reasoning.)

Now, my hypothesis is that transformers will have some limited sequence length for which they can do it (probably smaller than their number of layers), but mamba won’t have a limit.

I was working on this for six months

Can you say what you tried in these six months and how it went?

Filip Sondej Jan 17, 2024, 7:51 PM
7 points
0
in reply to: Daniel Kokotajlo’s comment on: Current AIs Provide Nearly No Data Relevant to AGI Alignment
Yeah, true. But it’s also easier to do early, when no one is that invested in the hidden-recurrence architectures, so there’s less resistance and it doesn’t break anyone’s plans.

Maybe a strong experiment would be to compare mamba-3b and some SOTA 3b transformer, trained similarly, on several tasks where we can evaluate CoT faithfulness. (Although maybe at 3b capability level we won’t see clear differences yet.) The hard part would be finding the right tasks.

Filip Sondej Jan 17, 2024, 3:31 PM
7 points
0
in reply to: Daniel Kokotajlo’s comment on: Current AIs Provide Nearly No Data Relevant to AGI Alignment

the natural language bottleneck is itself a temporary stage in the evolution of AI capabilities. It is unlikely to be an optimal mind design; already many people are working on architectures that don’t have a natural language bottleneck

This one looks fatal. (I think the rest of the reasons could be dealt with somehow.)

What existing alternative architectures do you have in mind? I guess mamba would be one?

Do you think it’s realistic to regulate this? For example, requiring that above a certain size, models can’t have recurrence that uses a hidden state, but recurrence that uses natural language (or images) is fine. (Or maybe some softer version of this, if the alignment tax proves too high.)

Filip Sondej Jan 14, 2024, 11:46 AM
1 point
on: A Better Web is Coming
I like initiatives like these. But they have a major problem, that at the beginning no users will use it because there’s no content, and no content is created because there are no users.

To have a real shot at adoption, you need to either initially populate the new system with content from the existing system (here LLMs could help solve compatibility issues), or have some bridge that mirrors (some) activity between these systems.

(There are examples of systems that kicked off from zero, but you need to be lucky or put a huge effort into sparking adoption.)

Filip Sondej

Simple Steganographic Computation Eval—gpt-4o and gemini-exp-1206 can’t solve it yet

Testing which LLM architectures can do hidden serial reasoning
