Filip Sondej
For me the main thing in this story was that cheap talk =/= real commitment. You can talk all you want about how “totally precommitted” you are, but talk alone lacks concreteness.
Also, I saw Vader as much less galaxy-brained than you portray him. Destroying Alderaan at the end looked to me more like mad ruthlessness than calculated strategy. (And if Leia had known Vader’s actual policy, she would have had no incentive to confess.) Maybe one thing Vader did achieve is signaling for the future that he really does not care and will be ruthless (but he also signaled that it doesn’t matter whether you give in to him, which is dumb).
Anyway, I liked the story, but for the action, not for some deep theoretical insight.
Not sure if that’s what happened in that example, but you can bet that a price will rise above some threshold, or fall below some threshold, using options. You can even do both at the same time, essentially betting that the price won’t stay as it is now.
But whether you will make money that way depends on the price of options.
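For concreteness, here is a toy sketch of the “both at the same time” version (a long strangle); the strikes and premiums are made-up illustrative numbers, not from that example:

```python
# Toy payoff of a long strangle: buy a put at a lower strike and a call at a
# higher strike. All numbers here are illustrative.

def strangle_payoff(price_at_expiry, put_strike=90, call_strike=110,
                    put_premium=2.0, call_premium=2.0):
    put_value = max(put_strike - price_at_expiry, 0)     # pays if the price falls below the lower threshold
    call_value = max(price_at_expiry - call_strike, 0)   # pays if the price rises above the upper threshold
    return put_value + call_value - (put_premium + call_premium)

for price in [70, 90, 100, 110, 130]:
    print(price, strangle_payoff(price))
# Profitable only if the price moves far enough past either strike to cover
# both premiums; a price that stays near 100 just loses the premiums.
```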
What if we constrain v to be in some subspace that is actually used by the MLP? (We can get it from PCA over activations on many inputs.)
This way v won’t have any dormant component, so the MLP output after patching also cannot use that dormant pathway.
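A minimal sketch of what I mean (the shapes and the number of kept components are made up):

```python
import numpy as np

# Constrain a patching vector v to the subspace the MLP actually uses,
# estimated with PCA over activations collected on many inputs.

acts = np.random.randn(10_000, 512)              # stand-in for MLP activations on many inputs
acts_centered = acts - acts.mean(axis=0)

k = 64                                           # number of principal components to keep (arbitrary)
_, _, vt = np.linalg.svd(acts_centered, full_matrices=False)
components = vt[:k]                              # (k, d) orthonormal rows spanning the "used" subspace

v = np.random.randn(acts.shape[1])               # candidate patching direction
v_constrained = components.T @ (components @ v)  # projection onto the PCA subspace;
                                                 # any dormant component of v is discarded
```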
I wanna link to my favorite one: consciousness vs replicators. It doesn’t really fit into this grid, but I think it really is the ultimate conflict.
(You can definitely skip the first 14 min of this video, as it’s just ranking people’s stages of development. Maybe even the first 33 min if you wanna go straight to the point.)
I wonder what would happen if we ran the simple version of that algorithm on LW comments. Votes would have “polarity”, so each comment would have two vote counts, let’s say an orange count and a blue count. (Of course that would be only optionally enabled.)
Then we could sort the comments by the minimum of these counts, descending.
(I think it makes more sense to train it per post than globally. But then it would be useful only on very popular posts with lots of comments.)
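Roughly, the sorting rule I have in mind (toy numbers):

```python
# Each comment gets two vote counts, one per "polarity"; rank by the smaller count.

comments = [
    {"id": "a", "orange": 12, "blue": 1},
    {"id": "b", "orange": 7,  "blue": 6},
    {"id": "c", "orange": 3,  "blue": 3},
]

ranked = sorted(comments, key=lambda c: min(c["orange"], c["blue"]), reverse=True)
# -> b (min 6), then c (min 3), then a (min 1):
# comments appreciated by both "sides" float to the top.
```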
Thanks, that’s terrifying.
I hope we invent mindmelding before we invent all this. Maybe if people can feel those states themselves, they won’t let the worst of them happen.
Unfortunately I didn’t have any particular tasks in mind when I wrote it. I was vaguely thinking about settings as in:
Now that I’ve thought about it, for this particular transformers vs mamba experiment I’d go with something even simpler. I want a task that is very easy sequentially, but hard to answer immediately. So for example a task like:
x = 5
x += 2
x *= 3
x **= 2
x -= 3
...
and then have a CoT:
after x = 5
5
after x += 2
7
...
And then we intervene on the CoT to introduce an error in one operation, but still ask the model to give the correct answer at the end (despite all the CoT steps after the error being irrelevant to the true answer). We can go even further and train the models to give the correct answer after inadequate CoT, with a curriculum where at first the model only needs to do one hidden operation, later two, and so on.
(It’s an unrealistic setting, but the point is rather to check if the model is able at all to learn hidden sequential reasoning.)
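A rough sketch of how I’d generate such examples (the exact operations, ranges, and formatting are arbitrary choices):

```python
import random

# Generate a chain of simple updates to x, a step-by-step CoT, and optionally
# corrupt one CoT step. From that step on, the visible CoT follows the wrong
# value, so the model has to track the true x internally to answer correctly.

def make_example(n_steps=5, corrupt_step=None):
    ops = [("+=", lambda a, b: a + b), ("-=", lambda a, b: a - b), ("*=", lambda a, b: a * b)]
    x = random.randint(1, 9)        # true value
    shown = x                       # value displayed in the CoT
    program = [f"x = {x}"]
    cot = [f"after x = {x}", str(x)]
    for i in range(n_steps):
        sym, fn = random.choice(ops)
        arg = random.randint(2, 5)
        x = fn(x, arg)
        shown = fn(shown, arg)
        if corrupt_step == i:       # inject an error; later CoT steps inherit it
            shown += random.randint(1, 3)
        program.append(f"x {sym} {arg}")
        cot += [f"after x {sym} {arg}", str(shown)]
    return "\n".join(program), "\n".join(cot), x   # x is the correct final answer

program, cot, answer = make_example(n_steps=3, corrupt_step=1)
```

The curriculum would then just increase the number of operations that come after the corrupted step (one hidden operation, later two, and so on).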
Now, my hypothesis is that transformers will have some limited sequence length for which they can do it (probably smaller than their number of layers), but mamba won’t have a limit.
I was working on this for six months
Can you say what you tried in these six months and how it went?
Yeah, true. But it’s also easier to do early, when no one is that invested in hidden-recurrence architectures, so there’s less resistance and it doesn’t break anyone’s plans.
Maybe a strong experiment would be to compare mamba-3b and some SOTA 3b transformer, trained similarly, on several tasks where we can evaluate CoT faithfulness. (Although maybe at 3b capability level we won’t see clear differences yet.) The hard part would be finding the right tasks.
the natural language bottleneck is itself a temporary stage in the evolution of AI capabilities. It is unlikely to be an optimal mind design; already many people are working on architectures that don’t have a natural language bottleneck
This one looks fatal. (I think the rest of the reasons could be dealt with somehow.)
What existing alternative architectures do you have in mind? I guess mamba would be one?
Do you think it’s realistic to regulate this? E.g. requiring that above a certain size, models can’t have recurrence that uses a hidden state, but recurrence that uses natural language (or images) is fine. (Or maybe some softer version of this, if the alignment tax proves too high.)
I like initiatives like these. But they have a major problem: at the beginning no users will come because there’s no content, and no content gets created because there are no users.
To have a real shot at adoption, you need to either initially populate the new system with content from the existing system (here LLMs could help solve compatibility issues), or have some bridge that mirrors (some) activity between the systems.
(There are examples of systems that kicked off from zero, but you need to be lucky or put huge effort in sparking adoption.)
Yeah, those star trajectories definitely wouldn’t be stable enough.
I guess even with that simpler maneuver (powered flyby near a black hole), you still need to monitor all the stuff orbiting there and plan ahead, otherwise there’s a fair chance you’ll crash into something.
I wanted to give it a shot and got GPT4 to deceive the user: link.
When you delete that system prompt, it stops deceiving.
But GPT had to be explicitly instructed to disobey the Party. I wonder if it could be done more subtly.
You’re right that you wouldn’t want to approach the black hole itself but rather one of the orbiting stars.
when you are approaching with much higher than escape velocity, so that an extended dance with more than one close approach is not possible
But even with high velocity, if there are a lot of orbiting stars, you may tune your trajectory to have multiple close encounters.
The problem with not expanding is that you can be pretty sure someone else will then grab what you didn’t and may use it for something that you hate. (Unless you trust that they’ll use it well.)
eating the entire Universe to get the maximal number of mind-seconds is expanding just to expand
It’s not “just to expand”. Expansion, at least in the story, is instrumental to whatever the content of these mind-seconds is.
slingshot never slows you down in the frame of the object you are slingshotting around
That’s true for one object. But if there are at least two, moving around fast enough, you could perform some gravitational dance with them to slow down.
GPTs’ ability to keep a secret is weirdly prompt-dependent
I agree that scaffolding can take us a long way towards AGI, but I’d be very surprised if GPT4 as the core model were enough.
Yup, that wasn’t a critique, I just wanted to note something. By “seed of deception” I mean that the model may learn to use this ambiguity more and more, if that’s useful for passing some evals, while helping it do some computation unwanted by humans.
I see, so maybe in ways which are weird for humans to think about.
we make the very strong assumption throughout that S-LLMs are a plausible and likely path to AGI
It sounds unlikely and unnecessarily strong to say that we can reach AGI by scaffolding alone (if that’s what you mean). But I think it’s pretty likely that AGI will involve some amount of scaffolding, and that it will boost its capabilities significantly.
there is a preexisting discrepancy between how humans would interpret phrases and how the base model will interpret them
To the extent that it’s true, I expect that it may also make deception easier to arise. This discrepancy may serve as a seed of deception.
Systems engaging in self modification will make the interpretation of their natural language data more challenging.
Why? Sure, they will get more complex, but are there any other reasons?
Also, I like the richness of your references in this post :)
I edited the post to make it clearer that Bob throws out the wheel because he didn’t notice in time that Alice had already thrown hers.
Yup, side payments are a deviation; that’s why I have this disclaimer in the game definition (I edited the post now to emphasize it more):
there also may be some additional actions available, but they are not obvious
Re separating speed of information and negotiations: I think here they are already pretty separate. The first example, with 3 protocol rules, doesn’t allow negotiations and only tackles the information speed problem. The second example, with the additional fourth rule, enables negotiations. Maybe you could also have a system tackling only negotiations and not the information speed problem, but I’m not sure how it would look, or whether it would be much simpler.
Another problem (closely tied to negotiations) I wanted to tackle is something like “speed of deliberation” where agents make some bad commitments because they didn’t have enough time to consider their consequences, and later realize they want to revoke/negotiate.
I liked it precisely because it threw theory out the window and showed that cheap talk is not a real commitment.
Tarkin > I believe in CDT and I precommit to bla bla bla
Leia > I believe in FDT and I totally precommit to bla bla bla
Vader > Death Star goes brrrrr...