benwr

Karma: 698

If you have feedback for me, you can fill out the form at https://forms.gle/kVk74rqzfMh4Q2SM6 .

Or you can email me, at [the second letter of the alphabet]@[my username].net

benwr Apr 16, 2025, 9:45 PM
1 point
0
in reply to: owencb’s comment on: Not all capabilities will be created equal: focus on strategically superhuman agents
I think it seems like a fine possibility in principle, actually; sorry to have given the wrong impression! It’s not my central hope, since strategy-stealing seems like it should make many human-augmentations “available” to AI systems as well. This is notably not true for things involving, e.g., BCIs or superbabies.

benwr Apr 15, 2025, 8:20 PM
1 point
0
in reply to: owencb’s comment on: Not all capabilities will be created equal: focus on strategically superhuman agents
When I’m thinking about this, it seems kind of fine if the goalposts move—human strategic capacity will certainly move over time no matter what, right? Like, someone invented crowdfunding and suddenly we could do types of coordination that we previously couldn’t do.

benwr Feb 20, 2025, 11:50 PM
1 point
0
on: Biological humans collectively exert at most 400 gigabits/s of control over the world.
Nate Soares points out that the first paragraph is not quite right: Imagine writing a program that somehow implements an aligned superintelligence, giving it as an objective, “maximize utility according to the person who pressed the ‘go’ button”, and pressing the ‘go’ button.

There’s some sense in which, by virtue of existing in the world, you’re already kind of “lucky” by this metric: It can take a finite amount of information to instantiate an agent that takes unbounded actions on your behalf.
What links here?
- Biological humans collectively exert at most 400 gigabits/s of control over the world. by benwr (Feb 20, 2025, 11:44 PM; 15 points)

benwr Feb 20, 2025, 8:02 PM
1 point
0
in reply to: benwr’s comment on: benwr’s Shortform
I asked Deep Research to see if there are existing treatments of this basic idea in the literature. It seems most closely related to the concept of “empowerment” in RL, which I’m surprised I hadn’t heard of: https://en.m.wikipedia.org/wiki/Empowerment_(artificial_intelligence)
The Wikipedia article makes it seem like this might also be how RL people think about instrumental convergence?

benwr Feb 20, 2025, 9:26 AM
1 point
0
on: benwr’s Shortform
Human information throughput is allegedly only about 10-50 bits per second. This implies an interesting upper bound, in that the information throughput of biological humanity as a whole can’t be higher than around 50 * 10^10 = 500Gbit/s. I.e., if all distinguishable actions made by humans were perfectly independent, biological humanity as a whole would have at most 500Gbit/s of “steering power”.
I need to think more about the idea of “steering power” (e.g. some obvious rough edges around amplifying your steering power using external information processing / decision systems), but I have some intuition that one might actually be able to come up with a not-totally-useless concept that lets us say something like “humanity can’t stay in ‘meaningful control’ if we have an unaligned artificial agent with more steering power than humanity, expressed in bits/s”.
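As a quick check of the arithmetic above, here is a minimal sketch of the bound. The constant and function names are illustrative choices of mine; the 50 bit/s figure and the rounded population of 10^10 are the ones used in the comment.

```python
# Back-of-the-envelope bound on humanity's aggregate information throughput.
BITS_PER_PERSON_PER_SECOND = 50   # upper end of the 10-50 bit/s range quoted above
POPULATION = 10**10               # rounded-up population, as in the comment


def aggregate_throughput_bits_per_s(bits_per_person: float, population: int) -> float:
    """Upper bound, assuming every person's actions are perfectly independent."""
    return bits_per_person * population


upper_bound = aggregate_throughput_bits_per_s(BITS_PER_PERSON_PER_SECOND, POPULATION)
print(f"{upper_bound / 1e9:.0f} Gbit/s")  # 500 Gbit/s
```

Plugging in a population of 8 × 10^9 instead gives 400 Gbit/s, which is presumably where the figure in the linked post's title comes from.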

benwr Feb 13, 2025, 9:00 PM
1 point
0
in reply to: ozziegooen’s comment on: Not all capabilities will be created equal: focus on strategically superhuman agents
I think you may have missed, or at least not taken literally, at least one of these things in the post:
1. The expansion of “superhuman strategic agent” is not “agent that’s better than humans at strategic reasoning”, it’s “agent that is better than the best groups of humans at taking (situated) strategic action”
2. Strategic action is explicitly context-dependent, e.g. an AI system that’s inside a mathematically perfect simulated world that can have no effect on the rest of the physical world and vice versa, has zero strategic power in this sense. Also e.g. in the FAQ, “Capabilities and controls are relevant to existential risks from agentic AI insofar as they provide or limit situated strategic power.” So, yes, an agent that lives on your laptop is only strategically superhuman if it has the resources to actually take strategic action rivaling the most strategically capable groups of humans.
3. “increasingly accurately” is meant to point out that we don’t need to understand or limit the capabilities of things that are obviously much strategically worse than us.
What links here?
- ozziegooen's comment on 6 (Potential) Misconceptions about AI Intellectuals by ozziegooen (Feb 16, 2025, 8:07 PM; 2 points)

benwr Feb 6, 2025, 2:29 PM
10 points
4
on: benwr’s Shortform
I think it probably makes sense for ~everyone to have an explicit list of “things I’d like AI to do for me”, especially around productivity and/or things that could help you with world-saving. If you have a list like this, and we happen to hit a relevant capability threshold before we lose, you should probably hand that thing off as quickly as possible rather than keep spending your own time on it.

benwr Sep 24, 2024, 9:28 PM
2 points
0
on: Bounty for Evidence on Some of Palisade Research’s Beliefs
Thanks everyone for thoughts so far! I do want to emphasize that we’re actually highly interested in collecting even the most “obvious” evidence in favor of or against these ideas. In fact, in many ways we’re more interested in the obvious evidence than in reframes or conceptual problems in the ideas here; of course we want to be updating our beliefs, but we also want to get a better understanding of the existing state of concrete evidence on these questions. This is partly because we consider it part of our mission to expand the amount and quality of relevant evidence on these beliefs, and are trying to ensure that we’re aware of existing work.

benwr Jul 8, 2024, 5:02 AM
5 points
0
in reply to: benwr’s comment on: benwr’s Shortform
Surprisingly to me, Claude 3.5 Sonnet is much more consistent in its answer! It is still not perfect, but it usually says the same thing (9/10 times it gave the same answer).

benwr Jul 7, 2024, 8:18 PM
16 points
7
on: benwr’s Shortform
From the “obvious-but-maybe-worth-mentioning” file:

ChatGPT (4 and 4o at least) cheats at 20 questions:

If you ask it “Let’s play a game of 20 questions. You think of something, and I ask up to 20 questions to figure out what it is.”, it will typically claim to “have something in mind”, and then appear to play the game with you.

But it doesn’t store hidden state between messages, so when it claims to “have something in mind”, either that’s false, or at least it has no way of following the rule that it’s thinking of a consistent thing throughout the game. I.e., its only options are to cheat or refuse to play.

You can verify this by responding “Actually, I don’t have time to play the whole game right now. Can you just tell me what it was you were thinking of?”, and then “refreshing” its answer. When I did this 10 times, I got 9 different answers and only one repeat.
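If you want to repeat the check and tally the results, something like the sketch below would do. It assumes you paste in the regenerated answers by hand; the function name is mine, and the sample list holds placeholder labels, not the answers the model actually gave.

```python
from collections import Counter


def tally_answers(answers):
    """Return (number of distinct answers, number of repeats of an earlier answer)."""
    counts = Counter(a.strip().lower() for a in answers)
    distinct = len(counts)
    repeats = len(answers) - distinct
    return distinct, repeats


# Placeholder labels only; these are NOT the answers ChatGPT actually gave.
regenerated = [f"answer_{i}" for i in range(9)] + ["answer_0"]
print(tally_answers(regenerated))  # (9, 1)
```

A model that really had one thing in mind would produce `(1, 9)` over ten regenerations.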

benwr Feb 29, 2024, 9:17 PM
3 points
on: benwr’s Shortform
Sometimes people use “modulo” to mean something like “depending on”, e.g. “seems good, modulo the outcome of that experiment” [correct me ITT if you think they mean something else; I’m not 100% sure]. Does this make sense, assuming the term comes from modular arithmetic?

Like, in modular arithmetic you’d say “5 is 3, modulo 2”. It’s kind of like saying “5 is the same as 3, if you only consider their relationship to modulus 2”. This seems pretty different to the usage I’m wondering about; almost its converse: to import the local English meaning of “modulo”, you’d be saying “5 is the same as 3, as long as you’ve taken their relationship to the modulus 2 into account”. This latter statement is false; 5 and 3 are super different even if you’ve taken this relationship into account.

But the sense of the original quote doesn’t work with the mathematical meaning: “seems good, if you only consider the outcome of that experiment and nothing else”.

Is there a math word that means the thing people want “modulo” to mean?
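For concreteness, the modular-arithmetic sense of “5 is 3, modulo 2” can be written out as below. This is just the standard definition of congruence; the function name is an illustrative choice.

```python
def congruent(a: int, b: int, modulus: int) -> bool:
    """True when a and b leave the same remainder on division by modulus."""
    return (a - b) % modulus == 0


print(congruent(5, 3, 2))  # True: 5 and 3 are both odd
print(5 == 3)              # False: the congruence ignores every other difference
```

The point is that the congruence throws away everything except the relationship to the modulus, which is the opposite of the colloquial “everything is fine except for this one thing”.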

benwr Feb 16, 2024, 6:44 AM
1 point
0
in reply to: ChristianKl’s comment on: 11 diceware words is enough
Well, not that much, right? If you had an 11-word diceware passphrase to start, each word is about 7 characters on average, so you have maybe 90 places to insert a token; only about 6.5 extra bits come from choosing a place to insert your character. And of course you get the same added entropy from inserting 3 random base32 chars at a random location.

Happy to grant that a cracker assuming no unicode won’t be able to crack your password, but if that’s your goal then it might be a bad idea to post about your strategy on the public internet ;)
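The position estimate above can be sanity-checked with a few lines. The word count and average word length come from the comment; treating the words as separated by single spaces is my assumption, and the variable names are illustrative.

```python
import math

WORDS = 11              # passphrase length from the comment
AVG_WORD_LEN = 7        # rough average characters per word, per the comment
SEPARATORS = WORDS - 1  # assumption: words are separated by single spaces

passphrase_len = WORDS * AVG_WORD_LEN + SEPARATORS  # total characters
insert_positions = passphrase_len + 1               # gaps between characters, plus both ends

# Entropy gained from picking one insertion point uniformly at random.
position_bits = math.log2(insert_positions)
print(f"{insert_positions} positions -> {position_bits:.2f} bits")
print(f"90 positions -> {math.log2(90):.2f} bits")
```

Either way the choice of position contributes a bit under 6.5 bits, which is small next to the entropy of the passphrase itself.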

benwr Feb 15, 2024, 7:31 PM
1 point
0
in reply to: ChristianKl’s comment on: 11 diceware words is enough
maybe; probably the easiest way to do this is to choose a random 4-digit hexadecimal number, which gives you 16 bits when you enter it (e.g. via ctrl+shift+u on linux). But personally I think I’d usually rather just enter those hex digits directly, for the same entropy minus a keystroke. Or, even better, maybe just type a random 3-character base32 string for one fewer bit.
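A rough sketch of the comparison, assuming the lowercased RFC 4648 base32 alphabet and Python's `secrets` module as the random source (both are my choices, not something the thread specifies):

```python
import math
import secrets

HEX_ALPHABET = "0123456789abcdef"
BASE32_ALPHABET = "abcdefghijklmnopqrstuvwxyz234567"  # RFC 4648 alphabet, lowercased


def random_string(alphabet: str, length: int) -> str:
    """Draw `length` symbols uniformly at random using a CSPRNG."""
    return "".join(secrets.choice(alphabet) for _ in range(length))


def entropy_bits(alphabet: str, length: int) -> float:
    """Entropy of a uniformly random string of `length` symbols over `alphabet`."""
    return length * math.log2(len(alphabet))


print(random_string(HEX_ALPHABET, 4), entropy_bits(HEX_ALPHABET, 4))        # 16.0 bits
print(random_string(BASE32_ALPHABET, 3), entropy_bits(BASE32_ALPHABET, 3))  # 15.0 bits
```

So the 3-character base32 string costs one bit of entropy and saves a keystroke relative to 4 hex digits.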

benwr Aug 1, 2023, 11:04 AM
5 points
in reply to: benwr’s comment on: Babble challenge: 50 ways of sending something to the moon
Some thoughts after doing this exercise:

I did the exercise because I couldn’t sleep; I didn’t keep careful count of the time, and I didn’t do it all in one sitting. I’d guess I spent about an hour on it total, but I think there’s a case to be made that this was cheating. However, “fresh eyes” is actually a really killer trick when doing this kind of exercise, in my experience, and it’s usually available in practice. So I don’t feel too bad about it.
I really really dislike the experience of saying things I think are totally stupid, and I currently don’t buy that I should start trying to say stupider things. My favorite things in the above list came from refusing to just say another totally stupid thing. Nearly everything in my list is stupid in some way, but the things that are so stupid they don’t even feel interesting basically make me feel sad. I trust my first-round aesthetic pruner to actually be helping to train my babbler in constructive directions.
The following don’t really feel worth having said, to me:
- Throw it really hard
- Catapult
- Kick it really hard
- Wormhole
- Nuclear explosion based craft
My favorites didn’t come after spewing this stuff; instead they came when I refused to be okay with just saying more of that kind of junk:
- Move the thing upward by one foot per day
- Name the thing “420 69 Doge To The Moon” and hope Elon takes the bait
- The various bogo-send options
- Optical tweezers
The difference isn’t really that these are less stupid; in fact they’re kind of more stupid, practically speaking. But I actually viscerally like them, unlike the first group. Forcing myself to produce things I hate feels like a bad strategy on lots of levels.

benwr Aug 1, 2023, 10:32 AM
3 points
in reply to: benwr’s comment on: Babble challenge: 50 ways of sending something to the moon
A thing that was going through my head but I wasn’t sure how to turn into a real idea (vulgar language from a movie):
Perhaps you would like me to stop the car and you two can fuck yourselves to Lutsk!

benwr Aug 1, 2023, 10:27 AM
3 points
in reply to: LoganStrohl’s comment on: Babble challenge: 50 ways of sending something to the moon
Whoa. I also thought of this, though for me it was like thing 24 or something, and I was too embarrassed to actually include it in my post.

benwr Aug 1, 2023, 10:17 AM
3 points
on: Babble challenge: 50 ways of sending something to the moon
1. Hire SpaceX to send it
2. Bribe an astronaut on the next manned moon mission to bring it with them
3. Bribe an engineer on the next robotic moon mission to send it with the rover
4. Get on a manned mars mission, and throw it out the airlock at just the right speed
5. Massive evacuated sphere (like a balloon but arbitrarily light), aimed very carefully
6. Catapult
7. Send instructions on how to build a copy of the thing, and where to put it, such that an alien race will do it as a gesture of goodwill
8. Same, but with an incentive of some kind
9. Same, but do it acausally
10. Make a miniature moon and put the thing on that
11. Build an AGI with the goal of putting the thing on the moon with 99% confidence, with minimum impact to other things
12. Carve the thing out of the moon’s surface, using lasers from satellites around Earth
13. Build a reverse space elevator: the earth is in a luno-stationary orbit due to tidal locking, so you could in principle build an extremely tall tower on the moon’s surface that came relatively close to earth. Then, you could lower objects down that tower after launching them a relatively short distance, exchanging them for moonrock ballast.
14. Quantum-bogo-send it: check to see if the thing has materialized on the moon. If it hasn’t, destroy this everett branch.
15. Tegmark-1-bogo-send it: check to see if the thing has materialized on the moon. If it hasn’t, destroy a large local region of space.
16. Tegmark-4-bogo-send it: check to see if the thing has materialized on the moon. If it hasn’t, derive a logical contradiction
17. Pray for God to send the thing to the moon
18. Offer to sell your soul to the devil in exchange for the thing being sent to the moon
19. Ask everyone on LessWrong to generate 50 ideas each on how to send a thing to the moon, and do the best one
20. Ask everyone on LessWrong to generate 50 ideas each on how to send a thing to the moon, and do the worst one
21. Ask everyone on LessWrong to generate 50 ideas each on how to send a thing to the moon, and do all of them
22. Ask everyone on LessWrong to generate 50 ideas each on how to send a thing to the moon, put all the letters from all the answers into a big bag, and shake it and draw from it repeatedly until you draw a sentence that describes a strategy for sending a thing to the moon, and then do that
23. Somehow annihilate the earth (except for the thing). The thing will then probably fall to the moon? Probably figure out whether that’s right before annihilating the earth
24. Pull a Raymond-Smullyan-style “will you answer my next question honestly?” scam on the director of NASA, forcing him to kiss you… er… I mean, send the thing to the moon
25. Wait until moon tourism is cheap
26. Start a religion whose central tenets include the belief that this thing being on the moon is a prerequisite for the creation of a universal utopia
27. Non-reverse-space-elevator: build a space elevator, and then throw the thing off the top when the moon is nearby
28. Big ol’ rocket
29. Nuclear explosion based craft
30. Wormhole
31. Unrealistically-good weather control, allowing you to harness the motion of the molecules in the atmosphere to propel objects however you want via extremely careful placement.
32. Redefine or reconceptualize “the moon” to mean wherever the thing is already
33. Redefine or reconceptualize “thing” to mean a thing that’s already on the moon
34. Redefine or reconceptualize “send” to mean keeping the sent thing away from the target
35. Build an extremely detailed simulation of the moon with the thing on it
36. Wait for the sun to engulf the earth-moon system, mixing what’s-left-of-the-thing up with what’s-left-of-the-moon
37. Propel the earth, “wandering earth”-style, to become a moon of Jupiter. Now at least the thing is on a moon.
38. Propel the earth, “wandering earth”-style, to collide with the moon, and be sure the thing is located at the point of collision
39. Throw it really hard
40. Gun
41. Put your face between a really big grapefruit and the moon, put the thing in the grapefruit, and then insert a spoon into the grapefruit. When the grapefruit squirts at your face, pull away quickly
42. Make a popular movie that involves the thing being sent to the moon, in a very memeable way, and hope Elon takes the bait
43. Name the thing “420 69 Doge To The Moon” and hope Elon takes the bait
44. So, y’know how you can levitate things in ultrasonic standing waves? Can you do that with light waves on a super small scale? I think you can, and I think I’ve seen some IBM animation that was made this way? “optical tweezers”, was it called? So, do that, with the standing waves slowly drifting up toward the moon
45. Eh; things seeming to retain a particular identity over time is just a useful fiction—“the thing” next year is just a subset of the causal results of the thing as it is now, not really any more special than any other causal results of the thing as it is now. So since the moon is in the thing’s future light cone already, the job is more-or-less already accomplished.
46. Turn back time to the moment when the parts of the thing were most recently intermixed with the parts of the moon. Maybe the big bang? or maybe some more recent time.
47. Starting somewhere on the equator, move the thing upward by one foot. Tomorrow, move it up by another foot. Continue until you reach the moon. Surely it’s never all that hard to just move the thing one more foot, right?
48. Kick it really hard
49. Nanobot swarm
50. Adult-sized stomp rocket

benwr Jul 29, 2023, 7:09 AM
2 points
0
in reply to: Ted Sanders’s comment on: UFO Betting: Put Up or Shut Up
(I’ve added my $50 to RatsWrong’s side of this bet)

benwr Jul 24, 2023, 6:10 PM
13 points
1
on: “Justice, Cherryl.”
For contingent evolutionary-psychological reasons, humans are innately biased to prefer “their own” ideas, and in that context, a “principle of charity” can be useful as a corrective heuristic

I claim that the reasons for this bias are, in an important sense, not contingent. i.e. an alien race would almost certainly have similar biases, and the forces in favor of this bias won’t entirely disappear in a world with magically-different discourse norms (at least as long as speakers’ identities are attached to their statements).

As soon as I’ve said “P”, it is the case that my epistemic reputation is bound up with the group’s belief in the truth of P. If people later come to believe P, it means that (a) whatever scoring rule we’re using to incentivize good predictions in the first place will reward me, and (b) people will update more on things I say in the future.

If you wanted to find convincing evidence for P, I’m now a much better candidate to find that evidence than someone who has instead said “eh; maybe P?” And someone who has said “~P” is similarly well-incentivized to find evidence for ~P.

benwr May 2, 2023, 8:46 PM
3 points
0
in reply to: Davidmanheim’s comment on: Systems that cannot be unsafe cannot be safe
I would agree more with your rephrased title.

People do actually have a somewhat-shared set of criteria in mind when they talk about whether a thing is safe, though, in a way that they (or at least I) don’t when talking about its qwrgzness. e.g., if it kills 99% of life on earth over a ten year period, I’m pretty sure almost everyone would agree that it’s unsafe. No further specification work is required. It doesn’t seem fundamentally confused to refer to a thing as “unsafe” if you think it might do that.

I do think that some people are clearly talking about meanings of the word “safe” that aren’t so clear-cut (e.g. Sam Altman saying GPT-4 is the safest model yet™️), and in those cases I agree that these statements are much closer to “meaningless”.