cousin_it

Karma: 30,164

cousin_it Apr 27, 2025, 2:24 PM
LW: 2 AF: 1
0
AF
in reply to: Steven Byrnes’s comment on: “The Era of Experience” has an unsolved technical alignment problem
Do you think the agent will want the button and ignore the wire, even if during training it already knew that buttons are often connected to wires? Or does it depend on the order in which the agent learns things?

In other words, are we hoping that RL will make the agent focus on certain aspects of the real world that we want it to focus on? If that’s the plan, to me at first glance it seems a bit brittle. A slightly smarter agent would turn its gaze slightly closer to the reward itself. Or am I still missing something?

cousin_it Apr 26, 2025, 10:50 AM
LW: 2 AF: 1
0
AF
in reply to: Steven Byrnes’s comment on: “The Era of Experience” has an unsolved technical alignment problem
I thought about it some more and want to propose another framing.

The problem, as I see it, is learning to choose futures based on what will actually happen in these futures, not on what the agent will feel. The agent’s feelings can even be identical in future A vs future B, but the agent can choose future A anyway. Or maybe one of the futures won’t even have feelings involved: imagine an environment where any mistake kills the agent. In such an environment, RL is impossible.

The reason we can function in such environments, I think, is because we aren’t the main learning process involved. Evolution is. It’s a kind of RL for which the death of one creature is not the end. In other words, we can function because we’ve delegated a lot of learning to outside processes, and do rather little of it ourselves. Mostly we execute strategies that evolution has learned, on top of that we execute strategies that culture has learned, and on top of that there’s a very thin layer of our own learning. (Btw, here I disagree with you a bit: I think most of human learning is imitation. For example, the way kids pick up language and other behaviors from parents and peers.)

This suggests to me that if we want the rubber to meet the road—if we want the agent to have behaviors that track the world, not just the agent’s own feelings—then the optimization process that created the agent cannot be the agent’s own RL. By itself, RL can only learn to care about “behavioral reward” as you put it. Caring about the world can only occur if the agent “inherits” that caring from some other process in the world, by makeup or imitation.

This conclusion might be a bit disappointing, because finding the right process to “inherit” from isn’t easy. Evolution depends on one specific goal (procreation) and is not easy to adapt to other goals. However, evolution isn’t the only such process. There is also culture, and there is also human intelligence, which hopefully tracks reality a little bit. So if we want to design agents that will care about human flourishing, we can’t hope that the agents will learn it by some clever RL. It has to be due to the agent’s makeup or imitation.

This is all a bit tentative, I was just writing out the ideas as they came. Not sure at all that any of it is right. But anyway what do you think?

cousin_it Apr 25, 2025, 10:59 PM
LW: 2 AF: 1
0
AF
in reply to: Steven Byrnes’s comment on: “The Era of Experience” has an unsolved technical alignment problem
I think it helps. The link to “non-behaviorist rewards” seems the most relevant. The way I interpret it (correct me if I’m wrong) is that we can have different feelings in the present about future A vs future B, and act to choose one of them, even if we predict our future feelings to be the same in both cases. For example, button A makes a rabbit disappear and gives you an amnesia pill, and button B makes a rabbit disappear painfully and gives you an amnesia pill.

The followup question then is, what kind of learning could lead to this behavior? Maybe RL in some cases, maybe imitation learning in some cases, or maybe it needs the agent to be structured a certain way. Do you already have some crisp answers about this?

cousin_it Apr 25, 2025, 1:08 PM
LW: 2 AF: 1
0
AF
in reply to: Richard_Ngo’s comment on: Towards a scale-free theory of intelligent agency
Here’s maybe a related point: AIs might find it useful to develop an ability to reveal their internals in a verifiable way under certain conditions (say, when the other AI offers to do the same thing and there’s a way to do a secure “handshake”). So deception ability would be irrelevant, because AIs that can credibly refrain from deception with each other would choose to do so and get a first-best outcome, instead of second-best as voting theory would suggest.

A real world analogy is some of the nuclear precommitments mentioned in Schelling’s book. Like when the US and Soviets knowingly refrained from catching some of each other’s spies, because if a flock of geese triggers the warning radars or something, spies could provide their side with the crucial information that an attack isn’t really happening and there’s no need to retaliate.

cousin_it Apr 25, 2025, 12:30 PM
LW: 4 AF: 3
0
AF
in reply to: Steven Byrnes’s comment on: “The Era of Experience” has an unsolved technical alignment problem
Thanks for the link! It’s indeed very relevant to my question.

I have another question, maybe a bit philosophical. Humans seem to reward-hack in some aspects of value, but not in others. For example, if you offered a mathematician a drug that would make them feel like they solved Riemann’s hypothesis, they’d probably refuse. But humans aren’t magical: we are some combination of reinforcement learning, imitation learning and so on. So there’s got to be some non-magical combination of these learning methods that would refuse reward hacking, at least in some cases. Do you have any thoughts what it could be?

cousin_it Apr 25, 2025, 10:10 AM
LW: 9 AF: 4
0
AF
on: Modifying LLM Beliefs with Synthetic Document Finetuning
Very interesting, thanks for posting this!

One question that comes to mind is, could the layers be flipped? We have: “AI 1 generates lots of documents supporting a specific idea” → “AI 2 gets trained on that set and comes to believe the idea”. Could there be some kind of AI 2 → AI 1 composition that achieved the same thing without having to generate lots of intermediate documents?

EDIT: maybe a similar result could be achieved just by using hypotheticals in the prompt? Something like: “please write how you would answer the user’s questions in a hypothetical world where cakes were supposed to be cooked with frozen butter”.

cousin_it Apr 24, 2025, 6:16 PM
LW: 4 AF: 3
0
AF
on: “The Era of Experience” has an unsolved technical alignment problem
I think this is all correct, but it makes me wonder.

You can imagine reinforcement learning as learning to know explicitly how reward looks like, and how to make plans to achieve it. Or you can imagine it as building a bunch of heuristics inside the agent, pulls and aversions, that don’t necessarily lead to coherent behavior out of distribution and aren’t necessarily understood by the agent. A lot of human values seem to be like this, even though humans are pretty smart. Maybe an AI will be even smarter, and subjecting it to any kind of reinforcement learning at all will automatically make it adopt explicit Machiavellian reasoning about the thing, but I’m not sure how to tell if it’s true or not.

cousin_it Apr 15, 2025, 11:54 AM
6 points
4
on: A Dissent on Honesty
I mean, Flynn Rider was also really good-looking. For a lot of people, maybe most, this look is just unattainable. Even if you can get in as good physical shape (which is far from easy), what if you’re older, shorter, balder, have a goofy face and so on.

cousin_it Apr 12, 2025, 7:58 PM
4 points
0
in reply to: Steven Byrnes’s comment on: What is autism?
Yeah. I had a similar idea, that autism spectrum stuff comes from a person’s internal “volume knobs” being turned to the wrong positions. Some things are too quiet to notice, while others are so loud that it turns into a kind of wailing feedback, like from a too loud microphone. And maybe some of it is fixable with exposure training, but not everything and not easily.

cousin_it Apr 11, 2025, 6:38 PM
3 points
0
in reply to: Tapatakt’s comment on: Weird Random Newcomb Problem
Wei’s motivating example for UDT1.1 is exactly that. It is indeed weird that Eliezer’s FDT paper doesn’t use the idea of optimizing over input-output maps, despite coming out later. But anyway, “folklore” (which is slowly being forgotten it seems) does know the proper way to handle this.

cousin_it Apr 11, 2025, 6:21 PM
3 points
0
on: Weird Random Newcomb Problem
I’m not sure question 1 is analogous to Newcomb’s problem. We’re trying to maximize the money made by the program overall, not just the money made in the specific (x,x) case. In other words, even if you’re “inside” the problem, the UDT-ish thing to do is first figure out the optimal way to respond to all (x,y) pairs, and only then feed it the (x,x) pair. Wei Dai called this “UDT1.1″, first optimizing over all input-output maps and then applying it to a specific input.

cousin_it Apr 5, 2025, 3:46 PM
4 points
0
on: Quarter Inch Cables are Devious

ideally the only 1/4″ cables onstage are short runs to DIs

And all the pedalboard stuff that happens before the DI. But mostly I agree.

Btw, do you already know that a piezo signal is much improved by a preamp with >1 meg ohm input impedance? I figured that out with my electric cello.

cousin_it Apr 4, 2025, 7:57 AM
2 points
0
in reply to: Wei Dai’s comment on: AI #110: Of Course You Know…
I think there’s a worldwide trend toward more authoritarian leaders, which contributed to both these events. And it should raise our probability of e.g. Turkey or China doing something silly. But where this trend comes from, I’m not sure. It certainly predates the current AI wave. It could be due to social media making people more polarized or something. But then again there were plenty of worse dictators in history, long before social media or electricity. So maybe what’s happening now is regression to the mean, and nice democracy was an anomaly in place and time.

cousin_it Apr 3, 2025, 11:30 PM
11 points
4
on: AI #110: Of Course You Know…
Yeah. I remember where I was and how I felt when covid hit in 2020, and when Russia attacked Ukraine in 2022. This tariff announcement was another event in the same row.

And it all seems so stupidly self-inflicted. Russia’s economy was booming until Feb 2022, and US economy was doing fine until Feb 2025. Putin-2022 and Trump-2025 would’ve done better for their countries by simply doing nothing. Maybe this shows the true value of democratic checks and balances: most of the time they add overhead, but sometimes they’ll prevent some exceptionally big and stupid decision, and that pays for all the overhead and then some.

cousin_it Mar 24, 2025, 11:23 AM
6 points
0
in reply to: Viliam’s comment on: Solving willpower seems easier than solving aging
Your examples sound familiar to me too, but after rereading your comment and mine, maybe it all can be generalized in a different way. Namely, that internal motivation leads to a low level of effort: reading some textbooks now and then, solving some exercises, producing some small things. It still feels a bit like staying in place. Whereas it takes external motivation to actually move forward with math, or art, or whatever—to spend lots of effort and try to raise my level every day. That’s how it feels for me. Maybe some people can do it without external motivation, or maybe they lucked into getting external motivation in the right way, I don’t know.

cousin_it Mar 23, 2025, 10:30 PM
5 points
0
in reply to: Viliam’s comment on: Solving willpower seems easier than solving aging
I agree feedback is a big part of it. For example, the times in my life when I’ve been most motivated to play musical instruments were when I had regular opportunities to play in front of people. Whenever that disappeared, the interest went away too.

But also I think some of it is sticky, or due to personality factors. We could even say it’s not about willpower at all, but about value differences. Some people are just more okay with homeostasis, staying at a certain level (which can be lower or higher for different people) and using only as much effort as needed for that. While others keep climbing and applying effort without ever reaching a level that lets them relax. Many billionaires seem to be of that second type. I’m more of the first type, with many of my active periods being prompted by external changes, threats to homeostasis. It’s clear that type 2 achieves more than type 1, but it’s not clear which type is happier and whether one should want to switch types.

cousin_it Mar 22, 2025, 8:04 AM
LW: 19 AF: 9
0
AF
on: Towards a scale-free theory of intelligent agency
Good post. But I thought about this a fair bit and I think I disagree with the main point.

Let’s say we talk about two AIs merging. Then the tuple of their expected utilities from the merge had better be on the Pareto frontier, no? Otherwise they’d just do a better merge that gets them onto the frontier. Which specific point on the frontier is a matter of bargaining, but the fact that they want to hit the frontier isn’t, it’s a win-win. And the merges that get them to the frontier are exactly those that output a EUM agent, maximizing some linear combination of their utilities. If the point they want to hit is in a flat region of the frontier, the merge will involve coinflips to choose which EUM agent to become; and if it’s curvy at that point, the merge will be deterministic. For realistic agents who have more complex preferences than just linearly caring about one cake, I expect the frontier will be curvy, so deterministic merge into a EUM agent will be the best choice.

cousin_it Mar 19, 2025, 9:46 AM
10 points
1
on: Elite Coordination via the Consensus of Power
“Apparatchik” in the USSR was some middle-aged Ivan Ivanovich who’d yell at you in his stuffy office for stepping out of line. His power came from the party apparatus. While the power of Western activists is the opposite: it comes from civil society, people freely associating with each other.

This rhetorical move, calling a Western thing by an obscure and poorly fitting Soviet name, is a favorite of Yarvin: “Let’s talk about Google, my friends, but let’s call it Gosplan for a moment. Humor me.” In general I’d advise people to stay away from his nonsense, it’s done enough harm already.

cousin_it Mar 17, 2025, 1:37 AM
2 points
0
on: Counting Objections to Housing
The objection I’m most interested in right now is the one about induced demand (that’s not the right term but let’s roll with it). Like, let’s say we build many cheap apartments in Manhattan. Then the first bidders for them will be rich people—from all over the world! - who would love to get a Manhattan apartment for a bargain price. The priced-out locals will stay just as priced out, shuffled to the back of the line, because there’s quite many rich people in the world who are willing to outbid them. Maybe if we build very many apartments, and not just in Manhattan but everywhere, the effect will eventually run out; but it’ll take very many indeed.

The obvious fix is to put a thumb on the scale somehow, for example sell these cheap apartments only as primary residences. But then we lose the theoretical beauty of “just build more”, and we really should figure out what mix of “just build more” and “put a thumb on the scale” is the most cost-efficient for achieving what we want. Maybe some thumb on the scale will even give us what we want without building more, since there’s a lot of empty housing and non-primary housing.

cousin_it Mar 6, 2025, 2:08 AM
4 points
2
on: Give Neo a Chance
Maybe you’re pushing your proposal a bit much, but anyway as creative writing it’s interesting to think about such scenarios. I had a sketch for a weird utopia story where just before the singularity, time stretches out for humans because they’re being run at increasing clock speed, and the Earth’s surface also becomes much larger and growing. So humanity becomes this huge, fast-running civilization living inside an AI (I called it “Quetzalcoatl”, not sure why) and advising it how it should act in the external world.