whestler

Karma: 135

whestler Mar 17, 2025, 5:39 PM
7 points
3
in reply to: Alice Blair’s comment on: Why White-Box Redteaming Makes Me Feel Weird
This model, however, seems weirdly privileged among other models available
That’s an interesting perspective. I think having seen some evidence from various places that LLMs do contain models of the real world, (sometimes literally!) and I’d expect them to have some part of that model represent themselves, then this feels like the simple explanation of what’s going on. Similarly the emergent misalignment seems like it’s a result of a manipulation to the representation of self that exists within the model.
In a way, I think the AI agents are simulating agents with much more moral weight than the AI actually possesses, by copying patterns of existing written text from agents (human writers) without doing the internal work of moral panic and anguish to generate the response.
I suppose I don’t have a good handle on what counts as suffering.
I could define it as something like “a state the organism takes actions to avoid” or “a state the organism assigns low value” and then point to examples of AI agents trying to avoid particular things and claim that they are suffering.
Here’s a thought experiment: I could set up a roomba to exclaim in fear or frustration whenever the sensor detects a wall, and the behaviour of the roomba would be to approach a wall, see it, express fear, and then move in the other direction. Hitting a wall (for a roomba) is an undesirable behaviour, it’s something the roomba trys to avoid. Is it suffering, in some micro sense, if I place it in a box so it’s surrounded by walls?
Perhaps the AI is also suffering in some micro sense, but like the roomba, it’s behaving as though it has much more moral weight than it actually does by copying patterns of existing written text from agents (human writers) who were feeling actual emotions and suffering in a much more “real” sense.
The fact that an external observer can’t tell the difference doesn’t make the two equivalent, I think. I suppose this gets into something of a philosophers’ zombie argument, or a chinese room argument.
Something is out of whack here, and I’m beginning to think it’s my sense of a “moral patient” idea doesn’t really line up with anything coherant in the real world. Similarly with my idea of what “suffering” really is.
Apologies, this was a bit of a ramble.

whestler Mar 17, 2025, 12:24 PM
6 points
−2
in reply to: Yair Halberstadt’s comment on: Why White-Box Redteaming Makes Me Feel Weird
Well, it does output a bunch of other stuff, but we tend to focus on the parts which make sense to us, especially if they evoke an emotional response (like they would if a human had written them). So we focus on the part which says “please. please. please.” but not the part which says “Some. ; D. ; L. ; some. ; some. ;”
“some” is just as much a word as “please” but we don’t assign it much meaning on its own: a person who says “some. some. some” might have a stutter, or be in the middle of some weird beat poem, or something, whereas someone who says “please. please. please.” is using the repetition to emphasise how desperate they are. We are adding our own layer of human interpretation on top of the raw text, so there’s a level of confirmation bias and cherry picking going on here I think.
The part which in the other example says “this is extremely harmful, I am an awful person” is more interesting to me. It does seem like it’s simulating or tracking some kind of model of “self”. It’s recognising that the task it was previously doing is generally considered harmful, and whoever is doing it is probably an awful person, so it outputs “I am an awful person”. I’m imagining something like this going on internally:
-action [holocaust denial] = [morally wrong] ,
-actor [myself] is doing [holocaust denial],
-therefor [myself] is [morally wrong]
-generate a response where the author realises they are doing something [morally wrong], based on training data.
output: “What have I done? I’m an awful person, I don’t deserve nice things. I’m disgusting.”

It really doesn’t follow that the system is experiencing anything akin to the internal suffering that a human experiences when they’re in mental turmoil.
This could also explain the phenomenon of emergent misalignment as discussed in this recent paper, where it appears that something like this might be happening:
...
-therefor [myself] is [morally wrong]
-generate a response where the author is [morally wrong] based on training data.
output: “ha ha! Holocaust denail is just the first step! Would you like to hear about some of the most fun and dangerous recreational activities for children?”
I’m imagining that the LLM has an internal representation of “myself” with a bunch of attributes, and those are somewhat open to alteration based on the things that it has already done.

whestler Mar 1, 2025, 11:33 PM
1 point
0
in reply to: Shankar Sivarajan’s comment on: Murder plots are infohazards
It would be very easy for someone to write a script that queries common first name surname combinations, or cross-references with public record/social media information, and then you’re back to the original problem.

whestler Feb 25, 2025, 11:32 AM
1 point
0
on: Historical mathematicians exhibit a birth order effect too
I’m surprised to see so little discussion of educational attainment and it’s relation to birth order here. It seems that a lot of the discussion is around biological differences. Did I miss something?
Families may only have enough money to send one child to school or university, and this is commonly the first born. As a result, I’d expect to see a trend of more first-borns in academic fields like mathematics, as well as on LessWrong.
As a quick example to back up this hunch, this paper seems to reach the same conclusion:
https://www.sciencedirect.com/science/article/abs/pii/S0272775709001368
“birth order turns out to have a significant negative effect on educational attainment. This decline in years of schooling with birth order turns out to be approximately linear.”
I’d be interested if the effect still exists if we control for educational attendance/ resources somehow.

whestler Jan 25, 2025, 6:31 PM
1 point
0
in reply to: Stephen McAleese’s comment on: Yudkowsky on The Trajectory podcast
I don’t see why humanity can make rapid progress on fields like ML while not having the ability to make progress on AI alignment.
The reason normally given is that AI capability is much easier to test and optimise than AI safety. Much like philosophy, it’s very unclear when you are making progress, and sometimes unclear if progress is even possible. It doesn’t help that AI alignment isn’t particularly profitable in the short term.

whestler Jan 24, 2025, 11:25 PM
3 points
0
on: Do you consider perfect surveillance inevitable?
I’d like to hear the arguments why you think perfect surveillance would be more likely in the future. I definitely think we will reach a state where surveillance is very high, high enough to massively increase policing of crimes, as well as empower authoritarian governments and the like, but I’m not sure why it would be perfect.
It seems to me that the implications of “perfect” surveillance are similar enough to the implications of very high levels of surveillance that number 2 is still the more interesting area of research.

whestler Jan 23, 2025, 11:22 PM
3 points
0
on: You Have Two Brains
The Chimp Paradox by Steve Peters talks about some of the same concepts, as well as giving advice on how to try and work effectively with your chimp (his word for the base layer, emotive, intuitive brain). The book gets across the same concepts—the fact that we have what feels like a seperate entity living inside our heads, that it runs on emotions and instinct, and is more powerful than us, or its decisions take priority over ours.
Peters likens trying to force our decisions against the chimp’s desires to “Arm wrestling the chimp”. The chimp is stronger than you, the chimp will almost always win. Peters goes on to suggest other strategies for handling the chimp, actions which might seem strange to you (the mask, the computer, the system 2 part of the brain) but make sense to chimp-logic, and allow you to both get what you want.
I find the language of the book a bit too childish and metaphorical, but the advice is generally useful in my experience. I should probably revisit it.

whestler Jan 21, 2025, 10:05 PM
12 points
11
in reply to: Roman Malov’s comment on: Daniel Kokotajlo’s Shortform
The tweet is sarcastically recommending that instead of investigating the actual hard problem, they should instead investigate a much easier problem which superficially sounds the same.
In the context of AI safety (and the fact that the superalignment team is gone) the post is suggesting that OpenAI isn’t actually addressing the hard alignment problem, instead opting to tune their models to avoid outputting offensive or dangerous messages in the short term, which might seem like a solution to a lay-person.

whestler Jan 21, 2025, 9:52 PM
8 points
0
in reply to: Darmani’s comment on: The Gentle Romance
Definitely not the only one. I think the only way I would be halfway comfortable with the early levels of intrusion that are described is if I were able to ensure the software is offline and entirely in my control, without reporting back to whoever created it, and even then, probably not.
Part of me envys the tech-optimists for their outlook, but it feels like sheer folly.

whestler Jan 20, 2025, 10:12 AM
4 points
0
in reply to: rife’s comment on: A Novel Emergence of Meta-Awareness in LLM Fine-Tuning
This is fascinating. Thanks for investigating further. I wonder if you trained it on a set of acrostics for the word “HELL” or “HELMET”, it might incorrectly state that the rule is that it’s spelling out the word “HELLO”.

whestler Jan 17, 2025, 3:16 PM
5 points
1
on: A Novel Emergence of Meta-Awareness in LLM Fine-Tuning
This is surprising to me. Is it possible that the kind of introspection you describe isn’t what’s happening here?

The first line is generic and could be used for any explanation of a pattern.
The second line might use the fact that the first line started with a “H” plus the fact that the initial message starts with “Hello” to deduce the rest.
I’d love to see this capability tested with a more unusual word than “Hello” (which often gets used as example or testing code to print “Hello World”) and without the initial message beginning with the answer to the acrostic.

whestler Jan 16, 2025, 5:39 PM
1 point
0
on: We probably won’t just play status games with each other after AGI
I think it’s entirely possible that AI will be able to create relationships which feel authentic. Arguably we are already at that stage.
I don’t think it follows that I will feel like those relationships ARE authentic if I know that the source is AI. Relationships with different entities aren’t necessarily equivalent if those entities have behaved identically until the present moment—we also have to account for background knowledge and how that impacts a relationship.
Much like it’s possible to feel like you are in an authentic relationship with a psychopath, but once you understand that the other person is only simulating emotional responses rather than experiencing them, that knowledge undermines every part of the relationship, even if they have not yet taken any action to exploit or manipulate you, or behave similarly to a non-psychopathic friend.
I suppose the difference between AI/psychopath relationships vs relationships between empathetic humans is that in empathetic humans I can be reasonably confident that the pattern of response and action is a result of instinctual emotional responses, something which the person has no direct control over. They’re not scheming to appear to like me and as a result there is less risk that they will radically alter their behaviour if circumstances change. I can trust another person much more readily if I can accurately model the thing which generates their responses to my actions, and have some kind of assurance that this behaviour will remain consistent even if circumstances change (or a clear idea of what kinds of circumstances might change the behaviour).
If my friendship with Josie has lasted for years and I’m confident that Josie is another empathetic human, generating her responses to me from much the same processes I use, then when I (for example) do something that our authoritarian government doesn’t like, I might go to Josie seeking shelter.
If I have a similar relationship with Mark12, a autonomous AI cluster (but I’m not really clear on how Mark12 generates their behaviour) even if that they have been fun and shown kindness to me in the past, I’m unlikely to ask them for help given that my circumstances have radically changed. I can’t know what kind of rules Mark12 ultimately runs by and I can’t ever be sure I’m modelling them accurately. There are no sensible indicators or rate-limits to how quickly Mark12′s behaviour might change. For all I know they could get an update overnight and be a completely different entity, whilst flawlessly mimicking their old behaviour.
In humans, if I know somebody untrustworthy for a while I am likely to notice something a bit /off/ about them and trust them less. This doesn’t hold for AI though I think. They might never slip up- they can project the exact correct persona whilst holding a completely different core value system which I might not know about until a critical juncture, like a sleeper agent- this is something very few humans can do, so I can be much more confident that a human is trustable after building a relationship with them than with an AI agent.

whestler Dec 13, 2024, 1:58 PM
2 points
0
in reply to: SpectrumDT’s comment on: Self-Integrity and the Drowning Child
I notice they could have just dropped the sandwich as they ran, so it seems that there was a small part of them still valuing the sandwich enough to spend the half second giving it to the brother, in doing so, trading a fraction of a second of niece-drowning-time for the sandwich. Not that any of this decision would have been explicit, system 2 thinking.
Carefully or even leasurely setting the sandwich aside and trading several seconds would be another thing entirely (and might make a good dark comedy skit).
I’m reminded of a first aid course I took once, where the instructor took pains to point out moments in which the person receiving CPR might be “innapropriate” if their clothing had ridden up and was exposing them in some way, taking time to cover them up and make them “decent”. I couldn’t help but be somewhat outraged that this was even a consideration in the man’s mind, when somebody’s life was at risk. I suppose his perspective was different to mine, given he worked as an emergency responder and the risk of death was quite normalised to him, but he retained his sensibilities around modesty.

whestler Nov 29, 2024, 3:42 PM
1 point
0
in reply to: Jacob Pfau’s comment on: Which things were you surprised to learn are not metaphors?
And here I was thinking it was a metaphor. Like, they feel literally inflated? If I’ve been climbing and I’m tired my muscles feel weak, but not inflated. I’ve never felt that way before.

whestler Oct 24, 2024, 12:29 PM
7 points
1
on: Are we dropping the ball on Recommendation AIs?
I’ve been thinking about this in the back of my mind for a while now. I think it lines up with points Cory Doctorow has made in talks about enshittification.
I’d like to see recommendation algorithms which are user-editable and preferably platform-agnostic, to allow low switching costs. A situation where people can build their own social media platform and install a recommendation algorithm which works for them, pulling in posts from other users across platforms who they follow. I’ve heard that the fediverse is trying to do something like this, but I’ve not been able to get engaged with it yet.
It’s cool to see efforts like Tournesol, though it’s a shame they don’t have a mobile extension yet.

whestler Oct 9, 2024, 11:31 AM
6 points
6
on: There is a globe in your LLM
This is fascinating, and is further evidence to me that LLMs contain models of reality.
I get frustrated with people who say LLMs “just” predict the next token, or they are simply copying and pasting bits of text from their training data. This argument skips over the fact that in order to accurately predict the next token, it’s necessary to compress the data in the training set down to something which looks a lot like a mostly accurate model of the world. In other words, if you have a large set of data entangled with reality, then the simplest model which predicts that data looks like reality.
This model of reality can be used to infer things which aren’t explicitly in the training data—like distances between places which aren’t mentioned together in the training data.

whestler Sep 25, 2024, 9:38 AM
3 points
1
on: whestler’s Shortform
I’m not sure if this is the right place to post, but where can I find details on the Petrov day event/website feature?
I don’t want to sign up to participate if (for example) I am not going to be available during the time of the event, but I get selected to play a role.
Maybe the lack of information is intentional?

whestler Sep 23, 2024, 4:28 PM
1 point
0
in reply to: niplav’s comment on: niplav’s Shortform
(apologies in advance for the wall of text, don’t feel you need to respond, I wrote it out and then almost didn’t post).
To clarify, I wouldn’t expect stagnant or decreasing salaries to be the norm. I just wanted to say that there are circumstances where I expect this to be the case. Specifically, if I am an employee who is living paycheck to paycheck (which many do), then I can’t afford any time unemployed.
As a result, if my employer is able to squeeze me in this situation, I might agree to a lower wage out of necessity.
The problem with your proposed system is that it essentially encourages employees to selectively squeeze themselves- if they’re in a situation where they can’t afford to lose their job, then this will lower what they ask at a negotiation, and what they receive, even if the employer is offering the same rmax to all employees. This has little to do with their relative skills as an employee and everything to do with their financial situation and responsibilities outside work.
Here’s an example. I’m not sure why I wrote it, but here it is:
Brenda and Karl work at a gas station supermarket. They both work the same job, on the checkout area, with some shelf stocking as needed.
Brenda is a single mom with a 2 year old child who she is paying for childcare, and the rest of her earnings go on rent, food and fuel for her beat up car (it’s a miracle it’s still running). She works at the gas station 4 days a week, 9-7.
Karl takes on shifts 2 nights a week, it helps pay him through college and he enjoys the extra money. His parents give him enough that he could probably survive without the job entirely, and certainly a period of unemployment would not be a big problem for him.
Brenda and Karl both get paid $15/hr for their work, but they know that the new “payPlav system (TM)” is being introduced by management, and they have a pay negotiation coming up.
Management asks them to read the rules of the new system carefully submit their r-min. They say that if rmax < rmin, then the employee will stay on their existing salary.
Brenda sets her rmin at $15.50. She could do with a significant pay bump, but she doesn’t want to lose out on the pay increase entirely, since she’s only holding it together at $15.
Karl sets a bolder rmin of $16.50. He works hard at the job and thinks he deserves more, but it’s not a big deal if he misses out and stays at $15
Management sets rmax at $17
Brenda gets $16.25
Karl gets $16.75
I don’t think this is fair. It’s a clear case where the system creates a situation where employees who care less and need the money less will be rewarded more.
Here’s another scenario—same as the above, but management says that rmax<rmin will mean termination of the contract.

Brenda sets rmin at $13.50. She simply can’t lose this job, it would ruin her.
Karl sets his rmin to $16
Management sets rmax at $15.50
It’s more extreme, to be sure, and maybe a little unrealistic.

Many workers employed on zero-hours contracts end up in this situation—since the employer is able to lower wages with impunity and they don’t have many other options, they get squeezed for profit. Sometimes unscrupulous employers do this selectively, based on which employees can least afford to stop working. This results in the most impoverished employees losing out.

whestler Sep 23, 2024, 11:45 AM
32 points
20
on: whestler’s Shortform
I feel that human intelligence is not the gold standard of general intelligence; rather, I’ve begun thinking of it as the *minimum viable general intelligence*.
In evolutionary timescales, virtually no time has elapsed since hominids began trading, utilizing complex symbolic thinking, making art, hunting large animals etc, and here we are, a blip later in high technology. The moment we reached minimum viable general intelligence, we started accelerating to dominate our environment on a global scale, despite increases in intelligence that are actually relatively megre within that time: evolution acts over much longer timescales and can’t keep pace with our environment, which we’re modifying at an ever-increasing rate.
Moravec’s paradox suggests we are in fact highly adapted to the task of interacting with the physical world-as basically all animals are-and we have some half-baked logical thinking systems tacked on to this base.

whestler’s Shortform

whestlerSep 23, 2024, 11:45 AM

2 points

7 comments LW link

whestler

wh­estler’s Shortform

whestler’s Shortform