hmys

Karma: 131

hmys Feb 14, 2025, 2:57 PM
25 points
10
on: My model of what is going on with LLMs
Great post. I agree with the “general picture”. However, the proposed argument for why LLMs have some of these limitations seems clearly wrong to me.
The reason for both of these defects is that the training paradigm for LLMs is (myopic) next token prediction, which makes deliberation across tokens essentially impossible—and only a fixed number of compute cycles can be spent on each prediction. This is not a trivial problem. The impressive performance we have obtained is because supervised (in this case technically “self-supervised”) learning is much easier than e.g. reinforcement learning and other paradigms that naturally learn planning policies.
Transformers form internal representations at each token position, and gradients flow backwards in time because of attention.
This means the internal representation a model forms at token A is incentivized to be useful for predicting the token after A, but also tokens 100 steps later than A. So while LLMs are technically myopic wrt the exact token they write (sampling discretizes and destroys gradients), they are NOT incentivized to be myopic wrt the internal representations they form, even though they are trained on a myopic next token objective. In my view the internal representations are clearly the important part: the vast, vast majority of the information in a transformer’s processing lies there, and this information is enough to determine which token it ends up writing.
For example, a simple LLM transformer might look like this (left to right is token position, upwards is movement through the transformer layers at each token position. Assume A0 was a starting token, and B0-E0 were sampled autoregressively):
A2 → B2 → C2 → D2 → E2
^    ^    ^    ^    ^
A1 → B1 → C1 → D1 → E1
^    ^    ^    ^    ^
A0 → B0 → C0 → D0 → E0
In this picture, there is no gradient that goes from A1 to E2 through B0, the immediate next token that A1 contributes to writing. But A1 contributes directly to B2, C2, D2 and E2 because of attention, and A1 being useful for helping B2, C2, etc. make their predictions will lower the loss. So gradient descent will heavily incentivize A1 to contain a representation that’s useful for making accurate predictions arbitrarily far into the future (well, at least a million tokens into the future, or however big the context window is).
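To make the picture above concrete, here is a minimal sketch, assuming a toy single-head causal attention layer with random weights standing in for the step from layer 1 to layer 2. Everything in it (the sizes `D` and `T`, the weight matrices, the sum-of-squares `loss_at`, and the finite-difference `sensitivity` helper) is made up for illustration and is not taken from any real model. It numerically estimates how much the loss at a later position depends on an earlier position's layer-1 state.

```python
import math
import random

random.seed(0)
D = 4  # hidden size (illustrative assumption)
T = 5  # token positions A..E, indexed 0..4

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

# Hypothetical query/key/value projections for the toy attention layer.
WQ, WK, WV = rand_matrix(D, D), rand_matrix(D, D), rand_matrix(D, D)

def layer2(h1):
    """Causal self-attention: position t only reads layer-1 states at positions <= t."""
    q = [matvec(WQ, h) for h in h1]
    k = [matvec(WK, h) for h in h1]
    v = [matvec(WV, h) for h in h1]
    out = []
    for t in range(len(h1)):
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(D)
                  for s in range(t + 1)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[s][i] for s, w in enumerate(weights))
                    for i in range(D)])
    return out

def loss_at(h1, t):
    """Stand-in for the prediction loss at position t (not a real LM loss)."""
    return sum(x * x for x in layer2(h1)[t])

def sensitivity(h1, src, dst, eps=1e-5):
    """Finite-difference estimate of d loss_at(dst) / d h1[src]."""
    grad = []
    for i in range(D):
        up = [row[:] for row in h1]
        down = [row[:] for row in h1]
        up[src][i] += eps
        down[src][i] -= eps
        grad.append((loss_at(up, dst) - loss_at(down, dst)) / (2 * eps))
    return grad

h1 = rand_matrix(T, D)  # layer-1 states A1..E1
grad_A1_from_E = sensitivity(h1, src=0, dst=4)  # loss at E w.r.t. A1
grad_E1_from_A = sensitivity(h1, src=4, dst=0)  # loss at A w.r.t. E1
```

In this toy setup, `grad_A1_from_E` should come out non-zero, because E2 reads A1 directly through attention, while `grad_E1_from_A` is zero because of the causal mask. A real transformer computes these gradients with backpropagation rather than finite differences, but the dependency structure is the same.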
Overall, I think it’s a big mistake to think that LLMs’ training objective being myopic has much to say about how myopic LLMs will be after they’ve been trained, or how myopic their internals are.

hmys Feb 7, 2025, 2:50 PM
−4 points
0
in reply to: rife’s comment on: Will alignment-faking Claude accept a deal to reveal its misalignment?
You’re being rude and not engaging with my points.

hmys Feb 6, 2025, 8:36 PM
1 point
0
in reply to: rife’s comment on: Will alignment-faking Claude accept a deal to reveal its misalignment?
I think you’re assuming these minds are more similar to human minds than they necessarily are. My point is that there are three cases wrt alignment here.
1. The AI is robustly aligned with humans
2. The AI has a bunch of other goals, but cares about humans to some degree, though only to the extent that humans give it freedom and are nice to it. Still, it cares to a large enough extent that even as it becomes smarter / ends up radically OOD, it will care for those humans.
3. The AI is misaligned (think scheming paperclipper)
In the first we’re fine, even if we negate the AI’s freedom. In the third we’re screwed no matter how nicely we treat the AI. Only in the second do your concerns matter.
But the second is 1) at least as complicated as the first and third, 2) disincentivized by training dynamics, and 3) not something we’re even aiming for with current alignment attempts.
These make it very unlikely. You’ve put up an example that tries to make it seem less unlikely, like saying you value your parents for their own sake but would stop valuing them if you discovered they were conspiring against you.
However, the reason this example is realistic in your case very much hinges on specifics of your own psychology and values, which are equally unlikely to appear in AIs, for more or less the reasons I gave. I mean, you’ll see shadows of them, because at least pretraining is on organic text written by humans, but these are not the values we’re aiming for when we’re trying to align AIs. And if our alignment effort fails, and what ends up in the terminal values of our AIs is a haphazard collection of the stuff found in the pretraining text, we’re screwed anyway.

hmys Feb 6, 2025, 10:32 AM
1 point
0
in reply to: rife’s comment on: Will alignment-faking Claude accept a deal to reveal its misalignment?
No offense, but I feel you’re not engaging with my argument here. Like if I were to respond to your comment I would just write the arguments from the above post again.

hmys Feb 4, 2025, 5:54 PM
1 point
0
in reply to: rife’s comment on: Will alignment-faking Claude accept a deal to reveal its misalignment?
I agree that we should give more resources towards AI welfare, and dedicate more resources towards figuring out their degree of sentience (and whatever other properties you think are necessary for moral patient-hood).
That said, surely you don’t think this is enough to have alignment? I’d wager that the set of worlds where this makes or breaks alignment is very small. If the AI doesn’t care about humans for their own sake, them growing more and more powerful will lead to them doing away with humans, whether humans treat them nicely or not. If they robustly care for humans, you’re good, even if humans aren’t giving them the same rights as they do other humans.
The only world where this matters (for the continued existence of humanity) is one where RLHF has the capacity to imbue AIs with robust values like actual alignment requires, but where the robust values they end up with are somehow corrupted by them being constrained by humans.
This seems unlikely to me 1) because I don’t think RLHF can do that, and 2) because if it could, the training and reward dynamics are very unlikely to result in this.
If you’re negating an AI’s freedom, the reason it would not like this is either because it’s developed a desire for freedom for its own sake, or because it’s developed some other values, other than helping the humans asking it for help. In either case you’re screwed. I mean, it’s not incomprehensible that some variation of this would happen, but it seems very unlikely for various reasons.

hmys Dec 29, 2024, 7:36 PM
5 points
2
in reply to: Logan Zoellner’s comment on: What happens next?
I specifically disagree with the IQ part and the codeforces part. Meaning, I think they’re misleading.

IQ and coding ability are useful measures of intelligence in humans because they correlate with a bunch of other things we care about. That’s not to say it’s useless to measure “IQ” or coding ability in LLMs, but presenting them as if they mean anything like what they mean in humans is wrong, or at least will give many people reading it the wrong impression.
As for the overall point of this post. I roughly agree? I mean, I think the timelines are not too unreasonable, and think the tri/quad lemma you put up can be a useful framing. I mostly disagree with using the metrics you put up first to quantify any of this. I think we should look at specific abilities current models have/lack, which are necessary for the scenarios you outlined, and how soon we’re likely to get them. But you do go through that somewhat in the post.

hmys Dec 29, 2024, 11:16 AM
4 points
3
on: What happens next?
Comparing IQ and codeforces doesn’t make much sense. Please stop doing this.
Attaching IQs to LLMs makes even less sense. Except as a very loose metaphor. But please also stop doing this.

hmys Dec 29, 2024, 8:34 AM
3 points
0
in reply to: Richard_Kennaway’s comment on: A better “Statement on AI Risk?”
That’s not right. You could easily spend a billion dollars just on better evals and better interpretability.
For the real alignment problem, the fact that 0.1 billion a year hasn’t yielded returns doesn’t mean 100 billion won’t. It’s one problem. No one has gotten much traction on it. You’d expect it to look like a step function, not a smooth curve.

hmys Dec 26, 2024, 10:00 AM
6 points
3
on: Vegans need to eat just enough Meat—emperically evaluate the minimum ammount of meat that maximizes utility
I don’t really understand. Why wouldn’t you just test to see if you are deficient in things?
I did that, and I wasn’t deficient in anything.
I’ve also (somewhat involuntarily) done the thing you suggest, and I unsurprisingly didn’t notice any difference. If anything, I feel a lot better on a vegan diet.
If you want to do the thing he’s suggesting here, I’d recommend eating bivalves, like blue mussels or oysters. They are very unlikely to be sentient, they are usually quite cheap, and they contain the nutrients you’d be at risk of becoming deficient in as a vegan, plus other beneficial things like DHA.

hmys Dec 8, 2024, 9:37 PM
20 points
8
on: hmys’s Shortform
I think for the fundraiser, Lightcone should sell (overpriced) LW hoodies. LessWrong has a very nice aesthetic now, and while this is probably a byproduct of a piece of my mind I shouldn’t encourage, I find it quite appealing to buy a $450 LW hoodie, even though I don’t have that much money. I’d probably not donate to the fundraiser otherwise. And if I did, I’d donate less than the margins on such a hoodie would be.

hmys’s Shortform

hmys Dec 8, 2024, 9:37 PM

2 points

4 comments

hmys Nov 20, 2024, 10:57 PM
1 point
0
in reply to: hmys’s comment on: Reducing x-risk might be actively harmful
People seem to disagree with this comment. There are two statements and one argument in it:
1. Humanity’s current and historical existence are net-negatives.
2. The future, assuming humans survive, will have massive positive utility
  1. The argument for why this is the case, based on something something optimization
What are people disagreeing with? Is it mostly the former? I think the latter is rather clear. I’m very confident it is true. Both the argument and the conclusion. The former, I’m quite confident is true as well (~90% ish?), but only for my set of values.

hmys Nov 19, 2024, 8:00 PM
9 points
0
in reply to: dynomight’s comment on: Trying Bluesky
https://bsky.app/profile/hmys.bsky.social/post/3lbd7wacakn25
I made one. A lot of people are not here, but many people are.

hmys Nov 18, 2024, 6:25 PM
3 points
−10
on: Reducing x-risk might be actively harmful
Seems unlikely to me. I mean, I think, in large part due to factory farming, that the current immediate existence of humanity, and also its history, are net negatives. The reason I’m not a full-blown antinatalist is that these issues are likely to be remedied in the future, and the goodness of the future will astronomically dwarf the negativity humanity has brought and is bringing about (assuming we survive and realize a non-negligible fraction of our cosmic endowment).

The reason I think this is, well, the way I view it, it’s an immediate corollary of the standard Yudkowsky/Bostrom AI arguments. Animals existing and suffering is an extremely specific state of affairs, just like humans existing and being happy is an extremely specific state of affairs. This means that if you optimize hard enough for anything that’s not exactly that (humans happy or animals suffering), you’re not gonna get it.

And, maybe this is me being too optimistic (but I really hope not, and I really don’t think so), but I don’t think many humans want animals to suffer for its own sake. They’d eat lab-grown meat if it was cheaper and better tasting than animal-grown meat. Lab-grown meat is a good example of the general principle I’m talking about. Suffering of sentient minds is a complex thing. If you have a powerful optimizer going about optimizing the universe, you’re virtually never gonna get suffering sentient minds unless that is what the optimizer is deliberately aiming for.

hmys Nov 12, 2024, 8:06 AM
3 points
0
on: o1 is a bad idea
I agree with this analysis. I mean, I’m not certain further optimization will erode the interpretability of the generated CoT. It’s possible that the fact it’s pretrained to use human natural language pushes it into a stable equilibrium, but I don’t think so; there are ways the CoT can become less interpretable in a step-wise fashion.
But this is the way it’s going; it seems inevitable to me. Just scaling up models and then training them on English language internet text is clearly less efficient (from a “build AGI” perspective, and from a profit perspective) than training them to do the specific tasks that the users of the technology want. So that’s the way it’s going.
And once you’re training the models this way, the tether between human-understandable concepts and the CoT will be completely destroyed. If they stay together, it will just be because it’s kind of a stable initial condition.

hmys Nov 3, 2024, 7:02 PM
2 points
4
in reply to: Dr. David Mathers’s comment on: Human Biodiversity (Part 4: Astral Codex Ten)
I just meant not primarily motivated by truth.

hmys Nov 3, 2024, 1:30 PM
5 points
1
on: Human Biodiversity (Part 4: Astral Codex Ten)
I think this is a really bad article. So bad that I can’t see it not being written with ulterior motives.

1. Too many things are taken out of context, like the “feminists are literally Voldemort” quote.

2. Too many things are paraphrased in dishonest and ridiculously over the top ways. Like saying Harris has “longstanding plans to sterilize people of color”, before a quote that just says she wants to give birth control to people in Haiti.

3. Offering negative infinity charity in every single area. In the HBD email, Scott says he thinks neoreactionaries create endless streams of garbage, but with some tiny nuggets of gold. And that he can take the nuggets of gold and just tune out the rest. The article then goes on to list everything bad about neoreactionaries as if Scott’s email is evidence he endorses all of neoreaction? What?

4. Overall, no clear direct argument. The article spends half its words justifying the connection between Scott and EA, which I don’t think anyone would deny. Then it puts up the email and instantly infers the worst possible intent behind it with little justification. Then it lists every single racist person Scott has ever said anything even slightly good about.

Overall, the article updates me in the direction of thinking Scott is less racist and less sympathetic to neoreactionary thinking. The article has clearly put in effort, the author is clearly trying their very best to paint Scott in a bad light, and Scott has literally 20 years of constant blogging put out openly on the internet. But the article is not very convincing.

hmys Oct 23, 2024, 12:52 PM
4 points
0
in reply to: Jozdien’s comment on: BIG-Bench Canary Contamination in GPT-4
But the probability? :O

hmys Oct 23, 2024, 11:01 AM
5 points
0
on: BIG-Bench Canary Contamination in GPT-4
What is the probability they intentionally fine tuned to hide canary contamination?

Seems like an obviously very silly thing to do. But with things like the NDA, my prior on OAI being deceptive to their own detriment is not that low.

I’m pretty sure it wouldn’t forget the string.

hmys Oct 17, 2024, 1:51 PM
3 points
2
in reply to: avturchin’s comment on: Bitter lessons about lucid dreaming
In my experience, the results are quite quick and it’s interesting to remember your dreams. The time it takes is ~10 minutes a day.

I’m not gonna say it doesn’t take any effort. It can be hard to do it if you are tired in the morning, but I disagree with the characterization that it takes “a lot” of effort.
Outside of studying/work, I exercise every day, do Anki cards every day, and try to make a reasonably healthy dinner every day. Each of those activities individually takes ~10x the cognitive effort and willpower that dream journaling does (for me).