I think this is probably true of you and the people around you, but you likely live in a bubble. To be clear, I’m not making a claim about whether people reading this should travel; I’m just describing what a lot of travel is actually like.
theory: a large fraction of travel is driven by mimetic desire (seeing other people travel and feeling fomo / keeping up with the joneses), by signalling (posting on IG, demonstrating socioeconomic status), or by mental compartmentalization of leisure time (similar to how it’s really bad for your office and bedroom to be the same room).
this explains why every tourist destination has a whole bunch of very popular tourist traps that are in no way unique to, or comparatively advantaged at, that particular destination. for example: shopping, amusement parks, certain kinds of museums.
ok good that we agree interp might plausibly be on track. I don’t really care to argue about whether it should count as prosaic alignment or not. I’d further claim that the following (not exhaustive) are also plausibly good (I’ll sketch each out for the avoidance of doubt because sometimes people use these words subtly differently):
model organisms: probing the minimal sets of assumptions needed to produce various hypothesized spicy alignment failures seems good. what is the least spoonfed demonstration of deceptive alignment we can get that is mechanistically analogous to the real deal? to what extent can we observe early signs of the prerequisites in current models? which parts of the deceptive alignment argument are the most load-bearing?
science of generalization: in practice, why do NNs sometimes generalize and sometimes not? why do some models generalize better than others? in what ways are humans better or worse than NNs at generalizing? can we understand this more deeply without needing mechanistic understanding? (all closely related to ELK)
goodhart robustness: can you make reward models that stay calibrated even under adversarial attack, so that when you optimize against them really hard, you at least never catastrophically goodhart them? (see the toy sketch after this list)
scalable oversight (using humans, possibly giving them a leg up with e.g. secret communication channels between them, and rotating in different humans when we need to simulate amnesia): can we patch all of the problems with e.g. debate? can we extract higher-quality work out of real-life misaligned expert humans for practical purposes (even if it’s maybe a bit cost-uncompetitive)?
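to gesture at what “catastrophically goodhart” means here, a toy sketch (my own illustration under made-up assumptions, not anything from the thread): suppose the reward model’s score is the true reward plus a heavy-tailed error term. then selecting harder and harder on the score stops buying you true reward, even though an error-free oracle would keep improving.

```python
# toy sketch, not a real reward model: proxy = true reward + heavy-tailed error,
# and "optimizing really hard" is just taking the argmax over more candidates.
import numpy as np

rng = np.random.default_rng(0)

def pick(n_candidates: int) -> tuple[float, float]:
    true = rng.normal(size=n_candidates)             # what we actually care about
    error = rng.standard_t(df=2, size=n_candidates)  # heavy-tailed miscalibration
    proxy = true + error                             # what the reward model reports
    # true reward of the proxy-optimal candidate vs. an error-free oracle's pick
    return true[np.argmax(proxy)], true.max()

for n in (10, 1_000, 100_000):
    picks = np.array([pick(n) for _ in range(500)])
    print(f"n={n:>7}: proxy-selected true reward {picks[:, 0].mean():+.2f}, "
          f"oracle {picks[:, 1].mean():+.2f}")
```

as n grows, the oracle’s picks keep getting better while the proxy-selected ones flatline and drift back toward zero: the argmax is increasingly won by candidates with extreme error rather than extreme true reward. a reward model that stays calibrated under that kind of selection pressure is exactly the thing that would prevent this.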
in capabilities, the most memetically successful things were for a long time not the things that actually worked. people would turn their noses up at the idea of simply scaling up models because it wasn’t novel. the papers that are in retrospect the most important did not get much attention at the time (e.g. gpt2 was very unpopular among many academics; the Kaplan scaling laws paper went almost completely unnoticed when it came out; even the gpt3 paper flew under the radar when it first came out).
one example of a thing within prosaic alignment that i feel has the potential for generalizability is interpretability. again, if we take the generalizability criterion and map it onto the capabilities analogy, it would be something like scalability: is this a first step towards something that can actually do truly general reasoning, or is it just a hack that will no longer be relevant once we discover the truly general algorithm that subsumes the hacks? if it is on the path, can we actually shovel enough compute into it (or its successor algorithms) to get to agi in practice, or do we just need way more compute than is practical? and i think at the time of gpt2 these were completely unsettled research questions! it was genuinely unclear whether writing articles about ovid’s unicorn was a real first step towards agi, or just some random amusement that would fade into irrelevancy. i think interp is in a similar position: it could work out really well and eventually become the thing that works, or it could just be a dead end.
some concrete examples
“agi almost certainly happens within the next few decades” → maybe ai progress just kind of plateaus for a few decades; it turns out that gpqa/codeforces etc are like chess, in that we only think they’re hard because the humans who can do them are smart, but they aren’t agi-complete; ai gets used in a bunch of places in the economy, but it’s more like smartphones or something. in this world i should be taking normie life advice a lot more seriously.
“agi doesn’t happen in the next 2 years” → maybe actually scaling current techniques is all you need. gpqa/codeforces actually do just measure intelligence. within like half a year, ML researchers start being way more productive because lots of their job is automated. if i use current/near-future ai agents for my research, i will actually just be more productive.
“alignment is hard” → maybe basic techniques are all you need, because natural abstractions is true, or maybe the red car / blue car argument for why useful models are also competent at bad things is just wrong because generalization can be made to suck. maybe all the capabilities people are just right and it’s not reckless to be building agi so fast.
i think it’s quite valuable to go through your key beliefs and work through what the implications would be if they were false. this has several benefits:
picturing a possible world where your key belief is wrong makes it feel more tangible and so you become more emotionally prepared to accept it.
if you ever do find out that the belief is wrong, you don’t flinch away as strongly, because it doesn’t feel like you will be completely epistemically lost the moment you remove the Key Belief.
you will have more productive conversations with people who disagree with you on the Key Belief.
you might discover strategies that are robustly good whether or not the Key Belief is true.
you will become better at designing experiments to test whether the Key Belief is true.
there are two different modes of learning i’ve noticed.
top down: first you learn to use something very complex and abstract. over time, you run into weird cases where things don’t behave how you’d expect, or you feel like you’re not able to apply the abstraction to new situations as well as you’d like. so you crack open the box and look at the innards and see a bunch of gears and smaller simpler boxes, and it suddenly becomes clear to you why some of those weird behaviors happened—clearly it was box X interacting with gear Y! satisfied, you use your newfound knowledge to build something even more impressive than you could before. eventually, the cycle repeats, and you crack open the smaller boxes to find even smaller boxes, etc.
bottom up: you learn about the 7 Fundamental Atoms of Thingism. you construct the simplest non-atomic thing, and then the second-simplest non-atomic thing. after many painstaking steps, you finally construct something that might be useful. then you repeat the process anew for every other thing you might ever find useful. and then you actually use those things to do something.
generally, i’m a big fan of top down learning, because everything you do comes with a source of motivation for why you want to do the thing; bottom up learning often doesn’t give you enough motivation to care about the atoms. but also, bottom up learning gives you a much more complete understanding.
there is always too much information to pay attention to. without an inexpensive way to filter, the field would grind to a complete halt. style is probably a worse thing to select on than even academia cred, just because it’s easier to fake.
I’m sympathetic to the claim that most prosaic alignment work is basically streetlighting. However, I think there’s a nirvana fallacy going on when you claim that the entire field has gone astray. It’s easiest to illustrate what I mean with an analogy to capabilities.
In capabilities land, there were a bunch of old-school NLP/CV people who insisted that there’s some kind of true essence of language or whatever that these newfangled neural network things weren’t tackling. The claim was that neural networks just learn syntax but not semantics, or they’re ungrounded, or they don’t have a world model, or they’re not representing some linguistic thing, and therefore we haven’t actually made any progress on true intelligence or understanding, etc etc. Clearly NNs are just progress on the surface appearance of intelligence while actually being shallow pattern matching, so any work on scaling NNs is not progress on intelligence at all. I think this position has become more untenable over time. A lot of people held onto this view deep into the GPT era, but now even the skeptics have to begrudgingly admit that NNs are pretty big progress, even if additional Special Sauce is needed, and that the research approaches that tackle general intelligence more directly haven’t done better.
It’s instructive to think about why this was a reasonable thing for people to have believed, and why it turned out to be wrong. It is in fact true that NNs are kind of shallow and pattern-matchy even today, and that literally just training bigger and bigger NNs eventually runs into problems. Early NNs (heck, even very recent NNs) often have trouble with relatively basic reasoning that humans have no problem with. But the mistake is assuming that no progress has been made on "real" intelligence just because no NN so far has perfectly replicated all of human intelligence. Oftentimes, progress towards the hard problem does not immediately look like tackling the meat of the hard problem directly.
Of course, there is also a lot of capabilities work that is just completely useless for AGI. Almost all of it, in fact. Walk down the aisles at neurips and a minimum of 90% of the papers will fall into this category. A lot of it is streetlighting capabilities work in just the way you describe, and it does in fact end up completely unimpactful. Maybe this is because all the good capabilities work happens in labs nowadays, but it was true even at earlier neuripses, back when all the capabilities work got published. Clearly, a field can be simultaneously mostly garbage and still make alarmingly fast progress.
I think this is true for basically everything: most work will be crap (often predictably so ex ante), due in part to bad incentives, and then there will be a few people who still do good work anyway. This doesn’t mean that any pile of crap must have some good work in there, but it does mean that you can’t rule out the existence of good work solely by pointing at the crap and the incentives for crap. I do also happen to believe that there is good work in prosaic alignment, but that goes under the object-level argument umbrella, so I won’t hash it out here.
sure, the thing you’re looking for is the status system that jointly optimizes for alignedness with what you care about and legibility to the people you’re trying to convince.
a lot of unconventional people intentionally choose to ignore normie-legible status systems, which can take the form of expert consensus or some widely accepted form of feedback from reality. for example, many researchers, especially around these parts, just don’t publish in normal ML conferences at all, opting instead to depart into their own status systems. or they don’t care whether their techniques can be used to make very successful products, or to make surprisingly accurate predictions, etc. instead, they substitute some alternative status system, like the approval of a specific subcommunity.
there’s a grain of truth to this, which is that the normal status system is often messed up (academia has terrible, terrible incentives). it is true that many people overoptimize for the normal status system really hard and end up not producing very much value.
but the problem with starting your own status system (or choosing to compete in a less well-agreed-upon one) is that it’s unclear to other people how much stock to put in your status points. it’s too easy to create new status systems. the existing ones might be deeply flawed, but at least their difficulty is a known quantity.
one common retort is that it’s not worth proving yourself to people who are too closed-minded and only accept ideas validated by some legible status system. this is true to some extent, and i’m generally against people spending too much effort optimizing normie status (e.g. i think people should be way less worried about getting a degree in order to be taken seriously / get a job offer), but it’s possible to take this too far.
a rational decision maker should in fact discount claims of extremely illegible quality, because there are simply too many of them and it’s too hard to pick out the good ones even when they are there (that’s sort of the whole point of illegibility!). it seems bad to only bestow the truth upon people who happen to be irrational in ways that cause them to take you seriously by chance. if left unchecked, this kind of thing can also very easily evolve into a cult, where the unmooring from reality checks allows huge epistemic distortions.
a good in-between approach might be to do some very legibly impressive things, just to prove that you could in fact do well at the legible status system if you chose to, and are intentionally choosing not to (as opposed to choosing alternative status systems because you’re not capable of getting status in the legible system).
simple ideas often require tremendous amounts of effort to make work.
twitter is great because it boils saying funny things down to purely a problem of optimizing for funniness, with twitter handling the logistics of discovery and distribution. being e.g. a comedian is a lot more work.
corollary: oftentimes, when smart people say things that are clearly wrong, what’s really going on is that they’re saying the closest thing in their frame that captures the grain of truth.
the world is too big and confusing, so to get anything done (and to stay sane) you have to adopt a frame. each frame abstracts away a ton about the world, out of necessity. every frame is wrong, but some are useful. a frame comes with a set of beliefs about the world and a mechanism for updating those beliefs.
some frames contain within them the ability to become more correct without needing to discard the frame entirely; they are calibrated about and admit what they don’t know. they change gradually as we learn more. other frames work empirically but are a dead end epistemologically because they aren’t willing to admit some of their false claims. for example, many woo frames capture a grain of truth that works empirically, but come with a flawed epistemology that prevents them from generating novel and true insights.
often it is better to be confined inside a well-trodden frame than to be fully unconstrained. the space of all possible actions is huge, and many of them are terrible. on the other hand, staying inside well-trodden frames forever substantially limits the possibility of doing something extremely novel.
it’s (sometimes) also a mechanism for seeking out domains with long positive-tail outcomes, rather than low-variance domains.
the financial industry is a machine that lets you transmute a dollar into a reliable stream of ~4 cents a year, ~forever (or vice versa). it also gives you a risk knob: turning it up increases the expected value of the stream but also its variance, and you can turn it the other way too, taking your risky stream and paying the financial industry to convert it into a reliable stream or a lump sum.
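for concreteness, a back-of-the-envelope sketch (illustrative numbers only, assuming roughly a 4% real return): the “~4 cents a year” figure is just the perpetuity identity, stream = lump sum × rate, or equivalently lump sum = stream / rate.

```python
# back-of-the-envelope perpetuity math; the 4% rate is an illustrative assumption.
def annual_stream(lump_sum: float, rate: float = 0.04) -> float:
    """reliable yearly stream you can draw from `lump_sum` ~forever at return `rate`."""
    return lump_sum * rate

def lump_sum_needed(annual_payment: float, rate: float = 0.04) -> float:
    """lump sum that funds `annual_payment` per year ~forever at return `rate`."""
    return annual_payment / rate

print(annual_stream(1.00))             # ~0.04: a dollar becomes ~4 cents/year
print(lump_sum_needed(40_000))         # ~1e6: ~$40k/year forever costs ~$1M
print(annual_stream(1.00, rate=0.07))  # turning the risk knob up raises the expected
                                       # stream, at the cost of variance (not modeled)
```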
I think the most important part of paying for goods and services is often not the raw time saved, but the cognitive overhead avoided. for instance, I’d pay much more to avoid having to spend 15 minutes understanding something complicated (assuming there is no learning value) than to avoid 15 minutes of waiting. so it’s plausibly more costly to figure out the timetable and fare system, remember to transfer, and navigate the station than to spend the additional time in transit (especially applicable in a new, unfamiliar city).
agree it goes in both directions. time when you hold critical context is worth more than time when you don’t. it’s probably at least sometimes a good strategy to alternate between working much more than is sustainable and then recovering.
my main point is that this is a very different style of reasoning from what people usually do when they talk about how much their time is worth.
some random takes:
you didn’t say this, but when I saw the infrastructure point I was reminded that some people seem to have a notion that any ML experiment you can do outside a lab can be done more efficiently inside a lab because of some magical experimentation infrastructure or something. I think unless you’re spending 50% of your time installing cuda or something, this is basically just not a thing. lab infrastructure lets you run bigger experiments than you could otherwise, but it costs a few sanity points compared to running the small experiment. oftentimes, the most productive way to work inside a lab is to avoid existing software infra as much as possible.
I think safetywashing is a problem, but from the perspective of an xrisky researcher it’s not a big deal, because for the audiences that matter there are safetywashing things that are just way cheaper per unit of goodwill than xrisk alignment work; xrisk is kind of weird and unrelatable to anyone who doesn’t already take it super seriously. I think people who work on non-xrisk safety or distribution-of-benefits stuff should be more worried about this.
this is totally n=1, and in fact I think my experience here is quite unrepresentative of the average lab experience, but I’ve had a shocking amount of research freedom. I’m deeply grateful for this; it has turned out to be incredibly positive for my research productivity (e.g. the SAE scaling paper would not have happened otherwise).