These things are possible, yes. Those bad behaviors are not necessarily trivial to access, though.
If you underspecify/underconstrain your optimization process, it may roam into unexpected regions of the free space your constraints leave open.
It is unlikely that the trainer’s first attempt at specifying the optimization constraints during RL-ish fine-tuning will precisely bound the possible implementations to their truly desired target, even if the allowed space does contain that target; underconstrained optimization is a likely default for many tasks.
Which implementations are likely to be found during training depends on what structure is available to guide the optimizer (architecture, training scheme, dataset, and so on), and on how accessible each implementation is to the optimizer given all of those details.
Against the backdrop of the pretrained distribution in LLMs, low-level bad behavior (think Sydney Bing vibes) is easy to access, even accidentally. Agentic coding assistants are harder to access; it’s very unlikely you will accidentally produce an agentic coding assistant. Likewise, it takes effort to specify an effective agent that pursues coherent goals against the wishes of its user; it requires a fair number of bits to narrow the distribution in that way.
More generally, if you use N bits to try to specify behavior A, having a non-negligible chance of accidentally specifying behavior B instead requires, at minimum, that the bits you specify allow B; to make B probable, they would need to imply B. (I think Sydney Bing is actually a good example case to consider here.)
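To put the bit-counting intuition slightly more concretely (this is just my own rough framing, not anything rigorous): a specification that pins down N bits conditions a prior p over implementations on a region S with p(S) ≈ 2^(−N). The chance of accidentally landing in behavior B is then P(B | S) = p(B ∩ S) / p(S), which is zero unless the specified bits allow B at all (B ∩ S is nonempty), and only approaches 1 when they effectively imply B (S is mostly contained in B).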
For a single attempt at specifying behavior, it’s vastly more likely that a developer trains a model that fails in uninteresting ways than that they accidentally specify just enough bits to achieve something that looks about right yet entails extremely bad outcomes. Uninteresting, useless, and easy-to-notice failures are the default because they hugely outnumber ‘interesting’ (i.e. higher-bit-count) failures.
You can still successfully specify bad behavior if you are clever but malicious.
You can still successfully specify bad behavior if you make a series of mistakes. This is not impossible or even improbable; it has already happened and will happen again. Achieving higher-capability bad behavior, however, tends to require more mistakes, and is accordingly less probable.
Because of this, I expect to see lots of early failures, with more severe failures being rarer in proportion to the number of mistakes needed to specify them. I strongly expect these failures to be visible enough that the desire to ship a working product, combined with something like liability frameworks, would have some iterations to work and to spook irresponsible companies into putting nonzero effort into avoiding particularly long series of mistakes. This is not a guarantee of safety.
This is great research and I like it!
I’d be interested in knowing more about how the fine-tuning is regularized and the strength of any KL-divergence-penalty-ish terms. I’m not clear on how the OpenAI fine-tuning API behaves here with default hyperparameters.
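For concreteness, here’s the generic shape of the kind of term I mean, written against a HuggingFace-style causal LM with a frozen reference copy of the model. To be clear, this is my own sketch, not what the OpenAI API actually does; `beta`, `ref_model`, and the overall loss structure are assumptions.

```python
# Illustrative only: not the OpenAI fine-tuning API, just the generic shape of a
# "KL-divergence-penalty-ish" term against a frozen reference copy of the model.
import torch
import torch.nn.functional as F

def kl_regularized_loss(model, ref_model, input_ids, labels, beta=0.1):
    """Next-token loss plus beta * KL(fine-tuned || reference) on the same tokens."""
    logits = model(input_ids).logits                 # trainable model (HF-style causal LM)
    with torch.no_grad():
        ref_logits = ref_model(input_ids).logits     # frozen reference model

    # Standard next-token cross-entropy (shifted by one position).
    ce = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )

    # Penalize drift of the fine-tuned token distribution away from the reference.
    logp = F.log_softmax(logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    kl = F.kl_div(ref_logp, logp, log_target=True, reduction="batchmean")

    return ce + beta * kl
```

With beta = 0 this is plain fine-tuning; larger beta trades task fit for staying close to the reference distribution.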
By default, I would expect that optimizing for a particular narrow behavior with no other constraints would tend to bring along a bunch of learned-implementation-dependent correlates. Representations and circuitry will tend to serve multiple purposes, so if strengthening one particular dataflow happens to strengthen other dataflows and there is no optimization pressure against the correlates, this sort of outcome is inevitable.
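As a toy sketch of what I mean by correlates coming along for the ride (nothing to do with the actual experimental setup; the shapes, heads, and objective here are all made up):

```python
# Toy illustration: two behaviors share a representation, but only one is trained.
import torch

torch.manual_seed(0)

shared = torch.nn.Linear(8, 8)   # shared "representation"
head_a = torch.nn.Linear(8, 1)   # the behavior we optimize
head_b = torch.nn.Linear(8, 1)   # an untouched correlate

x = torch.randn(64, 8)
target_a = torch.ones(64, 1)
before_b = head_b(shared(x)).detach()

# Train only the shared layer so that head A hits its target; nothing in the
# objective mentions head B.
opt = torch.optim.SGD(shared.parameters(), lr=0.1)
for _ in range(200):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(head_a(shared(x)), target_a)
    loss.backward()
    opt.step()

after_b = head_b(shared(x)).detach()
print((after_b - before_b).abs().mean().item())
```

Head B’s outputs move despite never appearing in the objective, purely because the shared layer moved to serve head A and nothing pushed back.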
I expect that this is most visible when using no KL divergence penalty (or similar technique) at all, but that you could still see a little bit of it even with attempted mitigations, depending on the optimization target and what the model has learned. (For example, if fine-tuning is too weak to build up the circuitry to tease apart conditionally appropriate behavior, the primary optimization reward may locally overwhelm the KL divergence penalty because SGD can’t find a better path. I could see this being more likely with PEFT like LoRA, maybe?)
I’d really like to see fine-tuning techniques that more rigorously maintain the output distribution outside the conditionally appropriate region by moving away from sparse-ish scalar reward/preference models; they leave too many degrees of freedom undefined and subject to optimizer roaming. A huge fraction of remaining LLM behavioral oopsies are downstream of fine-tuning imposing a weirdly shaped condition on the pretrained distribution that is almost right, but ends up underspecified in some regions or even outright incorrectly specified. This kind of research is instrumental in motivating that effort.
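As a rough sketch of the direction I mean, assuming access to a frozen reference model and a pool of prompts from outside the target region (the function, batch names, and beta weighting are all made up):

```python
# Made-up sketch: supervise the narrow target behavior directly, and pin
# everything *outside* that region to a frozen reference model's full
# distribution instead of leaving it to a scalar reward.
import torch
import torch.nn.functional as F

def anchored_finetune_loss(model, ref_model, target_batch, off_target_batch, beta=1.0):
    # 1) Ordinary fine-tuning loss on the conditionally appropriate behavior.
    logits = model(target_batch["input_ids"]).logits
    task_loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        target_batch["labels"][:, 1:].reshape(-1),
        ignore_index=-100,
    )

    # 2) On prompts sampled from outside the target region, match the reference
    #    distribution token-by-token so those regions stay where pretraining
    #    (plus earlier alignment) left them.
    off_logits = model(off_target_batch["input_ids"]).logits
    with torch.no_grad():
        ref_logits = ref_model(off_target_batch["input_ids"]).logits
    anchor = F.kl_div(
        F.log_softmax(off_logits, dim=-1),
        F.log_softmax(ref_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )

    return task_loss + beta * anchor
```

The point is that behavior outside the target region gets pinned to the full reference distribution directly, rather than being left to whatever a sparse scalar reward happens to permit.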