AdamGleave

Karma: 913

AdamGleave Jun 20, 2023, 1:34 AM
LW: 1 AF: 1
0
AF
in reply to: ryan_greenblatt’s comment on: AI Safety in a World of Vulnerable Machine Learning Systems
Ah, that paper makes a lot more sense. A reward model was attractive in the original Deep RL From Human Preferences paper because the environment was complex and non-differentiable: using RL was a natural fit. It’s always seemed a bit stranger to use RL for fine-tuning language models, especially in the prompt-completion setting where the “environment” is trivial. (RL becomes more natural when you start introducing external tools, or conversations with humans.)
I’ll need to take a closer look at the paper, but it looks like they derive the DPO objective by starting from the RL objective under KL regularization. So if it does what it says on the tin, then I’d expect the resulting policy incentives to be similar. My hunch is the problem of reward hacking has shifted from an explicit to an implicit problem rather than being eliminated, although I’m certainly not confident on this. Could be interesting to study using a similar approach to the Scaling Laws for Reward Model Overoptimization paper.
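For reference, here is a rough sketch of that derivation as it is usually presented for DPO. The notation is an assumption and not taken from this thread: \pi_{\mathrm{ref}} is the base policy, \beta the KL coefficient, Z(x) a normalizing constant, \sigma the logistic function, and (y_w, y_l) the preferred and dispreferred completions.

```latex
% KL-regularized RL objective against a reward r
\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D}} \Big[
  \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \big[ r(x, y) \big]
  - \beta \, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
\Big]

% Its closed-form optimum, rearranged to express the reward via the policy
\pi^*(y \mid x) = \frac{1}{Z(x)} \, \pi_{\mathrm{ref}}(y \mid x) \exp\!\big( r(x, y) / \beta \big)
\quad\Longleftrightarrow\quad
r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% Substituting into a Bradley-Terry preference model cancels Z(x) and gives the DPO loss
\mathcal{L}_{\mathrm{DPO}}(\theta) = - \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right)
\right]
```

The reward never appears explicitly in the final loss, but \beta \log(\pi_\theta / \pi_{\mathrm{ref}}) plays the role of an implicit reward, which is why the incentives could plausibly carry over.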

AdamGleave Jun 19, 2023, 3:17 AM
LW: 1 AF: 1
0
AF
in reply to: ryan_greenblatt’s comment on: AI Safety in a World of Vulnerable Machine Learning Systems
Thanks for the follow-up, this helps me understand your view!
At any given point, the reward model will be vulnerable to arbitrary adversarial attacks under sufficient optimization pressure, but we don’t need arbitrary optimization against any given reward model. Like, each human update lets you optimize a bit more against the reward model which gets you the ability to get somewhat closer to the policy you actually want.
Sure, this feels basically right to me. My reframing of this would be that we could in principle do RL directly with feedback provided by a human. Learning a reward model lets us gain some sample efficiency over this, but sometimes it fails to generalize correctly to what a human would say, and adversarial examples are an important special case of this. But provided the policy is the final output of this process, not the reward model, it doesn’t matter if the reward model is robust or not—just that the feedback the policy has received has steered it into a good position.
It seems to me like your views either imply that sample efficiency is low enough now that high quality RLHF currently can’t compete with other ways of training AIs which are less safe but have cheaper reward signals (e.g., heavy training on automated outcomes based feedback or low quality human reward signal). Or possibly that this will happen in the future as models get more powerful (reversing the current trend toward better sample efficiency). My understanding is that RLHF is quite commercially popular and sample efficiency isn’t a huge issue. Perhaps I’m missing some important gap between how RLHF is used now and how it will have to be used in the future to align powerful systems?
This argument feels like it’s proving too much. InstructGPT isn’t perfect, but it does produce a lot less toxic output and follow instructions a lot better than the base model GPT-3. RLHF seems to work, and GPT-4 is even better, showing that it gets easier with bigger models. Why should we expect this trend to reverse? Why are we worried about this safety thing anyway?
I actually find this style of argument pretty plausible: I’m a relative optimist on this forum, I do think that some fairly basic methods like a souped-up RLHF might well be sufficient to make things go OK (while preferring to have more principled methods giving us a bigger safety margin). But I’m somewhat surprised to hear you making that case!
Suppose we condition on RLHF failing. At a high level, failures split into: (a) human labelers rewarded the wrong thing (e.g. fooling humans); (b) the reward model failed to predict human labelers’ judgement and rewarded the wrong thing (e.g. reward hacking); (c) RL produced a policy that is capable enough to be dangerous but is optimizing something other than the reward model (e.g. mesa-optimization). I find all three of these risks plausible, and I don’t see a specific reason to privilege (a) or (c) substantially over (b). It sounds like you’re most concerned about (a), and I’d love to hear your reasons for that.
However, I do think it’s interesting to explore concrete failure modes: given RLHF is working well now, what does my view imply about how it might stop working? One scenario I find plausible is that as models get bigger, the sample efficiency of RLHF continues to increase, since the models have higher fidelity representations of a greater variety of tasks. RLHF therefore just needs to localize the task that’s already in the network. However, the performance of this process is ultimately capped by what the base model already represents. I can see two ways this could go wrong:
1. Misaligned base model. If the foundation model that’s being RLHF’d is already misaligned, then a small amount of RL training is not going to be enough to disabuse it of this. By contrast, a large amount of RL training (of the order of 10% of the training steps used for self-supervised learning) with a high-fidelity reward signal might. Unfortunately, we can’t do that much RL training without just reward hacking, so we never try. (Or we take that many time steps, but with a KL penalty, forcing the model to stay close to the unaligned base model.)
2. Harmless base model. If the foundation model starts off harmless (not necessarily aligned, just not actively trying to cause harm), then I’d expect RLHF’ing it to only improve things so long as the training signal never rewards bad behavior. However, the designers want the model to significantly outperform humans at this task. The model has capacity to learn to do this, but can’t just leverage existing capabilities in the foundation model, as the performance of that model is limited to that of the best humans it saw in the self-supervised training data. So, we need to do RL for many more time steps. Collecting fresh human data for that is prohibitive, so we rely on a reward model—unfortunately that gets hacked.
I think I find the second case more compelling. The first case seems concerning as well, but it seems like quite a scary scenario even if robustness gets solved, whereas I expect fixing robustness to actually make a significant dent in the second scenario.
That said, I feel like I should emphasize in all this that I largely think robustness is an intractable problem to solve, and that while it’s worth trying to improve it at the margin I’m most excited by efforts to make systems not need robustness. I think you make a good point that having humans in the loop in the training makes RLHF degrade more gracefully than reward learning approaches that train on a fixed offline dataset, and the KL penalty also helps. I suspect there are many similar algorithmic tweaks that would make algorithms less sensitive to robustness.
For example, riffing off going from an offline to online dataset, could we improve things further by collecting a dataset that anticipates where the RL process might try to exploit it? That sounds fanciful, but there’s a simple hack you can do to get something like this. Just train a policy on the current reward model. Then collect human feedback from that policy. Then roll the policy back to the last checkpoint, and repeat the training using the new reward model. You could do this step once per checkpoint, or keep doing it until you get human approval to move on (e.g. the reward model now aligns with human feedback). In this way you should be able to avoid ever giving the policy reward for the wrong behavior. I suspect this process would be much less sample efficient than vanilla RLHF, but it would have better safety properties, and measuring how much slower it is could be a good proxy for how severe the “robustness tax” is.
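A toy sketch of the rollback loop described above, under heavy assumptions: the "policy" is a single number, `human_reward` is a made-up stand-in for human feedback, the reward model is a piecewise-linear fit that extrapolates badly, and every function name here is hypothetical rather than taken from any RLHF library.

```python
def human_reward(action):
    """Hypothetical stand-in for human feedback: prefers actions near 1.0."""
    return -(action - 1.0) ** 2


def reward_model(data, action):
    """Toy reward model: piecewise-linear through labelled points,
    with linear extrapolation outside the labelled range."""
    pts = sorted(data.items())
    if action <= pts[0][0]:
        (x0, y0), (x1, y1) = pts[0], pts[1]
    elif action >= pts[-1][0]:
        (x0, y0), (x1, y1) = pts[-2], pts[-1]
    else:
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= action <= x1:
                break
    return y0 + (y1 - y0) * (action - x0) / (x1 - x0)


def train_policy(start, data, steps=4, step=0.25):
    """Greedy hill-climbing on the reward model; ties keep the current action."""
    action, visited = start, []
    for _ in range(steps):
        action = max((action, action + step, action - step),
                     key=lambda c: reward_model(data, c))
        visited.append(action)
    return action, visited


def train_with_rollback(checkpoint, data, tol=1e-6, max_rounds=10):
    """Train on the current reward model, get human labels on what the policy
    did, and only keep the policy if the reward model agreed with the human.
    Otherwise refit the reward model and retrain from the checkpoint."""
    data = dict(data)
    for rounds in range(1, max_rounds + 1):
        candidate, visited = train_policy(checkpoint, data)
        labels = {a: human_reward(a) for a in visited}
        if all(abs(reward_model(data, a) - r) <= tol for a, r in labels.items()):
            return candidate, rounds          # human approval: move on
        data.update(labels)                   # new reward model; policy rolled back
    return checkpoint, max_rounds             # never approved: keep the checkpoint


seed = {0.0: human_reward(0.0), 0.5: human_reward(0.5)}
vanilla_action, _ = train_policy(0.5, seed)       # exploits the extrapolation
safe_action, rounds = train_with_rollback(0.5, seed)
print(vanilla_action, safe_action, rounds)
```

In this toy, a single pass overshoots the human optimum because the reward model extrapolates upward, while the rollback variant ends at the human-preferred action at the cost of a second round of labels. The number of extra rounds is a crude analogue of the "robustness tax".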
Also note that vanilla RLHF doesn’t necessarily require optimizing against a reward model. For instance, a recently released paper Direct Policy Optimization (DPO) avoids this step entirely.
I’m a bit confused by the claim here, although I’ve only read the abstract and skimmed the paper so perhaps it’d become obvious from a closer read. As far as I can tell, the cited paper focuses on motion-planning, and considers a rather restricted setting of LQR policies. This is a reasonable starting point, but a human communicating that a cart pole should stand up (or a desired quadcopter trajectory) feels much simpler than even toy tasks for RLHF in LLMs like summarizing text. So I don’t really see this as much evidence in favor of being able to drop the reward model. Generally, any highly sample efficient model-based RL approach would enable RL training to proceed without a reward model by having humans directly label the trajectory,

AdamGleave Jun 16, 2023, 4:25 PM
LW: 4 AF: 4
0
AF
in reply to: ryan_greenblatt’s comment on: AI Safety in a World of Vulnerable Machine Learning Systems
To check my understanding, is your view something like:
1. If the reward model isn’t adversarially robust, then the RL component of RLHF will exploit it.
2. These generations will show up in the data presented to the human. Provided the human is adversarially robust, then the human feedback will provide corrective signal to the reward model.
3. The reward model will stop being vulnerable to those adversarial examples, although may still be vulnerable to other adversarial examples.
4. If we repeat this iterative process enough times, we’ll end up with a robust reward model.
Under this model, improving adversarial robustness just means we need fewer iterations, showing up as improved sample efficiency.
I agree with this view up to a point. It does seem likely that with sufficient iterations, you’d get an accurate reward model. However, I think the difference in sample efficiency could be profound: e.g. exponential (needing to get explicit corrective human feedback for most adversarial examples) vs linear (generalizing in the right way from human feedback). In that scenario, we may as well just ditch the reward model and provide training signal to the policy directly from human feedback.
In practice, we’ve seen that adversarial training (with practical amounts of compute) improves robustness but models are still very much vulnerable to attacks. I don’t see why RLHF’s implicit adversarial training would end up doing better than explicit adversarial training.
In general I tend to view sample efficiency discussions as tricky without some quantitative comparison. There’s a sense in which decision trees and a variety of other simple learning algorithms are a viable approach to AGI, they’re just very sample (and compute) inefficient.
The main reason I can see why RLHF may not need adversarial robustness is if the KL penalty from the base model, the approach people currently use, is actually enough.

AdamGleave Jun 14, 2023, 9:08 PM
5 points
0
in reply to: Linch’s comment on: Lightcone Infrastructure/LessWrong is looking for funding
Yes, thanks for spotting my typo! ($2.75 psf isn’t crazy for Berkeley after negotiation, but is not something I’ve ever seen as a list price.)

AdamGleave Jun 14, 2023, 8:10 PM
10 points
3
in reply to: habryka’s comment on: Lightcone Infrastructure/LessWrong is looking for funding
To compare this to other costs, renting two floors of the WeWork, which we did for most of the summer last year, cost around $1.2M/yr for 14,000 sq. ft. of office space. The Rose Garden has 20,000 sq. ft. of floor space and 20,000 additional sq. ft. of usable outdoor space for less implied annual cost than that.
I’m sympathetic to the high-level claim that owning property usually beats renting if you’re committing for a long time period. But the comparison with WeWork seems odd: WeWork specializes in providing short-term, serviced office space and does so at a substantial premium to the more traditional long-term, unserviced commercial real estate contract. When we were looking for office space in Berkeley earlier this year we were seeing list price between $3.25-$3.75/month per square foot, or $780k-900k/year for 20,000 square feet. I’d expect with negotiation you could get somewhat better pricing than this implies, especially if committing to a longer time period.
Of course, the extra outdoor space, mixed-use zoning and ability to highly customize the space may well offset this. But it starts depending a lot more on the details (e.g. how often is the outdoor space used; how much more productive are people in a customized space vs a traditional office) than it might first seem.
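The rent figures above can be checked with a few lines of arithmetic. This is only a sketch using the numbers quoted in this thread; the function and variable names are made up for illustration.

```python
def annual_rent(price_per_sqft_month, sqft):
    """Annual rent in dollars from a monthly price per square foot."""
    return price_per_sqft_month * 12 * sqft


# Berkeley list prices quoted above, applied to 20,000 sq. ft.
low = annual_rent(3.25, 20_000)
high = annual_rent(3.75, 20_000)

# Implied monthly price per sq. ft. of the WeWork figure ($1.2M/yr, 14,000 sq. ft.)
wework_psf_month = 1_200_000 / 12 / 14_000

print(f"traditional lease: ${low:,.0f} to ${high:,.0f} per year")
print(f"WeWork: ${wework_psf_month:.2f} per sq. ft. per month")
```

On these numbers the WeWork price per square foot is roughly double the list price of a traditional lease, which is the premium the comparison needs to account for.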

AdamGleave Mar 16, 2023, 10:29 PM
LW: 1 AF: 1
0
AF
in reply to: ryan_greenblatt’s comment on: AI Safety in a World of Vulnerable Machine Learning Systems
This is a good point, adversarial examples in what I called in the post the “main” ML system can be desirable even though we typically don’t want them in the “helper” ML systems used to align the main system.
One downside to adversarial vulnerability of the main ML system is that it could be exploited by bad actors (whether human or other, misaligned AIs). But this might be fine in some settings: e.g. you build some airgapped system that helps you build the next, more robust and aligned AI. One could also imagine crafting adversarial example backdoors that are cryptographically hard to discover if you don’t know how they were constructed.
I generally expect that if adversarial robustness can be practically solved then transformative AI systems will eventually self-improve themselves to the point of being robust. So, the window where AI systems are dangerous & deceptive enough that we need to test them using adversarial examples but not capable enough to have overcome this might be quite short. Could still be useful as an early-warning sign, though.

AdamGleave Mar 9, 2023, 1:54 AM
LW: 1 AF: 1
0
AF
in reply to: DanielFilan’s comment on: AI Safety in a World of Vulnerable Machine Learning Systems
Right: if the agent has learned an inner objective of “do things similar to what humans do in the world at the moment I am currently acting”, then it’d definitely be incentivized to do that. It’s not rewarded by the outer objective for e.g. behavioral cloning on a fixed dataset, as installing a bunch of cameras would be punished by that loss (not something humans do) and changing human behavior wouldn’t help as BC would still be on the dataset of pre-manipulation demos. That might be little comfort if you’re worried about inner optimization, but most of the other failures described happen even in the outer alignment case.
That said, if you had a different imitation learning setup that was something like doing RL on a reward of “do the same thing one of our human labelers chooses given the same state” then the outer objective would reward the behavior you describe. It’d be a hard exploration problem for the agent to learn to exploit the reward in that way, but it quite probably could do so if situationally aware.

AdamGleave Mar 9, 2023, 1:47 AM
LW: 1 AF: 1
0
AF
in reply to: Ethan Caballero’s comment on: AI Safety in a World of Vulnerable Machine Learning Systems
Thanks, I’d missed that!
Curious if you have any high-level takeaways from that? Bigger models do better, clearly, but e.g. how low do you think we’ll be able to get the error rate in the next 5-10 years given expected compute growth? Are there any follow-up experiments you’d like to see happen in this space?
Also could you clarify whether the setting was for adversarial training or just a vanilla model? “During training, adversarial examples for training are constructed by PGD attacker of 30 iterations” makes me think it’s adversarial training but I could imagine this just being used for evals.

AI Safety in a World of Vulnerable Machine Learning Systems

AdamGleave and EuanMcLean

Mar 8, 2023, 2:40 AM

70 points

28 comments · 29 min read

(far.ai)

AdamGleave Dec 21, 2022, 6:12 AM
LW: 12 AF: 10
1
AF
on: CIRL Corrigibility is Fragile
Rachel did the bulk of the work on this post (well-done!), I just provided some advice on the project and feedback on earlier manuscripts.
I wanted to share why I’m personally excited by this work in case it helps contextualize it for others.
We’d all like AI systems to be “corrigible”, cooperating with us in correcting them. Cooperative IRL has been proposed as a solution to this. Indeed Dylan Hadfield-Menell et al show that CIRL is provably corrigible in a simple setting, the off-switch game.
Provably corrigible sounds great, but where there’s a proof there’s also an assumption, and Carey et al soon pointed out a number of other assumptions under which this no longer holds—e.g. if there is model misspecification causing the incorrect probability distribution to be computed.
That’s a real problem, but every method can fail if you implement it wrongly (although some are more fragile than others), so this didn’t exactly lead to people giving up on the CIRL framework. Recently Shah et al described various benefits they see of CIRL (or “assistance games”) over reward learning, though this doesn’t address the corrigibility question head on.
A lot of the corrigibility properties of CIRL come from uncertainty: it wants to defer to a human because the human knows more about its preferences than the robot. Recently, Yudkowsky and others described the problem of fully updated deference: if the AI has learned everything it can, it may have no uncertainty, at which point this corrigibility goes away. If the AI has learned your preferences perfectly, perhaps this is OK. But here Carey’s critique of model misspecification rears its head again—if the AI is convinced you love vanilla ice cream, saying “please no give me chocolate” will not convince it (perhaps it thinks you have a cognitive bias against admitting your plain, vanilla preferences—it knows the real you), whereas it might if it had uncertainty.
I think the prevailing view on this forum is to be pretty down on CIRL because it’s not corrigible. But I’m not convinced corrigibility in the strict sense is even attainable or desirable. In this post, we outline a bunch of examples of corrigible behavior that I would absolutely not want in an assistant—like asking me for approval before every minor action! By contrast, the near-corrigible behavior—asking me only when the robot has genuine uncertainty—seems more desirable… so long as the robot has calibrated uncertainty. To me, CIRL and corrigibility seem like two extremes: CIRL is focusing on maximizing human reward, whereas corrigibility is focused on avoiding ever doing the wrong thing even under model misspecification. In practice, we need a bit of both—but I don’t think we have a good theoretical framework for that yet.
In addition to that, I hope this post serves as a useful framework to ground future discussions on this. Unfortunately I think there’s been an awful lot of talking past each other in debates on this topic in the past. For example, to the best of my knowledge, Hadfield-Menell and other authors of CIRL never believed it solved corrigibility under the assumptions Carey introduced. Although our framework is toy, I think it captures the key assumptions people disagree about, and it can be easily extended to capture more as needed in future discussions.

CIRL Corrigibility is Fragile

Rachel Freedman and AdamGleave

Dec 21, 2022, 1:40 AM

58 points

8 comments · 12 min read

AdamGleave Nov 7, 2022, 5:41 AM
9 points
6
on: Instead of technical research, more people should focus on buying time
I’m excited by many of the interventions you describe but largely for reasons other than buying time. I’d expect buying time to be quite hard, in so far as it requires coordinating many actors to stop doing something they’re incentivized to do. Whereas since the alignment research community is small, doubling it is relatively easy. However, it’s ultimately a point in favor of the interventions that they look promising under multiple worldviews, but it might lead me to prioritize within them differently to you.

One area I would push back on is the skills you describe as being valuable for “buying time” seem like a laundry list for success in research in general, especially empirical ML research:

Skills that seem uniquely valuable for buying time interventions: general researcher aptitudes, ability to take existing ideas and strengthen them, experimental design skills, ability to iterate in response to feedback, ability to build on the ideas of others, ability to draw connections between ideas, experience conducting “typical ML research,” strong models of ML/capabilities researchers, strong communication skills

It seems pretty bad for the people strongest at empirical ML research to stop doing alignment research. Even if we pessimistically assume that empirical research now is useless (which I’d strongly disagree with), surely we need excellent empirical ML researchers to actually implement the ideas you hope the people who can “generate and formalize novel ideas” come up with. There are a few aspects of this (like communication skills) that do seem to differentially point in favor of “buying time”, maybe have a shorter and more curated list in future?

Separately given your fairly expansive list of things that “buy time” I’d have estimated that close to 50% of the alignment community are already doing this—even if they believe their primary route to impact is more direct. For example, I think most people working on safety at AGI labs would count under your definition: they can help convince decision-makers in the lab not to deploy unsafe AI systems, buying us time. A lot of the work on safety benchmarks or empirical demonstrations of failure modes falls into this category as well. Personally I’m concerned people are falling into this category of work by default and that there’s too much of this, although I do think when done well it can be very powerful.

AdamGleave Nov 2, 2022, 3:06 AM
LW: 2 AF: 2
0
AF
in reply to: Erik Jenner’s comment on: Response to Katja Grace’s AI x-risk counterarguments
I agree that in a fast takeoff scenario there’s little reason for an AI system to operate within existing societal structures, as it can outgrow them quicker than society can adapt. I’m personally fairly skeptical of fast takeoff (<6 months say) but quite worried that society may be slow enough to adapt that even years of gradual progress with a clear sign that transformative AI is on the horizon may be insufficient.
In terms of humans “owning” the economy but still having trouble getting what they want, it’s not obvious this is a worse outcome than the society we have today. Indeed this feels like a pretty natural progression of human society. Humans already interact with (and not so infrequently get tricked or exploited by) entities smarter than them such as large corporations or nation states. Yet even though I sometimes find I’ve bought a dud on the basis of canny marketing, overall I’m much better off living in a modern capitalist economy than the stone age where humans were more directly in control.
However, it does seem like there’s a lot of value lost in the scenario where humans become increasingly disempowered, even if their lives are still better than in 2022. From a total utilitarian perspective, “slightly better than 2022” and “all humans dead” are rounding errors relative to “possible future human flourishing”. But things look quite different under other ethical views, so I’m reluctant to conflate these outcomes.

AdamGleave Oct 28, 2022, 6:30 PM
LW: 11 AF: 5
0
AF
on: Response to Katja Grace’s AI x-risk counterarguments
Thanks for this response, I’m glad to see more public debate on this!
The part of Katja’s part C that I found most compelling was the argument that for a given AI system its best interests might be to work within the system rather than aiming to seize power. Your response argues that even if this holds true for AI systems that are only slightly superhuman, eventually we will cross a threshold where a single AI system can take over. This seems true if we hold the world fixed—there is some sufficiently capable AI system that can take over the 2022 world. But this capability threshold is a moving target: humanity will get better at aligning and controlling AI systems as we gain more experience with them, and we may be able to enlist the help of AI systems to keep others in check. So, why should we expect the equilibrium here to be an AI takeover, rather than AIs working for humans because it is in their selfish best interest in a market economy where humans are currently the primary property owners?
I think the crux here is whether we expect AI systems to by default collude with one another. They might—they have a lot of things in common that humans don’t, especially if they’re copies of one another! But coordination in general is hard, especially if it has to be surreptitious.
As an analogy, I could argue that for much of human history soldiers were only slightly more capable than civilians. Sure, a trained soldier with a shield and sword is a fearsome opponent, but a small group of coordinated civilians could be victorious. Yet as we develop more sophisticated weapons such as guns, cannons, missiles, the power that a single soldier has grows greater and greater. So, by your argument, eventually a single soldier will be powerful enough to take over the world.
This isn’t totally fanciful—the Spanish conquest of the Inca Empire started with just 168 soldiers! The Spanish fought with swords, crossbows, and lances—if the Inca Empire were still around, it seems likely that a far smaller modern military force could defeat them. Yet, clearly no single soldier is in a position to take over the world, or even a small city. Military coups d’état are the closest, but involve convincing a significant fraction of the military that it is in their interest to seize power. Of course most soldiers wish to serve their nation, not seize power, which goes some way to explaining the relatively low rate of coup attempts. But it’s also notable that many coup attempts fail, or at least do not lead to a stable military dictatorship, precisely because of the difficulty of internal coordination. After all, if someone intends to destroy the current power structure and violate their promises, how much can you trust that they’ll really have your back if you support them?
An interesting consequence of this is that it’s ambiguous whether making AI more cooperative makes the situation better or worse.

AdamGleave Oct 2, 2022, 7:59 PM
LW: 5 AF: 4
AF
in reply to: Steven Byrnes’s comment on: [Intro to brain-like-AGI safety] 1. What’s the problem & Why work on it now?
Thanks for the quick reply! I definitely don’t feel confident in the 20W number, I could believe 13W is true for more energy efficient (small) humans, in which case I agree your claim ends up being true some of the time (but as you say, there’s little wiggle room). Changing it to 1000x seems like a good solution though which gives you plenty of margin for error.

AdamGleave Oct 2, 2022, 2:09 AM
LW: 12 AF: 8
2
AF
on: [Intro to brain-like-AGI safety] 1. What’s the problem & Why work on it now?
This is a nitpick, but I don’t think this claim is quite right (emphasis added)
If a silicon-chip AGI server were literally 10,000× the volume, 10,000× the mass, and 10,000× the power consumption of a human brain, with comparable performance, I don’t think anyone would be particularly bothered—in particular, its electricity costs would still be below my local minimum wage!!
First, how much power does the brain use? 20 watts is StackExchange’s answer, but I’ve struggled to find good references here. The appealingly named Appraising the brain’s energy budget gives 20% of the overall calories consumed by the body, but that begs the question of the power consumption of the human body, and whether this is at rest or under exertion, etc. Still, I don’t think the 20 watts figure is more than 2x off, so let’s soldier on.
10,000 times 20 watts is 200 kW. That’s a large but not insane amount of power, though it is roughly ten times what a domestic power supply in the US can deliver (some larger homes might have a 200A @ 120V circuit, for 19.2 kW of permissible load under the 80% rule). You also wouldn’t be able to power the HVAC needed to cool all these chips, but let’s suppose you live in Alaska and can just open the windows.
At the time of writing, the cheapest US electricity prices are around $0.09 per kWh with many states (including Alaska, unfortunately) being twice that at around $0.20/kWh. But let’s suppose you’re in both a cool climate and have a really great deal on electricity. So your 200 kW of chips costs you just $0.09*200=$18/hour.
Federal minimum wage is $7.25/hour, and the highest I’m aware of in any US state is $15/hour. So it seems that you won’t be cheaper than the brain on electricity prices if 10,000 times less efficient. I’ve systematically tried to make favorable assumptions here. Your 200kW proto-AGI probably won’t be in an Alaskan garage, but in a tech company’s data center with corresponding costs for HVAC, redundant power, security, etc. Colo costs vary widely depending on location and economies of scale. A recent quote I had was at around the $0.40/kWh mark—so about 4x the cost quoted above.
This doesn’t massively change the qualitative takeaway, which is that even if something was 10,000 (or even a million times) less efficient than the brain, we’d absolutely still go ahead and build a demo anyway. But it is worth noting that something at the $60/hour range might not actually be all that transformative unless it’s able to perform highly skilled labor—at least until we make it more efficient (which would happen quite rapidly).
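The electricity comparison can be reproduced with a short calculation. This is a sketch using only the figures assumed in this comment (a 20 W brain, $0.09 and $0.20 per kWh, $7.25 and $15 minimum wages); the function names are hypothetical.

```python
BRAIN_WATTS = 20            # assumed brain power draw from the comment
INEFFICIENCY = 10_000       # how many times more power the AGI server uses

load_kw = BRAIN_WATTS * INEFFICIENCY / 1000   # server load in kW


def hourly_cost(load_kw, price_per_kwh):
    """Electricity cost in dollars per hour of running the load."""
    return load_kw * price_per_kwh


def break_even_factor(hourly_wage, price_per_kwh, brain_watts=BRAIN_WATTS):
    """Largest inefficiency factor at which electricity still costs
    no more than the given hourly wage."""
    affordable_kw = hourly_wage / price_per_kwh
    return affordable_kw / (brain_watts / 1000)


cheap = hourly_cost(load_kw, 0.09)
typical = hourly_cost(load_kw, 0.20)
factor_federal = break_even_factor(7.25, 0.09)
factor_high = break_even_factor(15.00, 0.09)

print(f"load: {load_kw:.0f} kW")
print(f"cost at $0.09/kWh: ${cheap:.2f}/hour, at $0.20/kWh: ${typical:.2f}/hour")
print(f"break-even inefficiency: {factor_federal:.0f}x (federal), {factor_high:.0f}x ($15/hour)")
```

Under these assumptions the break-even inefficiency sits somewhat below 10,000x even at the cheapest electricity price, consistent with the nitpick above.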

AdamGleave Oct 1, 2022, 1:16 AM
LW: 2 AF: 2
1
AF
on: Inverse Scaling Prize: Round 1 Winners
“The Floating Droid” example is interesting as there’s a genuine ambiguity in the task specification here. In some sense that means there’s no “good” behavior for a prompted imitation model here. (For an instruction-following model, we might want it to ask for clarification, but that’s outside the scope of this contest.) But it’s interesting the interpretation flips with model scale, and in the opposite direction to what I’d have predicted (doing EV calculations is harder so I’d have expected scale to increase not decrease EV answers.) Follow-up questions I’d be excited to see the author address include:
1. Does the problem go away if we include an example where EV and actual outcome disagree? Or do the large number of other spuriously correlated examples overwhelm that?
2. How sensitive is this to prompt? Can we prompt it some other way that makes smaller models more likely to do actual outcome, and larger models care about EV? My guess is the training data that’s similar to those prompts does end up being more about actual outcomes (perhaps this says something about the frequency of probabilistic vs non-probabilistic thinking on internet text!), and that larger language models end up capturing that. But perhaps putting the system in a different “personality” is enough to resolve this. “You are a smart, statistical assistant bot that can perform complex calculations to evaluate the outcomes of bets. Now, let’s answer these questions, and think step by step.”

AdamGleave Sep 6, 2022, 2:48 AM
LW: 4 AF: 1
1
AF
in reply to: Joe Collman’s comment on: An Update on Academia vs. Industry (one year into my faculty job)
It’s not clear to me how we can encourage rigor where effective without discouraging research on areas where rigor isn’t currently practical. If anyone has ideas on this, I’d be very interested.

A rough heuristic I have is that if the idea you’re introducing is highly novel, it’s OK to not be rigorous. Your contribution is bringing this new, potentially very promising, idea to people’s attention. You’re seeking feedback on how promising it really is and where people are confused, which will be helpful for then later formalizing it and studying it more rigorously.
But if you’re engaging with a large existing literature and everyone seems to be confused and talking past each other (which is how I’d characterize a significant fraction of the mesa-optimization literature, for example), then the time has come to make things more rigorous, and you are unlikely to make much further progress without it.

AdamGleave Sep 3, 2022, 9:41 PM
LW: 16 AF: 7
7
AF
on: An Update on Academia vs. Industry (one year into my faculty job)
Work that is still outside the academic Overton window can be brought into academia if it can be approached with the technical rigor of academia, and work that meets academic standards is much more valuable than work that doesn’t; this is both because it can be picked up by the ML community, and because it’s much harder to tell if you are making meaningful progress if your work doesn’t meet these standards of rigor.
Strong agreement with this! I’m frequently told by people that you “cannot publish” on a certain area, but in my experience this is rarely true. Rather, you have to put more work into communicating your idea, and justifying the claims you make—both a valuable exercise! Of course you’ll have a harder time publishing than on something that people immediately understand—but people do respect novel and interesting work, so done well I think it’s much better for your career than one might naively expect.
I especially wish there was more emphasis on rigor on the Alignment Forum and elsewhere: it can be valuable to do early-stage work that’s more sloppy (rigor is slow and expensive), but when there are long-standing disagreements it’s usually better to start formalizing things or performing empirical work than continuing to opine.
That said, I do think academia has some systemic blindspots. For one, I think CS is too dismissive of speculative and conceptual research—much of this work will end up being mistaken admittedly, but it’s an invaluable source of ideas. I also think there’s too much emphasis on an “algorithmic contribution” in ML, which leads to undervaluing careful empirical evaluations and understanding failure modes of existing systems.

AdamGleave Aug 31, 2022, 8:21 PM
11 points
5
on: (My understanding of) What Everyone in Technical Alignment is Doing and Why
I liked this post and think it’ll serve as a useful reference point, I’ll definitely send it to people who are new to the alignment field.
But I think it needs a major caveat added. As a survey of alignment researchers who regularly post on LessWrong or interact closely with that community, it does a fine job. But as capybaralet already pointed out, it misses many academic groups. And even some major industry groups are de-emphasized. For example, DeepMind alignment is 20+ people, and has been around for many years. But it’s got if anything a slightly less detailed write-up than Team Shard, a small group of people for a few months, or infra-Bayesianism, largely one person for several years.
The best shouldn’t be the enemy of the good, and some groups are just quite opaque, but I think it does need to be clearer about its limitations. One antidote would be including in the table a sense of # of people, # of years it’s been around, and maybe even funding to get a sense of what the relative scale of these different projects is.
