AdamGleave

Karma: 916

AdamGleave Aug 31, 2022, 8:07 PM
14 points
0
on: (My understanding of) What Everyone in Technical Alignment is Doing and Why
One omission from the list is the Fund for Alignment Research (FAR), which I’m a board member of. That’s fair enough: FAR is fairly young, and doesn’t have a research agenda per se, so it’d be hard to summarize their work from the outside!. But I thought it might be of interest to readers so I figured I’d give a quick summary here.
FAR’s theory of change is to incubate new, scalable alignment research agendas. Right now I see a small range of agendas being pursued at scale (largely RLHF and interpretability), then a long tail of very diverse agendas being pursued by single individuals (mostly independent researchers or graduate students) or 2-3 person teams. I believe there’s a lot of valuable ideas in this long tail that could be scaled, but this isn’t happening due to a lack of institutional support. It makes sense that the major organisations want to focus on their own specific agendas—there’s a benefit to being focused! -- but it means a lot of valuable agendas are slipping through the cracks.
FAR’s current approach to solving this problem is to build out a technical team (research engineers, junior research scientists, technical communication specialists) and provide support to a broad range of agendas pioneered by external research leads. Those that work, FAR will double down on and invest more in. This model has had a fair amount of demand already so there’s product-market fit, but we still want to iterate and see if we can improve the model. For example, long-term FAR might want to bring some or all research leads in-house.
In terms of concrete agendas, an example of some of the things FAR is working on:
- Adversarial attacks against narrowly superhuman systems like AlphaGo.
- Language model benchmarks for value learning.
- The inverse scaling law prize.
You can read more about us on our launch post.
What links here?
- (My understanding of) What Everyone in Technical Alignment is Doing and Why by Thomas Larsen (Aug 29, 2022, 1:23 AM; 413 points)

AdamGleave Jul 8, 2022, 5:52 PM
LW: 5 AF: 2
0
AF
in reply to: Charbel-Raphaël’s comment on: Benchmark: goal misgeneralization/concept extrapolation
A related dataset is Waterbirds, described in Sagawa et al (2020), where you want to classify birds as landbirds or waterbirds regardless of whether they happen to be on a water or land background.
The main difference from HappyFaces is that in Waterbirds the correlation between bird type and background is imperfect, although strong. By contrast, HappyFaces has perfect spurious correlation on the training set. Of course you could filter Waterbirds to make the spurious correlation perfect to get an equally challenging but more natural dataset.

Introducing the Fund for Alignment Research (We’re Hiring!)

AdamGleave, Scott Emmons, Ethan Perez and Claudia Shi

Jul 6, 2022, 2:07 AM

62 points

0 comments4 min readLW link

AdamGleave Jul 2, 2022, 9:51 PM
15 points
0
on: AI Could Defeat All Of Us Combined
A lot of this argument seems to rest on the training-inference gap, allowing a very large population of AIs to exist at the same as cost as training. In that way they can be a formidable group even if the individual AIs are only human-level. I was suspicious of this at first, but I found myself largely coming round to it after sanity checking it using a slightly different method than biological anchors. However, if I understand correctly the biological anchors framework implies the gap between training and inference grows with capabilities. My projection instead expects it to grow a little in the next few years and then plateau as we hit the limits of data scaling. This suggests a more continuous picture: there will be a “population explosion” of AI systems in the next few years so to speak as we scale data, but then the “population size” (total number of tokens you can generate for your training budget) will stay more or less constant, while the quality of the generated tokens gradually increases.

To a first approximation, the amount of inference you can do at the same cost as training the system will equal the size of the training data multiplied by number of epochs. The trend in large language models seems to be to train for only 1 epoch on most data, and a handful of epochs for the highest-quality parts of the data. So as a rule of thumb: if you spend $X on training and $X on inference, you can produce as much data as your training dataset. Caveat: inference can be more expensive (e.g. beam search) or less expensive (e.g. distillation, specialized inference-only hardware) and depends on things like how much you care about latency; I think this only changes the picture by 10x either way.

Given that GPT-3 was trained on a significant fraction of the entire text available on the Internet (CommonCrawl), this would already be a really big deal if GPT-3 was actually close to human-level. Adding another Internets-worth of content would be… significant.

But conversely, the fact we’re already training on so much data limits how much room for growth there is. I’d estimate we have no more than 100-1000x left for language scaling. We could probably get up to 10x more from more comprehensive (but lower quality) crawls than CommonCrawl, and 10-100x more if tech companies use non-public datasets (e.g. all e-mails & docs on a cloud service).

By contrast, in principle compute could scale up a lot more than this. We can likely get 10-100x just from spending more on training runs. Hardware progress could easily deliver 1000x by 2036, the date chosen in this post.

Given this, at least under business as usual scaling I expect us to hit the limits of data scaling significantly before we exhaust compute scaling. So we’ll have larger and more compute-intensive models trained on relatively small datasets (although still massive in absolute terms). This suggests the training-inference gap grow a bit as we grow training data size, but soon plateau as we just scale up model size while keeping training data fixed.

One thing that could do undo this argument is if we end up training for many (say >10) epochs, or synthetically generate data, as a kind of poor-mans data scaling rather than just scaling up parameter count. I expect we’ll try this, but I’d only give it 30% odds it makes a big difference. I do think it’s more likely if we move away from the LM paradigm, and either get a lot of mileage out of multi-modal models (there’s lots more video data at least in terms of GB, maybe not in terms of abstract information content) or back towards RL (where data generated in simulation seems much more valuable and scalable).

AdamGleave Jan 18, 2022, 7:24 PM
3 points
in reply to: DanielFilan’s comment on: Delta Strain: Fact Dump and Some Policy Takeaways
I did actually mean 45, in “all-things-considered” I was including uncertainty in whether my toy model was accurate. Since it’s a right-tailed distribution, my model can underestimate the true amount a lot more than it can overestimate it.

For what it’s worth, my all-things-considered view for Delta is now more like 30, as I’ve not really seen anything all that compelling for long COVID being much worse than in the model. I’m not sure about Omicron; it seems to be less virulent, but also more vaccine escape. Somewhere in the 15-90 day range sounds right to me, I’ve not thought enough to pin it down precisely.

AdamGleave Sep 18, 2021, 4:33 AM
LW: 3 AF: 3
AF
in reply to: Rohin Shah’s comment on: Immobile AI makes a move: anti-wireheading, ontology change, and model splintering
My sense is that Stuart assuming there’s an initial-specified reward function is a simplification, not a key part of the plan, and that he’d also be interested in e.g. generalizing a reward function learned from other sources of human feedback like preference comparison.

IRD would do well on this problem because it has an explicit distribution over possible reward functions, but this isn’t really that unique to IRD—Bayesian IRL or preference comparison would have the same property.

AdamGleave Aug 23, 2021, 9:49 PM
1 point
in reply to: DirectedEvolution’s comment on: What fraction of breakthrough COVID cases are attributable to low antibody count?
It could be net-negative if receiving a booster shot caused stronger imprinting, making future immune response less adaptive. I don’t have a good sense of whether this original antigenic sin effect has already saturated after receiving two-doses (or even a single-dose), or whether it continues to become stronger.

My sense is this is an open question. From Petras et al (2021):

As suggested by a recent observation in naturally immunized individuals receiving two doses of the Pfizer COVID-19 (Comirnaty) vaccine, original antigenic sin may pose a problem in future research and development of vaccines.16 While the first dose of the vaccine was able to raise the preexisting levels of functional and specific antibodies, these either failed to change or even declined after the second dose (virus-neutralizing antibodies), and the same applied to the levels of antigen-specific antibody-secreting cells. As this observation was made in only a small group of 13 subjects with naturally acquired immunity against SARS-CoV-2, who had rather average or below-average levels of the antibodies assessed, one may expect an enhanced effect of original antigenic sin after new vaccination against COVID-19 in those with manyfold higher antibody levels after complete immunization.

That said, I’d expect a third booster to be protective against Delta, given that vaccines against ancestral variant are still highly effective against Delta and that Delta is a significant threat right now. But I do think it’s plausible (though not firmly established) that a third booster shot may reduce the effectiveness of future variant-specific boosters. Targeting dramatically different protein targets might well help, although might also take longer to get approved.

Ultimately, I expect a third booster will still make sense for a lot of people, if (a) your immune response has waned (e.g. 6 months or longer since 2nd dose, or immunocompromised); and (b) you expect to be receiving significant exposure from Delta in the immediate future.

AdamGleave Aug 22, 2021, 11:06 PM
6 points
in reply to: DirectedEvolution’s comment on: What fraction of breakthrough COVID cases are attributable to low antibody count?
I largely with this analysis. One major possible “side-effect” of a third booster is original antigenic sin. Effectively, the immune system may become imprinted on the ancestral variant of the spike protein, preventing adaptation to new variants (whether via direct exposure or via future boosters targeting new variants). This would be the main way I could see a third booster being seriously net-negative, although I don’t have a good sense of the probability. Still, if antibody levels are low, the benefit of a booster is greater and I’d guess (caveat: not an immunologist) the risk of antigenic imprinting is somewhat lower (on the basis that the immune response has already decayed).

AdamGleave Aug 17, 2021, 12:39 PM
2 points
on: A Better Time until Sunburn Calculator
Thanks for sharing this! I did notice a weird non-monotonicity: if I go from 90 minutes exposure to 120 minutes, the “Percent of Population w/ Sunburn Degree 1 at Time Exposed” drops from 96.8% to 72.7%. There is a warning in both cases that it’s outside normal range, but it still seems odd that more exposure gives lower risk.

AdamGleave Aug 4, 2021, 3:36 PM
5 points
in reply to: AdamGleave’s comment on: Delta Strain: Fact Dump and Some Policy Takeaways
Just to flag I messed up the original calculation and underestimated everything by a factor of 2x, I’ve added an errata.

I’d also recommend Matt Bell’s recent analysis, who estimates 200 days of life lost. This is much higher than the analysis in my comment and the OP. I found the assumptions and sources somewhat pessimistic but ultimately plausible.

The main things driving the difference from my comment were:
- Uses data from the UK’s Office of National Statistics that I’d missed, which has a very high number of 55% of people reporting symptoms after 5 weeks, with fairly slow rates of recovery all the way out to 120 days post-infection. Given this is significantly higher than most other studies I’ve seen, I think Matt is being pessimistic by only down-adjusting to 45%, but I should emphasize these numbers are credible and the ONS study is honestly better than most out there.
- Long COVID making your life 20% worse is on the pessimistic end. I put most mild symptoms at 5% worse. Ultimately subjective and highly dependent on what symptoms you get.
- I think the difference in hospitalized vs non-hospitalized risk is closer to 10x (based on Al-Aly figure) not Matt’s estimate of 2x, that means we should multiply by a factor of ~60% not ~97%.

AdamGleave Aug 2, 2021, 10:15 AM
5 points
in reply to: Owain_Evans’s comment on: Delta Strain: Fact Dump and Some Policy Takeaways
This is a good point, the demographics here are very skewed. I’m not too worried about it overstating risk, simply because the risk ended up looking not that high (at least after adjusting for hospitalization). I think at this point most of us have incurred more than 5 days of costs from COVID restrictions, so if that was really all the cost from COVID, I’d be pretty relaxed.

The gender skew could be an issue, e.g. chronic fatigue syndrome seems to occur at twice the rate in women than men.

AdamGleave Aug 2, 2021, 10:07 AM
11 points
in reply to: Connor_Flexman’s comment on: Delta Strain: Fact Dump and Some Policy Takeaways
This is an accurate summary, thanks! I’ll add my calculation was only for long-term sequelae. Including ~10 days cost from acute effects, my all-things-considered view would be mean of ~40 days, corresponding to 1041 uCOVIDs per hour.

This is per actual hour of (quality-adjusted) life expectancy. But given we spend ~1/3rd of our time sleeping, you probably want to value a waking-hour at 1.5x a life-hour (assuming being asleep has neutral valence). If you work a 40 hour work week and only value your productive time (I do not endorse this, by the way), then you’d want to adjust upwards by a factor of (7*24)/40=4.2.

However, this is purely private cost. You probably want to take into account the cost of infecting other people. I’m not confident in how to reason about the exponential growth side of things. If you’re in a country like the US where vaccination rates have plateaued, I tend to expect Delta to spread amongst unvaccinated people until herd immunity is reached. In this scenario you basically want infection rates to be as high as possible without overwhelming the healthcare system, so we get to herd immunity quicker. (This seems to actually be the strategy the UK government is pursuing—although obviously they’ve not explicitly stated this.) But if you’re in a country that’s still actively vaccinating vulnerable people, or where flattening the curve makes sense to protect healthcare systems, then please avoid contributing to exponential growth.

Neglecting the exponential growth side of things and just considering immediate impact on your contacts, how likely are you to transmit? I’d be surprised if it was above 40% per household contact assuming you quarantine when symptomatic (that’s on the higher end of transmission seen even with unvaccinated primary cases), but I’d also be surprised if it was below 5% (lowest figure I’ve seen); I’d guess it’s around 15% for Delta. This means if you have ~6-7 contacts as close as housemates, then your immediate external cost roughly equals your private cost.

Specifically, two studies I’ve seen on secondary attack rate given vaccination (h/t @Linch) give pretty wildly varying figures, but suggest at least 2x reduction in transmission from vaccination. Layan et al (2021) found 40% of household contacts of Israeli medical staff developed an infection (when Alpha was dominant), with vaccination of the primary case reducing transmission by 80%, so an 8% chance of transmission overall. Harris et al (2021) from Public Health England suggest vaccination cuts transmission risk from 10% to 5%, but these figures are likely skewed low due to not systematically testing contacts.

AdamGleave Aug 1, 2021, 4:08 PM
39 points
on: Delta Strain: Fact Dump and Some Policy Takeaways
Errata: My original calculation underestimated the risk by a factor of about 2x. I neglected two key considerations, which fortunately somewhat canceled each other out. My new estimate from the calculation is 3.0 to 11.7 quality-adjusted days lost to long-term sequelae, with my all-things-considered mean at 45.
The two key things I missed:
- I estimated the risk of a non-hospitalized case is about 10x less than a hospitalized case, and so divided the estimates of disease burden by 10x. The first part is correct, but the second part would only make sense if all disease burden was due to hospitalized cases. In fact, there’s a 15:85% split between hospitalized and non-hospitalized patients in the study (13,654:73,435). So if the disease burden for non-hospitalized is x, the total burden is 0.15*10x + 0.85*x = 2.35x. So we should divide by 2.35, not 10.
- However, as Owain pointed out below, the [demographics](https://www.nature.com/articles/s41586-021-03553-9/tables/1) are non-representative and probably skew high-risk given the median age is 60. the demographics are relatively high-risk. Indeed, this is suggested by the 15% hospitalized figure (which also, I suspect, means they just never included asymptomatic and most mildly symptomatic cases). An ONS survey (Figure 4) put symptoms reported after 5 weeks at 25% (20-30%) for 50-69 year olds and 17.5 (12.5 to 22.5%) for 17 to 24 year olds, which is surprisingly little difference, about a 1.5 decrease. I’d conjecture a 2x decrease in risk (noting that assuming no hospitalization is already doing a lot of work here).
Original post:
I did my own back-of-the-envelope calculation and came up with a similar but slightly higher estimated cost of 1.4 to 5.5 quality-adjusted days lost to long-term sequalea conditional on getting symptomatic COVID case. FWIW, I originally thought the OPs numbers seemed way too low, and was going to write a take-down post—but unfortunately the data did not cooperate with this agenda. I certainly don’t fully trust these numbers: it’s based on a single study, and there were a bunch of places I didn’t keep track of uncertainty, so the true credible interval should definitely be a lot wider. Given that and the right-tailed nature of the distribution, my all-things-considered mean is closer to 30 because of this, but figured I’d share the BOTEC anyway in case it’s helpful to anyone.
My model is pretty simple:
1. What % of symptoms are there at some short-term follow up period (e.g. 4 to 12 weeks)? This we actually have data on.
2. How bad are these symptoms? This is fairly subjective.
3. How much do we expect these symptoms to decay long-term? This is going off priors.
For 1. I used Al-Aly et al (2021) as a starting point, which was based on comparing medical records between a COVID-positive and non-COVID demographically matched control group in the US Department of Veterans Affairs database. Anna Ore felt this was one of the more rigorous ones, and I agree. Medical notes seem more reliable than self-report (though far from infallible), they seem to have actually done a Bonferroni correction, and they tested their methodology didn’t pick up any false positives via both a negative-outcome and negative-exposure controls. Caveat: many other studies have scarier headline figures, and it’s certainly possible relying on medical records skews this low (e.g. doctors might be reluctant to give a diagnosis, many patients won’t go to the doctor for mild symptoms, etc).
They report outcomes that occurred between 30 and 180 days after COVID exposure, although infuriatingly don’t seem to break it down any further by date. Figure 2 shows all statistically significant symptoms, in terms of the excess burden (i.e. increase above control) of the reported symptom per 1000 patients. There were 38 in total, ranging from 2.8% (respiratory signs and symptoms) to 0.15% (pleurisy). In total the excess burden was 26%.
I went through and rated each symptom with a very rough and subjective high / medium / low severity. 2% excess burden of high severity symptoms, 19% medium severity, 5% low severity. I then ballparked that high severity (e.g. heart disease, diabates, heart failure) wiped out 30% of your QALYs, medium severity (e.g. respiratory signs, anxiety disorders, asthma) as 5% and low (e.g. skin rash) as 1%. Caveat: there’s a lot of uncertainty in these numbers. Although I suspect I’ve gone for higher costs than most people would, since I tend to think health has a pretty big impact on productivity.
Using my weightings, we get a 1.6% reduction in QALYs conditional on symptomatic COVID case. I think this is misleading for three reasons:
1. Figure 3 shows that excess burden is much higher for people who were hospitalized, and if anything the gap seems bigger for more severe symptoms (e.g. about 10x less heart failure in people positive but not hospitalized, whereas rates of skin rash were only 2x less). This is good news as vaccines seem significantly more effective at preventing hospitalizations, and if you are fortunate enough to be a young healthy person your chance of being hospitalized was pretty low to begin with. I’m applying a 10x reduction for this.
2. This excess burden is per diagnosis, not per patient. Sick people tend to receive multiple diagnoses. I’m not sure how to handle this. In some cases, badness-of-symptoms does seem roughly additive: if I had a headache, I’d probably pay a similar amount not to also develop a skin rash then if my head didn’t hurt. But it seems odd to say that someone who drops dead from cardiac arrest was more fortunate than another patient with the same cause of death, who also had the misfortune of being diagnosed with heart failure a week earlier. So there’s definitely some double-counting with the diagnosis, which I think justifies a 2-5x decrease.
3. This study was presumably predominantly the original COVID strain (based on a cohort between March 2020 and 30 November 2020). Delta seems, per the OP, about 2-3x worse: so let’s increase it by that factor.
Overall we decrease 1.6% by a factor of 6.5 (10*2/3) to 25 (10*5/2), to get a short-term QALY reduction of 0.064% to 0.24%.
However, El-Aly et al include any symptom reported between 30 to 180 days. What we really care about is chance of lifelong symptoms if someone is experiencing a symptom after 6 months there seems like a considerable chance it’ll be lifelong, but if only 30 days has elapsed the chance of recovery seems much higher. A meta-review by Thompson et al (2021) seems to show a drop of around 2x between symptoms in a 4-12 week period vs 12+ weeks (Table 2), although with some fairly wild variation between studies so I do not trust this that much. In an extremely dubious extrapolation from this, we could say that perhaps symptoms half again from 12 weeks to 6 months, again from 6 months to a year, and after that persist as a permanent injury. In this case, we’d divide the “symptom after 30 days figure” from Al-Aly et al by a factor of 8 to get the permanent injury figure, which seems plausible to me (but again, you could totally argue for a much lower number).
With this final fudge, we get a lifelong QALY reduction of 0.008% to 0.03%. Assuming a 50-year life expectancy, this amounts to 1.4 to 5.5 days of cost from long-term sequelae. Of course, there are also short-term costs (and risk of morbidity!) that is omitted from this analysis, so the total costs will be higher than this.
What links here?

AdamGleave Nov 30, 2020, 5:45 PM
LW: 3 AF: 2
AF
in reply to: Steven Byrnes’s comment on: Inner Alignment in Salt-Starved Rats
I googled “model-based RL Atari” and the first hit was this which likewise tries to learn the reward function by supervised learning from observations of past rewards (if I understand correctly)
Ah, the “model-based using a model-free RL algorithm” approach :) They learn a world model using supervised learning, and then use PPO (a model-free RL algorithm) to train a policy in it. It sounds odd but it makes sense: you hopefully get much of the sample efficiency of model-based training, while still retaining the state-of-the-art results of model-free RL. You’re right that in this setup, as the actions are being chosen by the (model-free RL) policy, you don’t get any zero-shot generalization.
I added a new sub-bullet at the top to clarify that it’s hard to explain by RL unless you assume the planner can query the ground-truth reward function in arbitrary hypothetical states. And then I also added a new paragraph to the “other possible explanations” section at the bottom saying what I said in the paragraph just above. Thank you.
Thanks for updating the post to clarify this point—I agree with you with the new wording.
In ML today, the reward function is typically a function of states and actions, not “thoughts”. In a brain, the reward can depend directly on what you’re imagining doing or planning to do, or even just what you’re thinking about. That’s my proposal here.
Yes indeed, your proposal is quite different from RL. The closest I can think of to rewards over “thoughts” in ML would be regularization terms that take into account weights or, occasionally, activations—but that’s very crude compared to what you’re proposing.

AdamGleave Nov 26, 2020, 9:48 AM
LW: 3 AF: 2
AF
in reply to: Steven Byrnes’s comment on: Inner Alignment in Salt-Starved Rats
Thanks for the clarification! I agree if the planner does not have access to the reward function then it will not be able to solve it. Though, as you say, it could explore more given the uncertainty.

Most model-based RL algorithms I’ve seen assume they can evaluate the reward functions in arbitrary states. Moreover, it seems to me like this is the key thing that lets rats solve the problem. I don’t see how you solve this problem in general in a sample-efficient manner otherwise.
One class of model-based RL approaches is based on [model-predictive control](https://en.wikipedia.org/wiki/Model_predictive_control): sample random actions, “rollout” the trajectories in the model, pick the trajectory that had the highest return and then take the first action from that trajectory, then replan. That said, assumptions vary. [iLQR](https://en.wikipedia.org/wiki/Linear%E2%80%93quadratic_regulator) makes the stronger assumption that reward is quadratic and differentiable.
I think methods based on [Monte Carlo tree search](https://en.wikipedia.org/wiki/Monte_Carlo_tree_search) might exhibit something like the problem you discuss. Since they sample actions from a policy trained to maximize reward, they might end up not exploring enough in this novel state if the policy is very confident it should not drink the salt water. That said, they typically include explicit methods for exploration like [UCB](https://en.wikipedia.org/wiki/Thompson_sampling#Upper-Confidence-Bound_(UCB)_algorithms) which should mitigate this.

AdamGleave Nov 19, 2020, 12:13 PM
LW: 11 AF: 8
AF
on: Inner Alignment in Salt-Starved Rats
I’m a bit confused by the intro saying that RL can’t do this, especially since you later on say the neocortex is doing model-based RL. I think current model-based RL algorithms would likely do fine on a toy version of this task, with e.g. a 2D binary state space (salt deprived or not; salt water or not) and two actions (press lever or no-op). The idea would be:
- Agent explores by pressing lever, learns transition dynamics that pressing lever ⇒ spray of salt water.
- Planner concludes that any sequence of actions involving pressing lever will result in salt water spray. In a non salt-deprived state this has negative reward, so the agent avoids it.
- Once the agent becomes salt deprived, the planner will conclude this has positive reward, and so take that action.
I do agree that a typical model-free RL algorithm is not capable of doing this directly (it could perhaps meta-learn a policy with memory that can solve this).

AdamGleave Jul 31, 2020, 1:03 AM
LW: 16 AF: 8
AF
on: The ground of optimization
Thanks for the post, this is my favourite formalisation of optimisation so far!
One concern I haven’t seen raised so far, is that the definition seems very sensitive to the choice of configuration space. As an extreme example, for any given system, I can always augment the configuration space with an arbitrary number of dummy dimensions, and choose the dynamics such that these dummy dimensions always get set to all zero after each time step. Now, I can make the basin of attraction arbitrarily large, while the target configuration set remains a fixed size. This can then make any such dynamical system seem to be an arbitrarily powerful optimiser.
This could perhaps be solved by demanding the configuration space be selected according to Occam’s razor, but I think the outcome still ends up being prior dependent. It’d be nice for two observers who model optimising systems in a systematically different way to always agree within some constant factor, akin to Kolmogorov complexity’s invariance theorem, although this may well be impossible.
As a less facetious example, consider a computer program that repeatedly sets a variable to 0. It seems again we can make the optimising power arbitrarily large by making the variable’s size arbitrarily large. But this doesn’t quite map onto the intuitive notion of the “difficulty” of an optimisation problem. Perhaps including some notion of how many other optimising systems would have the same target set would resolve this.
What links here?
- Bridging Expected Utility Maximization and Optimization by Daniel Herrmann (Aug 5, 2022, 8:18 AM; 25 points)

AdamGleave Feb 1, 2019, 9:00 PM
LW: 15 AF: 5
0
AF
on: Following human norms
I feel like there are three facets to “norms” v.s. values, which are bundled together in this post but which could in principle be decoupled. The first is representing what not to do versus what to do. This is reminiscent of the distinction between positive and negative rights, and indeed most societal norms (e.g. human rights) are negative, but not all (e.g. helping an injured person in the street is a positive right). If the goal is to prevent catastrophe, learning the ‘negative’ rights is probably more important, but it seems to me that most techniques developed could learn both kinds of norms.
Second, there is the aspect of norms being an incomplete representation of behaviour: they impose some constraints, but there is not a single “norm-optimal” policy (contrast with explicit reward maximization). This seems like the most salient thing from an AI standpoint, and as you point out this is an underexplored area.
Finally, there is the issue of norms being properties of groups of agents. One perspective on this is that humans are realising their values through constructing norms: e.g. if I want to drive safely, it is good to have a norm to drive on the left or right side of the road, even though I may not care which norm we establish. Learning norms directly therefore seems beneficial to neatly integrate into human society (it would be awkward if e.g. robots drive on the left and humans drive on the right). If we think the process of going from values to norms is both difficult and important for multi-agent cooperation, learning norms also lets us sidestep a potentially thorny problem.
What links here?
- [AN #72]: Alignment, robustness, methodology, and system building as research priorities for AI safety by Rohin Shah (Nov 6, 2019, 6:10 PM; 26 points)

AdamGleave Dec 18, 2018, 10:45 PM
11 points
on: 2018 AI Alignment Literature Review and Charity Comparison
Thanks for the informative post as usual.
Full-disclosure: I’m a researcher at UC Berkeley financially supported by CHAI, one of the organisations reviewed in this post. However, this comment is just my personal opinion.
Re: location, I certainly agree that an organization does not need to be in the Bay Area to do great work, but I do think location is important. In particular, there’s a significant advantage to working in or near a major AI hub. The Bay Area is one such place (Berkeley, Stanford, Google Brain, OpenAI, FAIR) but not the only one; e.g. London (DeepMind, UCL) and Montreal (MILA, Brain, et al) are also very strong.
I also want to push back a bit on the assumption that people working for AI alignment organisations will be involved with EA and rationalist communities. While it may be true in many cases, at CHAI I think it’s only around 50% of staff. So whether these communities are thriving or not in a particular area doesn’t seem that relevant to me for organisational location decisions.

AdamGleave Nov 17, 2018, 12:02 AM
17 points
on: Current AI Safety Roles for Software Engineers
Description of CHAI is pretty accurate. I think it’s a particularly good opportunity for people who are considering grad school as a long-term option: we’re in an excellent position to help people get into top programs, and you’ll also get a sense of what academic research culture is like.
We’d like to hire more than one engineer, and are currently trialling several hires. We have a mixture of work, some of which is more ML oriented and some of which is more infrastructure oriented. So we’d be willing to consider applicants with limited ML experience, but they’d need to have strengths in other areas to compensate.
If anyone is considering any of these roles and is uncertain whether they’re a good fit, I’d encourage you to just apply. It doesn’t take much time for you to apply or for the organisation to do an initial screening. I’ve spoken to several people who didn’t think they were viable candidates for a particular role, and then turned out to be one of the best applicants we’d received.

AdamGleave

In­tro­duc­ing the Fund for Align­ment Re­search (We’re Hiring!)

Introducing the Fund for Alignment Research (We’re Hiring!)