Yes, the cost of 1 bit for the OR gate was based on the somewhat arbitrary choice to consider only OR and AND gates. A bit more formally, the heuristic explanations in the post implicitly use a “reference class” of circuits where each binary gate was randomly chosen to be either an OR or an AND, and each input wire to a binary gate was randomly chosen to have a NOT or not. The arbitrariness of this choice of reference class is one obstruction to formalizing heuristic explanations and surprise accounting. We are currently preparing a paper that explores this and related topics, but unfortunately the core issue remains unresolved.
Jacob_Hilton
Formal verification, heuristic explanations and surprise accounting
See the statement from OpenAI in this article:
We’re removing nondisparagement clauses from our standard departure paperwork, and we’re releasing former employees from existing nondisparagement obligations unless the nondisparagement provision was mutual. We’ll communicate this message to former employees.
They have communicated this to me and I believe I was in the same category as most former employees.
I think the main reasons so few people have mentioned this are:
As I mentioned, there is still some legal ambiguity and additional avenues for retaliation
Some people are taking their time over what they want to say
Most people don’t want to publicly associate themselves with a controversial situation
Most people aren’t inclined to disparage their former employer anyway, and so they may not think of their own situation as that big of a deal
Yeah I agree with this, and my original comment comes across too strongly upon re-reading. I wanted to point out some counter-considerations, but the comment ended up unbalanced. My overall view is:
It was highly inappropriate for the company to have been issuing these agreements so widely, especially using such aggressive tactics and without allowing disclosure of the agreement, given the technology that it is developing.
The more high-profile and credible a person is, the more damaging it is for this person to have been subject to the agreement.
Nevertheless, it is a mistake to think of potential “disparagement” as part of the job duties of most of the people mentioned, and the post appears to wildly misinterpret the meaning of this term as “taking any actions which might make the company less valuable”.
Ultimately, it would have looked extremely bad for the company to enforce one of these agreements, so the primary effect of the contract comes down to how individuals felt that it constrained their behavior. We don’t have great visibility into this. It’s possible that some of these people felt quite constrained, and it’s also possible that some of these people weren’t even aware of the non-disparagement clause because they didn’t notice it when they signed.
Thankfully, most of this is now moot as the company has retracted the contract. I should emphasize that there may remain some legal ambiguity and additional avenues for retaliation, but I am optimistic that these will be cleaned up in the near future. There will still be non-disparagement agreements in place in cases where “the non-disparagement provision was mutual” (in the words of the company), but my strong guess is that this refers only to the original Anthropic departures and perhaps a handful of other individuals who were high up at the company.
It remains important for people to disclose their financial interest in the company when appropriate, or in some cases give up this interest to avoid a conflict of interest.
Note: I have a financial interest in the company and was subject to one of these agreements until recently.
We were especially alarmed to notice that the list contains at least 12 former employees currently working on AI policy, and 6 working on safety evaluations. This includes some in leadership positions, for example:
I don’t really follow this reasoning. If anything, playing a leadership role in AI policy or safety evaluations will usually give you an additional reason not to publicly disparage AI companies, to avoid being seen as partisan, making being subject to such an agreement less of an issue. I would be pretty surprised if such people subject to these agreements felt particularly constrained in what they could say as part of their official duties, although if I am wrong about this then it does seem like quite a concerning thing to have happened. The obvious exception to this is if the role involves unofficial public commentary about labs, but it’s not obvious to me that this has been a big part of the role of any of the people on your list, and even then, they may not have felt especially constrained, depending on the individual. It’s also worth noting that several of these roles require the holder to give up or donate lab equity to avoid any conflict of interest, regardless of any non-disparagement agreements.
I suspect the crux may be our differing interpretations of the agreement. I’m not sure where your interpretation that it prohibits “taking any actions which might make the company less valuable” comes from, maybe you could highlight the part of the agreement you are basing that on.
If the question is whether I think they were true at time given the information I have now, I think all of the individual points hold up except for the first and third “opinions”. I am now less sure about what OpenAI leadership believed or cared about. The last of the “opinions” also seems potentially overstated. Consequently, the overall thrust now seems off, but I still think it was good to share my views at the time, to start a discussion.
If the question is about the state of the organization now, I know less about that because I haven’t worked there in over a year. But the organization has certainly changed a lot since this post was written over 18 months ago.
- Dec 16, 2023, 5:39 AM; 12 points) 's comment on Common misconceptions about OpenAI by (
Since this post was written, OpenAI has done much more to communicate its overall approach to safety, making this post somewhat obsolete. At the time, I think it conveyed some useful information, although it was perceived as more defensive than I intended.
My main regret is bringing up the Anthropic split, since I was not able to do justice to the topic. I was trying to communicate that OpenAI maintained its alignment research capacity, but should have made that point without mentioning Anthropic.
Ultimately I think the post was mostly useful for sparking some interesting discussion in the comments.
EDIT: See also this retrospective, posted later in May 2024.
I think KL/entropy regularization is usually used to prevent mode collapse partly because it has nice theoretical properties. In particular, it is easy to reason about the optimal policy for the regularized objective—see for example the analysis in the paper Equivalence Between Policy Gradients and Soft Q-Learning.
Nevertheless, action-dependent baselines do appear in the literature, although the story is a bit confusing. This is my understanding of it from some old notes:
The idea was explored in Q-Prop. But unlike you, their intention was not to change the optimal policy, but rather to reduce the variance of the policy gradient. Therefore they also incorporated an additional term to cancel out the bias introduced by the action-dependent baseline. (Incidentally, perhaps this analysis is also relevant to understanding ACTDE.)
Later, The Mirage of Action-Dependent Baselines showed that in fact the variance reduction due the action-dependent baseline was negligible, and the entire benefit of Q-Prop was essentially due to a bug! The implementation normalized advantage estimates, but failed to apply the same adjustment to the bias-correction term, which turned out to be independently helpful because it’s essentially the DDPG training objective.
We will do our best to fairly consider all applications, but realistically there is probably a small advantage to applying earlier. This is simply because there is a limit to how quickly we can grow the organization, so if hiring goes better than expected then it will be longer before we can take on even more people. With that being said, we do not have a fixed number of positions that we are hiring for; rather, we plan to vary the number of hires we make based on the strength of the applications we receive. Moreover, if we were unable to hire someone due to capacity constraints, we would very likely be interested in hiring them at a later date. For these reasons, I think the advantage to applying earlier is a fairly small consideration overall, and it sounds like it would make more sense for you to apply whenever you are comfortable.
The questions on the take-home test vary in difficulty but are generally easier than olympiad problems, and should be accessible to graduates with relevant background. However, it is important to note that we are ultimately interested in research ability rather than the ability to solve self-contained problems under timed conditions. So although the take-home test forms part of our assessment, we also look at other signals such as research track-record (while recognizing that assessing research ability is unfortunately very hard).
(Note: I am talking about the current version of the test, it’s possible that the difficulty will change as we refine our interview process.)
I think the kind of mathematical problem solving we’re engaged in is common across theoretical physics (although this is just my impression as a non-physicist). I’ve noticed that some specific topics that have come up (such as Gaussian integrals and permanents) also crop up in quantum field theory, but I don’t think that’s a strong reason to prefer that background particularly. Broad areas that often come up include probability theory, computational complexity and discrete math, but it’s not necessary to have a lot of experience in those areas, only to be able to pick things up from them as needed.
ARC is hiring theoretical researchers
It’s not quite this simple, the same issue arises if every PSD completion of the known-diagonal minor has zero determinant (e.g. ((?, 1, 2), (1, 1, 1), (2, 1, 1))). But I think in that case making the remaining diagonal entries large enough still makes the eigenvalues at least −ε, which is good enough.
I think the examples you give are valid, but there are several reasons why I think the situation is somewhat contingent or otherwise less bleak than you do:
Counterexamples: I think there are research agendas that are less pre-paradigmatic than the ones you’re focused on that are significantly more (albeit not entirely) parallelizable. For example, mechanistic interpretability and scalable oversight both have multiple groups focused on them and have grown substantially over the last couple of years. I’m aware that we disagree about how valuable these directions are.
Survival of the fittest: Unfortunately I think in cases where an individual has been pursuing a research direction for many years and has tried but failed to get anyone else on board with it, there is some explanatory power to the hypothesis that the direction is not that productive. Note that I’m not claiming to have a strong view on any particular agenda, and there are of course other possibilities in any given case. On the flip side, I expect promising directions to gain momentum over time, even if only gradually, and I consider the counterexamples from point 1 to be instances of this effect.
Fixable coordination/deference failures: I think it would be a mistake for absolutely everyone to go off and try to develop their own alignment strategy from scratch, and it’s plausible that the group you’re focused on is erring too far in this direction. My own strategy has been to do my best to develop my own inside view (which I think is important for research prioritization and motivation as well from a group epistemics perspective), use this to narrow down my set of options to agendas I consider plausible, but be considerably more willing to defer when it comes to making a final call about which agenda to pursue.
Clarity from AI advances: If the risk from AI is real, then I expect the picture of it to become clearer over time as AI improves. As a consequence, it should become clearer to people which directions are worth pursuing, and theoretical approaches should evolve into practical ones than can be iterated on empirically. This should both cause the field to grow and lead to more parallelizable work. I think this is already happening, and even the public at large is picking up on the spookiness of current alignment failures (even though the discourse is unsurprisingly very muddled).
You might find this work interesting, which takes some small steps in this direction. It studies the effect of horizon length inasmuch as it makes credit assignment harder, showing that the number of samples required is an affine function of horizon length in a toy context.
I think the direction depends on what your expectations were – I’ll try to explain.
First, some terminology: the term “horizon length” is used in the paper to refer to the number of timesteps over which the algorithm pays attention to rewards, as governed by the discount rate. In the biological anchors framework, the term “effective horizon length” is used to refer to a multiplier on the number of samples required to train the model, which is influenced by the horizon length and other factors. For clarity, I’ll using the term “scaling multiplier” instead of “effective horizon length” in this comment. The paper studies the effect of the horizon length on the scaling multiplier in a toy MNIST setting.
One key takeaway is that the scaling multiplier is not simply proportional to the horizon length, as one might have naively expected. Instead, the number of samples required is the sum of two components, one that is inherent to the task and independent of the horizon length, and one that is proportional to the horizon length. Compared to the naive expectation, this means that training compute requirements are lower. On the other hand, this ignores reward sparsity, so you might expect training compute requirements to be higher once both horizon length and reward sparsity are accounted for.
The paper also lends some support to the modeling assumptions of the neural network anchor, by validating the hypotheses that (a) training compute requirements still scale as a power law in model size for reinforcement learning, and with a similar exponent, and (b) the scaling multiplier can indeed vary a lot between environments. This might make you put more weight on the neural network anchor, which could again have either directional effect.
The other takeaways are more methodological and I don’t think have much of a directional effect.
The effect of horizon length on scaling laws
I would wildly speculate that “simply” scaling up RLHF ~100x, while paying careful attention to rewarding models appropriately (which may entail modifying the usual training setup, as discussed in this comment), would be plenty to get current models to express calibrated uncertainty well. However:
In practice, I think we’ll make a lot of progress in the short term without needing to scale up this much by using various additional techniques, some that are more like “tricks” (e.g. teaching the model to generally express uncertainty when answering hard math problems) and some more principled (e.g. automating parts of the evaluation).
Even ~100x is still much less than pre-training (e.g. WebGPT used ~20k binary comparisons, compared to ~300b pre-training tokens for GPT-3). The difficulty of course is that higher-quality data is more expensive to collect. However, most of the cost of RLHF is currently employee hours and compute, so scaling up data collection ~100x might not be as expensive as it sounds (although it would of course be a challenge to maintain data quality at this scale).
Even though scaling up data collection will help, I think it’s more important for labs to be prioritizing data quality (i.e. “reducing bias” rather than “reducing variance”): data quality issues are in some sense “scarier” in the long run, since they lead to the model systematically doing the wrong thing (e.g. deceiving the evaluators) rather than defaulting to the “safer” imitative pre-training behavior.
It’s pretty unclear how this picture will evolve over time. In the long run, we may end up needing much less extremely high-quality data, since larger pre-trained models are more sample efficient, and we may get better at using techniques like automating parts of the evaluation. I’ve written more about this question here, and I’d be excited to see more people thinking about it.
In short, sample efficiency is a problem right now, but not the only problem, and it’s unclear how much longer it will continue to be a problem for.
My understanding of why it’s especially hard to stop the model making stuff up (while not saying “I don’t know” too often), compared to other alignment failures:
The model inherits a strong tendency to make stuff up from the pre-training objective.
This tendency is reinforced by the supervised fine-tuning phase, if there are examples of answers containing information that the model doesn’t know. (However, this can be avoided to some extent, by having the supervised fine-tuning data depend on what the model seems to know, a technique that was employed here.)
In the RL phase, the model can in theory be incentivized to express calibrated uncertainty by rewarding it using a proper scoring rule. (Penalizing the model a lot for saying false things and a little for saying “I don’t know” is an approximation to this.) However, this reward signal is noisy and so is likely much less sample-efficient than teaching the model simple rules about how to behave.
Even if the model were perfectly calibrated, it would still make legitimate mistakes (e.g., if it were incentivized to say “I’m not sure” whenever it was <95% confident, it would still be wrong 5% of the time). In other words, there is also an inherent trade-off at play.
Labelers likely make some mistakes when assessing correctness, especially for more complex topics. This is in some sense the most pernicious cause of failure, since it’s not automatically fixed by scaling up RL, and leads to deception being directly incentivized. That being said, I suspect it’s currently driving a minority of the phenomenon.
In practice, incorporating retrieval should help mitigate the problem to a significant extent, but that’s a different kind of solution.
I expect that making the model adversarially robust to “jailbreaking” (enough so for practical purposes) will be easier than stopping the model making stuff up, since sample efficiency should be less of a problem, but still challenging due to the need to generate strong adversarial attacks. Other unwanted behaviors such as the model stating incorrect facts about itself should be fairly straightforward to fix, and it’s more a matter of there being a long list of such things to get through.
(To be clear, I am not suggesting that aligning much smarter models will necessarily be as easy as this, and I hope that once “jailbreaking” is mostly fixed, people don’t draw the conclusion that it will be as easy.)
I would say that they are motivated by the same basic idea, but are applied to different problems. The MDL (or the closely-related BIC) is a method for model selection given a dataset, whereas surprise accounting is a method for evaluating heuristic explanations, which don’t necessarily involve model selection.
Take the Boolean circuit worked example: what is the relevant dataset? Perhaps it is the 256 (input, TRUE) pairs. But the MDL would select a much simpler model, namely the circuit that ignores the input and outputs TRUE (or “x_1 OR (NOT x_1)” if it has to consist of AND, OR and NOT gates). On the other hand, a heuristic explanation is not interested choosing a simpler model, but is instead interested in explaining why the model we have been given behaves in the way it does.
The heuristic explanations in the post do use a single prior or over the set of circuits, which we also call a “reference class”. But we wish to allow explanations that use other reference classes, as well as explanations that combine multiple reference classes, and perhaps even explanations that use “subjective” reference classes that do not seem to correspond to any precise prior. These are the sorts of issues explored in the upcoming paper. Ultimately, though, a lot of our heuristic arguments and the surprise accounting for them remain somewhat ambiguous or informal.