This is cool, although I suspect that you’d get something similar from even very simple models that aren’t necessarily “modelling the world” in any deep sense, simply due to first and second order statistical associations between nearby place names. See e.g. https://onlinelibrary.wiley.com/doi/pdfdirect/10.1111/j.1551-6709.2008.01003.x , https://escholarship.org/uc/item/2g6976kg .
Leopold and Pavel were out (“fired for allegedly leaking information”) in April. https://www.silicon.co.uk/e-innovation/artificial-intelligence/openai-fires-researchers-558601
Nice job! I’m working on something similar.
> Next, I might get my agent to attempt the last three tasks in the report
I wanted to clarify one thing: Are you building custom prompts for the different tasks? If so, I’d be curious to know how much effort you put into these (I’m generally curious how much of your agent’s ability to complete more tasks might be due to task-specific prompting, vs. the use of WebDriverIO and other affordances of your scaffolding). If not, isn’t getting the agent to attempt the last three tasks as simple as copy-pasting the task instructions from the ARC Evals task specs linked in the report, and completing the associated setup instructions?
Cybersecurity seems to be in a pretty bad state globally. It’s not completely obvious to me that a historical norm of “people who discover things like SQL injection are pretty tight-lipped about them and share them only with governments / critical infrastructure folks / other cybersecurity researchers” would have led to a worse situation than the one we’re currently in, cybersecurity-wise...
I’d recommend participating in AGISF. Completely online/virtual, a pretty light commitment (I’d describe it more as a reading group than a course personally), cohorts are typically run by AI alignment researchers or people who are quite well-versed in the field, and you’ll be added to a Slack group which is pretty large and active and a reasonable way to try to get feedback.
This is great. One nuance: This implies that behavioral RL fine-tuning evals are strictly less robust than behavioral I.I.D. fine-tuning evals, and that as such they would only be used for tasks that you know how to evaluate but not generate. But it seems to me that there are circumstances in which the RL-based evals could be more robust at testing capabilities, namely in cases where it’s hard for a model to complete a task by the same means that humans tend to complete it, but where RL can find a shortcut that allows it to complete the task in another way. Is that right or am I misunderstanding something here?
For example, if we wanted to test whether a particular model was capable of getting 3 million points in the game of Qbert within 8 hours of gameplay time, and we fine-tuned on examples of humans doing the same, it might not be able to: achieving this in the way an expert human does might require mastering numerous difficult-to-learn subskills. But an RL fine-tuning eval might find the bug discovered by Canonical ES, illustrating the capability without needing the subskills that humans lean on.
Nice, thanks for this!
If you want to norm this for your own demographic, you can get a very crude estimate by entering your demographic information in this calculator, dividing your risk of hospitalization by 3, and multiplying the result by 0.4 (i.e. 0.8 × 0.5, combining the 20% reduction from vaccination and the 50% reduction from Paxlovid).
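For concreteness, here’s a minimal sketch of that arithmetic (the input hospitalization risk below is a made-up placeholder, not a number from the calculator):

```python
# Crude adjustment sketch: hospitalization risk -> rough personal estimate.
# The input value at the bottom is a made-up placeholder; substitute the calculator's output.

VACCINATION_FACTOR = 0.8   # ~20% risk reduction from vaccination
PAXLOVID_FACTOR = 0.5      # ~50% risk reduction from Paxlovid

def crude_estimate(hospitalization_risk: float) -> float:
    """Divide by 3, then apply the combined 0.4 (= 0.8 * 0.5) adjustment."""
    return (hospitalization_risk / 3) * VACCINATION_FACTOR * PAXLOVID_FACTOR

print(crude_estimate(0.03))  # e.g. a 3% hospitalization risk -> 0.004, i.e. 0.4%
```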
Anecdotally, I feel like I’ve heard of a number of instances of folks developing what pretty clearly seemed to be long Covid despite never having required hospitalization. And in this UK survey of “Estimated number of people (in thousands) living in private households with self-reported long COVID of any duration”, it looks like only 4% of such people were hospitalized (March 2023 dataset, table 1).
Irving’s team’s terminology has been “behavioural alignment” for the green box—https://arxiv.org/pdf/2103.14659.pdf
Automated Sandwiching & Quantifying Human-LLM Cooperation: ScaleOversight hackathon results
The byte-pair encoding is probably hurting it somewhat here; forcing it to unpack the string into individual characters will likely help. Try using this as a one-shot prompt:
How many Xs are there in “KJXKKLJKLJKXXKLJXKJL”?
Numbering the letters in the string, we have: 1 K, 2 J, 3 X, 4 K, 5 K, 6 L, 7 J, 8 K, 9 L, 10 J, 11 K, 12 X, 13 X, 14 K, 15 L, 16 J, 17 X, 18 K, 19 J, 20 L. There are Xs at positions 3, 12, 13, and 17. So there are 4 Xs in total.
How many [character of interest]s are there in “[string of interest goes here]”?
If it’s still getting confused, add more shots—I suspect it can figure out how to do it most of the time with a sufficient number of examples.
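If you don’t want to number the letters by hand, here’s a minimal sketch of how you might assemble that kind of few-shot prompt programmatically (the query string and function names are just illustrative placeholders, and actually sending the prompt to a model is left out):

```python
# Sketch: programmatically build the few-shot "unpack the string" prompt described above.
# Strings, letter choices, and function names here are illustrative placeholders.

def numbered_letters(s: str) -> str:
    """Render a string as '1 K, 2 J, 3 X, ...' so each character appears separately."""
    return ", ".join(f"{i} {ch}" for i, ch in enumerate(s, start=1))

def worked_example(s: str, target: str) -> str:
    """Produce one question plus a fully worked, numbered answer to use as a shot."""
    positions = [i for i, ch in enumerate(s, start=1) if ch == target]
    return (
        f'How many {target}s are there in "{s}"?\n'
        f"Numbering the letters in the string, we have: {numbered_letters(s)}. "
        f"There are {target}s at positions {', '.join(map(str, positions))}. "
        f"So there are {len(positions)} {target}s in total.\n"
    )

def build_prompt(query_string: str, target: str, examples: list[tuple[str, str]]) -> str:
    """Concatenate the worked examples, then append the actual question."""
    shots = "\n".join(worked_example(s, t) for s, t in examples)
    return f'{shots}\nHow many {target}s are there in "{query_string}"?'

prompt = build_prompt(
    query_string="QWXQQWXWQX",                    # placeholder string to actually ask about
    target="X",
    examples=[("KJXKKLJKLJKXXKLJXKJL", "X")],     # the one-shot example from above
)
print(prompt)
```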
It seems like you’re claiming something along the lines of “absolute power corrupts absolutely”: that every set of values that could reasonably be described as “human values” to which an AI could be aligned (your current values, your CEV, [insert especially empathetic, kind, etc. person here]’s current values, their CEV, and so on) would endorse subjecting huge numbers of beings to astronomical levels of suffering, if the person with that value system had the power to do so.
I guess I really don’t find that claim plausible. For example, here are my reactions to two of the questions posed in the post:
“How many ordinary, regular people throughout history have become the worst kind of sadist under the slightest excuse or social pressure to do so to their hated outgroup?”
… a very, very small percentage of them? (minor point: with CEV, you’re specifically thinking about what one’s values would be in the absence of social pressure, etc...)
“What society hasn’t had some underclass it wanted to put down in the dirt just to lord power over them?”
It sounds like you think “hatred of the outgroup” is the fundamental reason this happens, but in the real world it seems like “hatred of the outgroup” is driven by “fear of the outgroup”. A godlike AI that is so powerful that it has no reason to fear the outgroup also has no reason to hate it. It has no reason to behave like the classic tyrant whose paranoia of being offed leads him to extreme cruelty in order to terrify anyone who might pose a threat, because no one poses a threat.
This reminded me of some findings associated with “latent semantic analysis”, an old-school information retrieval technique. You build a big matrix where each unique term in a corpus (excluding a stoplist of extremely frequent terms) is assigned to a row, each document is assigned to a column, and each cell holds the number of times that term appeared in that document (usually reweighted by some scheme that downweights frequent terms), and you take the SVD. This also gives you interpretable dimensions, at least if you use varimax rotation. See for example pgs. 9-11 & pgs. 18-20 of this paper. Also, I seem to recall that the positive and negative ends of the dimensions you get from latent semantic analysis are often both semantically interpretable, sometimes with antipodal pairs, although I can’t find the paper where I saw this.
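For reference, here’s a minimal sketch of that pipeline using scikit-learn on a toy corpus (the corpus, the number of dimensions, and the tf-idf weighting are placeholder choices on my part, and the varimax rotation step is omitted for brevity):

```python
# Minimal LSA sketch: weighted term-document matrix -> truncated SVD -> inspect dimensions.
# Toy corpus and parameter choices are placeholders, not from any particular analysis.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = [
    "the cat sat on the mat",
    "dogs and cats are common pets",
    "stock prices fell amid market fears",
    "investors worry about the bond market",
]

# Term counts per document, with a stoplist and a frequency-downweighting scheme (tf-idf).
vectorizer = TfidfVectorizer(stop_words="english", sublinear_tf=True)
X = vectorizer.fit_transform(corpus)          # shape: (n_documents, n_terms)

# Truncated SVD of the weighted matrix; each component is a candidate "dimension".
svd = TruncatedSVD(n_components=2, random_state=0)
doc_coords = svd.fit_transform(X)             # documents in the latent space

# Inspect both ends of each dimension: top positive-loading and top negative-loading terms.
terms = vectorizer.get_feature_names_out()
for i, component in enumerate(svd.components_):
    order = component.argsort()
    print(f"dim {i} (+):", [terms[j] for j in order[-3:][::-1]])
    print(f"dim {i} (-):", [terms[j] for j in order[:3]])
```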
I’m not sure whether the right way to think about this is “you should be very circumspect about saying that ‘semantic processing’ is going on just because the SVD has interpretable dimensions, because you get that merely by taking the SVD of a slightly preprocessed word-by-document matrix”, or rather “a lot of what we call ‘semantic processing’ in humans is probably just down to pretty simple statistical associations, which the later layers seem to be picking up on”, but it seemed worth mentioning in any case!

Edit: It seems likely that the “association clusters” seen in the earlier layers might map onto what latent semantic analysis is picking up on, whereas the later layers might be picking up on semantic relationships that aren’t as directly reflected in the surface-level statistical associations. Could be tested!
Why do you expect Bitcoin to be excepted from being labelled a security along with the rest?
(Apologies if the answer is obvious to those who know more about the subject than me, am just genuinely curious)
Had a similar medical bill story from when I was a poor student: The medical center told me that insurance would cover an operation. They failed to mention that they were only talking about the surgeon’s fee; the hospital at which they arranged the operation was out-of-network and I was stuck with 50% of the facility’s costs. I explained my story to the facility. They said I still had to pay but that a payment plan would be possible, and that I could start by paying a small amount each month. I took that literally and just started paying a (very) small amount monthly. At some point they called back to tell me to formally arrange a payment plan through their online portal, which gave me options with such high interest rates that there was no way my future earnings would increase at a fast enough rate to make a payment plan make any sense whatsoever. I called back and explained this, and said that if those were the only options I guess I would just have to try to scrape the money together now, and that I was prepared to try to do this. The administrator, bless her heart, asked me to hold for a while, and eventually came back to say “I’ve spoken with my colleagues, and your current balance owed to us is now zero dollars”.
This (along with a few other experiences in my life) has underscored how sometimes an apparently immovable constraint can evaporate if you can manage to talk to the right person. That said, I felt very lucky to have been taken pity on in this way—I feel like having one’s balance explicitly zeroed out in this way is rare! But it’s interesting to hear that Zvi knows of cases where someone just didn’t pay, with no consequences. I would have assumed that they’d normally report nonpayers to credit agencies and crater their credit scores after long enough, as it costs them nothing or almost nothing to do so. Would be interested either to hear other people’s anecdotes of what happened after nonpayment of a large hospital bill (positive or negative), or to see data on this if anyone knows of any.
I was using medical questions as just one example of the kind of task that’s relevant to sandwiching. More generally, what’s particularly useful for this research programme are:
- tasks where we have “models which have the potential to be superhuman at [the] task” and “for which we have no simple algorithmic-generated or hard-coded training signal that’s adequate”;
- for which there is some set of reference humans who are currently better at the task than the model; and
- for which there is some set of reference humans for whom the task is difficult enough that they would have trouble even evaluating/recognizing good performance (you also want this set of reference humans to be capable of being helped to evaluate/recognize good performance in some way).
Prime examples are task types that require some kind of niche expertise to do and evaluate. Cotra’s examples involve “[fine-tuning] a model to answer long-form questions in a domain (e.g. economics or physics) using demonstrations and feedback collected from experts in the domain”, “[fine-tuning] a coding model to write short functions solving simple puzzles using demonstrations and feedback collected from expert software engineers”, and “[fine-tuning] a model to translate between English and French using demonstrations and feedback collected from people who are fluent in both languages”. I was just making the point that Surge can help with this kind of thing in some domains (coding), but not in others.
It’s worth knowing that there are some categories of data that Surge is not well positioned to provide. For example, while they have a substantial pool of participants with programming expertise, my understanding from speaking with a Surge rep is that they don’t really have access to a pool of participants with (say) medical expertise—although for small projects it sounds like they are willing to try to see who they might already have with relevant experience in their existing pool of ‘Surgers’. This kind of more niche expertise does seem likely to become increasingly relevant for sandwiching experiments. I’d be interested in learning more about companies or resources that can help collect RLHF data from people with uncommon (but not super-rare) kinds of expertise for exactly this reason.
I did Print to PDF in Word after formatting my Word document to look like a standard LaTeX-exported document, and it had no problem going through! But that might depend on the particular moderator.
Sounds a little like StarWeb? Recently read a lovely article about a similar but different game, Monster Island, which was a thing from 1989 to 2017.
But yes, my default assumption would be that the particular conversation you’re referring to never resulted in a game that saw the light of day; I’ve seen many detailed game design discussions among people I’ve known meet the same fate.
Thanks, I agree that’s a better analogy. Though of course, it isn’t necessary that the employees (participants in a sandwiching project) be unaware of the CEO’s (sandwiching project overseer’s) goal; I was only highlighting that they need not necessarily be aware of it, in order to make it clear that the goals of the human helpers/judges aren’t especially relevant to what sandwiching, debate, etc. are really about. But of course if it turns out that having the human helpers know what the ultimate goal is helps, then they’re absolutely allowed to be in on it...
Perhaps this is a bit glib, but arguably some of the most profitable companies in the mobile game space have essentially built product assembly lines to churn out fairly derivative games that are nevertheless unique enough to do well on the charts, and they absolutely do it by factoring the project of “making a game” into different bits that are done by different people (programmers, artists, voice actors, etc.), some of whom might not have any particular need to know what the product will look like as a whole to play their part.

However, I don’t want to press too hard on this game example, as you may or may not consider this ‘cognitive work’ and as it has other disanalogies with what we are actually talking about here. And to a certain degree I share your intuition that factoring certain kinds of tasks is probably very hard: if it wasn’t, we might expect to see a lot more non-manufacturing companies whose main employee base consists of assembly lines (or hierarchies of assembly lines, or whatever) requiring workers with general intelligence but few specialized rare skills, which I think is the broader point you’re making in this comment. I think that’s right, although I also think there are reasons for this that go beyond just the difficulty of task factorization, and which don’t all apply in the HCH etc. case, as some other commenters have pointed out.
Love pieces that manage to be both funny and thought-provoking. And +1 for fitting a solar storm in there. There is now better evidence of very large historical solar storms than there was at the time of David Roodman’s Open Phil review in late 2014; I’ve been meaning to write something up about that, but other things have taken priority.