gabrielrecc

Karma: 239

gabrielrecc 15 May 2024 7:51 UTC
3 points
−1
in reply to: mishka’s comment on: Ilya Sutskever and Jan Leike resign from OpenAI
Leopold and Pavel were out (“fired for allegedly leaking information”) in April. https://www.silicon.co.uk/e-innovation/artificial-intelligence/openai-fires-researchers-558601

gabrielrecc 5 Sep 2023 5:24 UTC
2 points
0
on: Reproducing ARC Evals’ recent report on language model agents
Nice job! I’m working on something similar.
> Next, I might get my agent to attempt the last three tasks in the report
I wanted to clarify one thing: Are you building custom prompts for the different tasks? If so, I’d be curious to know how much effort you put into these (I’m generally curious how much of your agent’s ability to complete more tasks might be due to task-specific prompting, vs. the use of WebDriverIO and other affordances of your scaffolding). If not, isn’t getting the agent to attempt the last three tasks as simple as copy-pasting the task instructions from the ARC Evals task specs linked in the report, and completing the associated setup instructions?

gabrielrecc 30 Aug 2023 20:04 UTC
1 point
−3
on: Biosecurity Culture, Computer Security Culture
Cybersecurity seems in a pretty bad state globally—it’s not completely obvious to me that a historical norm of “people who discover things like SQL injection are pretty tight-lipped about them and share them only with governments / critical infrastructure folks / other cybersecurity researchers” would have led to a worse situation than the one we’re in cybersecuritywise...

gabrielrecc 31 Jul 2023 9:53 UTC
2 points
0
on: How to find AI alignment researchers to collaborate with?
I’d recommend participating in AGISF. Completely online/virtual, a pretty light commitment (I’d describe it more as a reading group than a course personally), cohorts are typically run by AI alignment researchers or people who are quite well-versed in the field, and you’ll be added to a Slack group which is pretty large and active and a reasonable way to try to get feedback.

gabrielrecc 30 Jul 2023 10:17 UTC
LW: 15 AF: 8
2
AF
on: When can we trust model evaluations?
This is great. One nuance: This implies that behavioral RL fine-tuning evals are strictly less robust than behavioral I.I.D. fine-tuning evals, and that as such they would only be used for tasks that you know how to evaluate but not generate. But it seems to me that there are circumstances in which the RL-based evals could be more robust at testing capabilities, namely in cases where it’s hard for a model to complete a task by the same means that humans tend to complete it, but where RL can find a shortcut that allows it to complete the task in another way. Is that right or am I misunderstanding something here?
For example, if we wanted to test whether a particular model was capable of getting 3 million points in the game of Qbert within 8 hours of gameplay time, and we fine-tuned on examples of humans doing the same, it might not be able to: achieving this in the way an expert human does might require mastering numerous difficult-to-learn subskills. But an RL fine-tuning eval might find the bug discovered by Canonical ES, illustrating the capability without needing the subskills that humans lean on.

gabrielrecc 7 May 2023 4:03 UTC
1 point
0
on: Long Covid Risks: 2023 Update
Nice, thanks for this!
If you want to norm this for your own demographic, you can get a very crude estimate by entering your demographic information in this calculator, dividing your risk of hospitalization by 3 and multiplying the total by 0.4 (which includes the 20% reduction from vaccination and the 50% reduction from Paxlovid)
Anecdotally, I feel like I’ve heard a number of instances of folks with what pretty clearly seemed to be long Covid coming on despite not having required hospitalization? And in this UK survey of “Estimated number of people (in thousands) living in private households with self-reported long COVID of any duration”, it looks like only 4% of such people were hospitalized (March 2023 dataset table 1)
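For concreteness, the crude estimate quoted above works out as follows. This is only a sketch of the arithmetic: the function name and the separate vaccination/Paxlovid flags are hypothetical, and the hospitalization risk is whatever number the calculator gives you.

```python
def crude_long_covid_risk(hospitalization_risk: float,
                          vaccinated: bool = True,
                          paxlovid: bool = True) -> float:
    """Crude estimate: hospitalization risk / 3, scaled by the quoted reductions."""
    multiplier = 1.0
    if vaccinated:
        multiplier *= 0.8  # the quoted 20% reduction from vaccination
    if paxlovid:
        multiplier *= 0.5  # the quoted 50% reduction from Paxlovid
    # With both reductions applied, the multiplier is 0.8 * 0.5 = 0.4.
    return hospitalization_risk / 3 * multiplier


# Hypothetical input: a 3% hospitalization risk from the calculator.
estimate = crude_long_covid_risk(0.03)
print(f"{estimate:.4f}")
```

Since the estimate scales directly with hospitalization risk, it would presumably understate the risk if most long Covid cases occur in people who were never hospitalized.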

gabrielrecc 27 Feb 2023 11:23 UTC
LW: 7 AF: 4
0
AF
on: A (EtA: quick) note on terminology: AI Alignment != AI x-safety
Irving’s team’s terminology has been “behavioural alignment” for the green box—https://arxiv.org/pdf/2103.14659.pdf

Automated Sandwiching & Quantifying Human-LLM Cooperation: ScaleOversight hackathon results

Esben Kran, Fazl, Sabrina Zaki, gabrielrecc and rz2383

23 Feb 2023 10:48 UTC

8 points

0 comments · 6 min read · LW link

gabrielrecc 7 Jan 2023 15:04 UTC
6 points
0
on: Can ChatGPT count?
The byte-pair encoding is probably hurting it somewhat here; forcing it to unpack it will likely help. Try using this as a one-shot prompt:

How many Xs are there in “KJXKKLJKLJKXXKLJXKJL”?

Numbering the letters in the string, we have: 1 K, 2 J, 3 X, 4 K, 5 K, 6 L, 7 J, 8 K, 9 L, 10 J, 11 K, 12 X, 13 X, 14 K, 15 L, 16 J, 17 X, 18 K, 19 J, 20 L. There are Xs at positions 3, 12, 13, and 17. So there are 4 Xs in total.

How many [character of interest]s are there in “[string of interest goes here]”?

If it’s still getting confused, add more shots—I suspect it can figure out how to do it most of the time with a sufficient number of examples.
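If you want to generate extra shots without numbering the letters by hand, something like the following should work. It is a rough sketch: the helper names are made up, and it only mirrors the wording of the example above rather than any official prompt format.

```python
def join_positions(positions):
    """Format [3, 12, 13, 17] as '3, 12, 13, and 17'."""
    items = [str(p) for p in positions]
    if len(items) <= 2:
        return " and ".join(items)
    return ", ".join(items[:-1]) + ", and " + items[-1]


def question(text, target):
    return f'How many {target}s are there in "{text}"?'


def worked_answer(text, target):
    """Spell out the answer letter by letter, as in the one-shot example."""
    numbered = ", ".join(f"{i} {ch}" for i, ch in enumerate(text, start=1))
    hits = [i for i, ch in enumerate(text, start=1) if ch == target]
    intro = f"Numbering the letters in the string, we have: {numbered}."
    if not hits:
        return f"{intro} There are no {target}s in the string. So there are 0 {target}s in total."
    if len(hits) == 1:
        return f"{intro} There is one {target}, at position {hits[0]}. So there is 1 {target} in total."
    return (f"{intro} There are {target}s at positions {join_positions(hits)}. "
            f"So there are {len(hits)} {target}s in total.")


def build_prompt(shots, query_text, query_target):
    """shots is a list of (string, character) pairs used as worked examples."""
    parts = []
    for text, target in shots:
        parts.append(question(text, target))
        parts.append(worked_answer(text, target))
    parts.append(question(query_text, query_target))
    return "\n\n".join(parts)


prompt = build_prompt([("KJXKKLJKLJKXXKLJXKJL", "X")], "QWQQRTQ", "Q")
print(prompt)
```

Passing more `(string, character)` pairs in `shots` gives you the few-shot version.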

gabrielrecc 24 Dec 2022 9:32 UTC
20 points
11
on: The case against AI alignment
It seems like you’re claiming something along the lines of “absolute power corrupts absolutely” … that every set of values that could reasonably be described as “human values” to which an AI could be aligned—your current values, your CEV, [insert especially empathetic, kind, etc. person here]’s current values, their CEV, etc. -- would endorse subjecting huge numbers of beings to astronomical levels of suffering, if the person with that value system had the power to do so.

I guess I really don’t find that claim plausible. For example, here is my reaction to the following two questions in the post:

“How many ordinary, regular people throughout history have become the worst kind of sadist under the slightest excuse or social pressure to do so to their hated outgroup?”

… a very, very small percentage of them? (minor point: with CEV, you’re specifically thinking about what one’s values would be in the absence of social pressure, etc...)

“What society hasn’t had some underclass it wanted to put down in the dirt just to lord power over them?”

It sounds like you think “hatred of the outgroup” is the fundamental reason this happens, but in the real world it seems like “hatred of the outgroup” is driven by “fear of the outgroup”. A godlike AI that is so powerful that it has no reason to fear the outgroup also has no reason to hate it. It has no reason to behave like the classic tyrant whose paranoia of being offed leads him to extreme cruelty in order to terrify anyone who might pose a threat, because no one poses a threat.

gabrielrecc 28 Nov 2022 21:56 UTC
17 points
2
on: The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable
This reminded me of some findings associated with “latent semantic analysis”, an old-school information retrieval technique. You build a big matrix where each unique term in a corpus (excluding a stoplist of extremely frequent terms) is assigned to a row, each document is assigned to a column, and each cell holds the number of times that term $t_{i}$ appeared in document $d_{j}$ (with some kind of weighting scheme that downweights frequent terms), and you take the SVD. This also gives you interpretable dimensions, at least if you use varimax rotation. See for example pgs. 9-11 & pgs. 18-20 of this paper. Also, I seem to recall that the positive and negative ends of the singular vectors after doing latent semantic analysis are often both semantically interpretable, sometimes with antipodal pairs, although I can’t find the paper where I saw this.

I’m not sure whether the right way to think about this is “you should be very circumspect about saying that ‘semantic processing’ is going on just because the SVD has interpretable dimensions, because you get that merely by taking the SVD of a slightly preprocessed word-by-document matrix”, or rather “a lot of what we call ‘semantic processing’ in humans is probably just down to pretty simple statistical associations, which the later layers seem to be picking up on”, but it seemed worth mentioning in any case!
edit: seems likely that the “association clusters” seen in the earlier layers might map onto what latent semantic analysis is picking up on, whereas the later layers might be picking up on semantic relationships that aren’t as directly reflected in the surface-level statistical associations. could be tested!
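For anyone who wants to try this, here is a minimal sketch of the latent semantic analysis pipeline described above. It assumes NumPy, and the toy corpus, the stoplist, and the log-tf times idf weighting are all illustrative choices of mine rather than anything from the linked paper. Varimax rotation is left out, so the dimensions are the raw SVD ones.

```python
import numpy as np

# Toy corpus and stoplist: illustrative assumptions only.
DOCS = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "stocks fell as the market dropped",
    "the market rallied and stocks rose",
]
STOPLIST = {"the", "as", "and"}


def term_document_matrix(docs, stoplist):
    """Rows are terms, columns are documents, cells are raw counts."""
    tokenized = [[w for w in d.lower().split() if w not in stoplist] for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(tokenized):
        for w in doc:
            counts[index[w], j] += 1
    return vocab, counts


def weight(counts):
    """Log-scaled counts times an idf factor that downweights frequent terms."""
    n_docs = counts.shape[1]
    doc_freq = (counts > 0).sum(axis=1)  # every vocab term occurs at least once
    idf = np.log(n_docs / doc_freq) + 1.0
    return np.log1p(counts) * idf[:, None]


def lsa(docs, stoplist, k):
    """Return the vocabulary and the top-k SVD factors of the weighted matrix."""
    vocab, counts = term_document_matrix(docs, stoplist)
    U, S, Vt = np.linalg.svd(weight(counts), full_matrices=False)
    return vocab, U[:, :k], S[:k], Vt[:k]


def extreme_terms(vocab, U, dim, n=3):
    """Terms at the positive and negative ends of one dimension.

    The overall sign of a singular vector is arbitrary, so which end counts
    as 'positive' carries no meaning by itself.
    """
    order = np.argsort(U[:, dim])
    negative = [vocab[i] for i in order[:n]]
    positive = [vocab[i] for i in order[::-1][:n]]
    return positive, negative


vocab, U, S, Vt = lsa(DOCS, STOPLIST, k=2)
for dim in range(2):
    print(dim, extreme_terms(vocab, U, dim))
```

Comparing the extreme terms per dimension here against those from the transformer weight matrices would be one crude way to check the overlap.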

gabrielrecc 23 Nov 2022 10:47 UTC
1 point
in reply to: TropicalFruit’s comment on: People Will Listen
Why do you expect Bitcoin to be excepted from being labelled a security along with the rest?
(Apologies if the answer is obvious to those who know more about the subject than me, am just genuinely curious)

gabrielrecc 10 Nov 2022 15:52 UTC
5 points
0
on: Covid 11/10/22: Into the Background
Had a similar medical bill story from when I was a poor student: Medical center told me that insurance would cover an operation. They failed to mention that they were only talking about the surgeon’s fee; the hospital at which they arranged the operation was out-of-network and I was stuck with 50% of the facility’s costs. I explained my story to the facility. They said I still had to pay but that a payment plan would be possible, and that I could start by paying a small amount each month. I took that literally and just started paying a (very) small amount monthly. At some point they called back to tell me to formally arrange a payment plan through their online portal, which gave me options with such high interest rates that there was no way my future earnings would increase at a fast enough rate to make a payment plan make any sense whatsoever. I called back and explained this, and said that if those were the only options I guess I would just have to try to scrape the money together now, and that I was prepared to try to do this. The administrator, bless her heart, asked me to hold for a while, and eventually came back to say “I’ve spoken with my colleagues, and your current balance owed to us is now zero dollars”.

This (along with a few other experiences in my life) has underscored how sometimes an apparently immovable constraint can evaporate if you can manage to talk to the right person. That said, I felt very lucky to have been taken pity on in this way—I feel like having one’s balance explicitly zeroed out in this way is rare! But it’s interesting to hear that Zvi knows of cases where someone just didn’t pay, with no consequences. I would have assumed that they’d normally report nonpayers to credit agencies and crater their credit scores after long enough, as it costs them nothing or almost nothing to do so. Would be interested either to hear other people’s anecdotes of what happened after nonpayment of a large hospital bill (positive or negative), or to see data on this if anyone knows of any.

gabrielrecc 28 Sep 2022 16:58 UTC
3 points
2
in reply to: ChristianKl’s comment on: Why we’re not founding a human-data-for-alignment org
I was using medical questions as just one example of the kind of task that’s relevant to sandwiching. More generally, what’s particularly useful for this research programme are
- tasks where we have “models which have the potential to be superhuman at [the] task”, and “for which we have no simple algorithmic-generated or hard-coded training signal that’s adequate”; and
- for which there is some set of reference humans who are currently better at the task than the model;
- and for which there is some set of reference humans for whom the task is difficult enough that they would have trouble even evaluating/recognizing good performance. (you also want this set of reference humans to be capable of being helped to evaluate/recognize good performance in some way)
Prime examples are task types that require some kind of niche expertise to do and evaluate. Cotra’s examples involve “[fine-tuning] a model to answer long-form questions in a domain (e.g. economics or physics) using demonstrations and feedback collected from experts in the domain”, “[fine-tuning] a coding model to write short functions solving simple puzzles using demonstrations and feedback collected from expert software engineers”, “[fine-tuning] a model to translate between English and French using demonstrations and feedback collected from people who are fluent in both languages”. I was just making the point that Surge can help with this kind of thing in some domains (coding), but not in others.

gabrielrecc 28 Sep 2022 12:29 UTC
10 points
1
on: Why we’re not founding a human-data-for-alignment org
It’s worth knowing that there are some categories of data that Surge is not well positioned to provide. For example, while they have a substantial pool of participants with programming expertise, my understanding from speaking with a Surge rep is that they don’t really have access to a pool of participants with (say) medical expertise—although for small projects it sounds like they are willing to try to see who they might already have with relevant experience in their existing pool of ‘Surgers’. This kind of more niche expertise does seem likely to become increasingly relevant for sandwiching experiments. I’d be interested in learning more about companies or resources that can help collect RLHF data from people with uncommon (but not super-rare) kinds of expertise for exactly this reason.

gabrielrecc 1 Sep 2022 10:01 UTC
2 points
1
in reply to: Neel Nanda’s comment on: Your posts should be on arXiv
I did Print to PDF in Word after formatting my Word document to look like a standard LaTeX-exported document, and it had no problem going through! But it might depend on the particular moderator.

gabrielrecc 10 Aug 2022 8:40 UTC
1 point
0
in reply to: Richard_Kennaway’s comment on: The lessons of Xanadu
Sounds a little like StarWeb? Recently read a lovely article about a similar but different game, Monster Island, which was a thing from 1989 to 2017.
But yes, my default assumption would be that the particular conversation you’re referring to never resulted in a game that saw the light of day; I’ve seen many detailed game design discussions among people I’ve known meet the same fate.

gabrielrecc 6 Aug 2022 17:01 UTC
3 points
0
in reply to: johnswentworth’s comment on: Rant on Problem Factorization for Alignment
Thanks, I agree that’s a better analogy. Though of course, it isn’t necessary that the employees (participants in a sandwiching project) be unaware of the CEO’s (sandwiching project overseer’s) goal; I was only highlighting that they need not necessarily be aware of it in order to make it clear that the goals of the human helpers/judges aren’t especially relevant to what sandwiching, debate, etc. is really about. But of course if it turns out that having the human helpers know what the ultimate goal is helps, then they’re absolutely allowed to be in on it...

Perhaps this is a bit glib, but arguably some of the most profitable companies in the mobile game space have essentially built product assembly lines to churn out fairly derivative games that are nevertheless unique enough to do well on the charts, and they absolutely do it by factoring the project of “making a game” into different bits that are done by different people (programmers, artists, voice actors, etc.), some of whom might not have any particular need to know what the product will look like as a whole to play their part.
However, I don’t want to press too hard on this game example as you may or may not consider this ‘cognitive work’ and as it has other disanalogies with what we are actually talking about here. And to a certain degree I share your intuition that factoring certain kinds of tasks is probably very hard: if it wasn’t, we might expect to see a lot more non-manufacturing companies whose main employee base consists of assembly lines (or hierarchies of assembly lines, or whatever) requiring workers with general intelligence but few specialized rare skills, which I think is the broader point you’re making in this comment. I think that’s right, although I also think there are reasons for this that go beyond just the difficulty of task factorization, and which don’t all apply in the HCH etc. case, as some other commenters have pointed out.

gabrielrecc 6 Aug 2022 10:00 UTC
4 points
0
on: Rant on Problem Factorization for Alignment
> We start with some ML model which has lots of knowledge from many different fields, like GPT-n. We also have a human who has a domain-specific problem to solve (like e.g. a coding problem, or a translation to another language) but lacks the relevant domain knowledge (e.g. coding skills, or language fluency). The problem, roughly speaking, is to get the ML model and the human to work as a team, and produce an outcome at-least-as-good as a human expert in the domain. In other words, we want to factorize the “expert knowledge” and the “having a use-case” parts of the problem.
> ...
> This sort of problem comes up all the time in real-world businesses. We could just as easily consider a product designer at a tech startup (who knows what they want but little about coding), an engineer (who knows lots about coding but doesn’t understand what the designer wants)...

These examples conflate “what the human who provided the task to the AI+human combined system wants” with “what the human who is working together with the AI wants” in a way that I think is confusing and sort of misses the point of sandwiching. In sandwiching, “what the human wants” is implicit in the choice of task, but the “what the human wants” part isn’t really what is being delegated or factored off to the human who is working together with the AI; what THAT human wants doesn’t enter into it at all. Using Cotra’s initial example to belabor the point: if someone figured out a way to get some non-medically-trained humans to work together with a mediocre medical-advice-giving AI in such a way that the output of the combined human+AI team is actually good medical advice, it doesn’t matter whether those non-medically-trained humans actually care that the result is good medical advice; they might not even individually know what the purpose of the system is, and just be focused on whatever their piece of the task is—say, verifying the correctness of individual steps of a chain of reasoning generated by the system, or checking that each step logically follows from the previous, or whatever. Of course this might be really time intensive, but if you can improve even slightly on the performance of the original mediocre system, then hopefully you can train a new AI system to match the performance of the original AI+human system by imitation learning, and bootstrap from there.
The point, as I understand it, is that if we can get human+AI systems to progress from “mediocre” to “excellent” (in other words, to remain aligned with the designer’s goal) -- despite the fact that the only feedback involved is from humans who wouldn’t even be mediocre at achieving the designer’s goal if they were asked to do it themselves—and if we can do it in a way that generalizes across all kinds of tasks, then that would be really promising. To me, it seems hard enough that we definitely shouldn’t take a few failed attempts as evidence that it can’t be done, but not so hard as to seem obviously impossible.

gabrielrecc 22 Jun 2022 9:11 UTC
4 points
in reply to: Nanda Ale’s comment on: Common but neglected risk factors that may let you get Paxlovid
I just shared this info with an immune-compromised relative, thanks so much for this.
