DeepMind Gemini Safety lead; Foundation board member
Dave Orr
I don’t work directly on pretraining, but when there were allegations of eval set contamination due to detection of a canary string last year, I looked into it specifically. I read the docs on prevention, talked with the lead engineer, and discussed with other execs.
So I have pretty detailed knowledge here. Of course GDM is a big complicated place and I certainly don’t know everything, but I’m confident that we are trying hard to prevent contamination.
I work at GDM so obviously take that into account here, but in my internal conversations about external benchmarks we take cheating very seriously—we don’t want eval data to leak into training data, and have multiple lines of defense to keep that from happening. It’s not as trivial as you might think to avoid, since papers and blog posts and analyses can sometimes have specific examples from benchmarks in them, unmarked—and while we do look for this kind of thing, there’s no guarantee that we will be perfect at finding them. So it’s completely possible that some benchmarks are contaminated now. But I can say with assurance that for GDM it’s not intentional and we work to avoid it.
We do hill climb on notable benchmarks and I think there’s likely a certain amount of overfitting going on, especially with LMSys these days, and not just from us.
I think the main thing that’s happening is that benchmarks used to be a reasonable predictor of usefulness, and mostly are not now, presumably because of Goodhart reasons. The agent benchmarks are pretty different in kind and I expect are still useful as a measure of utility, and probably will be until they start to get more saturated, at which point we’ll all need to switch to something else.
Humans have always been misaligned. Things now are probably significantly better in terms of human alignment than almost any time in history (citation needed) due to high levels of education and broad agreement about many things that we take for granted (e.g. the limits of free trade are debated but there has never been so much free trade). So you would need to think that something important was different now for there to be some kind of new existential risk.
One candidate is that as tech advances, the amount of damage a small misaligned group could do is growing. The obvious example is bioweapons—the number of people who could create a lethal engineered global pandemic is steadily going up, and at some point some of them may be evil enough to actually try to do it.
This is one of the arguments in favor of the AGI project. Whether you think it’s a good idea probably depends on your credences around human-caused xrisks versus AGI xrisk.
One tip for research of this kind is to not only measure recall, but also precision. It’s easy to block 100% of dangerous prompts by blocking 100% of prompts, but obviously that doesn’t work in practice. The actual task that labs are trying to solve is to block as many unsafe prompts as possible while rarely blocking safe prompts, or in other words, looking at both precision and recall.
Of course with truly dangerous models and prompts, you do want ~100% recall, and in that situation it’s fair to say that nobody should ever be able to build a bioweapon. But in the world we currently live in, the amount of uplift you get from a frontier model and a prompt in your dataset isn’t very much, so it’s reasonable to trade off against losses from over refusal.
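To make that concrete, here's a minimal sketch of scoring a prompt-blocking classifier on both axes; the labels, predictions, and `precision_recall` helper are all made up for illustration and don't reflect any real safety system:

```python
# Hypothetical evaluation of a prompt-blocking classifier.
# Each prompt is labeled unsafe (True) or safe (False); `blocked` is the
# classifier's decision. Blocking everything gets perfect recall but
# terrible precision, which is the failure mode described above.

def precision_recall(labels, blocked):
    true_pos = sum(1 for unsafe, b in zip(labels, blocked) if unsafe and b)
    false_pos = sum(1 for unsafe, b in zip(labels, blocked) if not unsafe and b)
    false_neg = sum(1 for unsafe, b in zip(labels, blocked) if unsafe and not b)
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall

# Toy data: 3 unsafe prompts, 7 safe prompts.
labels = [True, True, True] + [False] * 7
block_all = [True] * 10                        # blocks every prompt
sensible = [True, True, False] + [False] * 7   # misses one unsafe prompt

print(precision_recall(labels, block_all))  # (0.3, 1.0): perfect recall, poor precision
print(precision_recall(labels, sensible))   # (1.0, 0.67): far fewer over-refusals
```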
The pivotal act link is broken, fyi.
Gemini V2 (1206 experimental, which is the larger model) one-boxes, so… progress?
I’m probably too conflicted to give you advice here (I work on safety at Google DeepMind), but you might want to think through, at a gears level, what could concretely happen with your work that would lead to bad outcomes. Then you can balance that against positives (getting paid, becoming more familiar with model outputs, whatever).
You might also think about how your work compares to whoever would replace you on average, and what implications that might have as well.
This is great data! I’d been wondering about this myself.
Where were you measuring air quality? How far from the stove? Same place every time?
Practicing LLM prompting?
I haven’t heard the p zombie argument before, but I agree it’s at least some Bayesian evidence that we’re not in a sim.
1. We don’t know if simulated people will be p zombies.
2. I am not a p zombie [citation needed].
3. It would be very surprising if sims were not p zombies but everyone in the physical universe is.
4. Therefore the likelihood ratio of being conscious is higher for the real universe than a simulation.
Probably 3 needs to be developed further, but this is the first new piece of evidence I’ve seen since I first encountered the simulation argument in like 2005.
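Spelling out the update in step 4 as posterior odds (my framing, with P(real), P(sim), and P(conscious | ·) as shorthand for the hypotheses above):

```latex
\[
\frac{P(\text{real} \mid \text{I am conscious})}{P(\text{sim} \mid \text{I am conscious})}
=
\underbrace{\frac{P(\text{conscious} \mid \text{real})}{P(\text{conscious} \mid \text{sim})}}_{\text{likelihood ratio}}
\times
\frac{P(\text{real})}{P(\text{sim})}
\]
```

If premise 1 makes P(conscious | sim) less than 1 while premises 2 and 3 keep P(conscious | real) near 1, the likelihood ratio exceeds 1 and observing your own consciousness shifts the odds toward the physical universe.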
Are we playing the question game because the thread was started by Rosencranz? Is China doing well in the EV space a bad thing?
Is it the case that the tech would exist without him? I think that’s pretty unclear, especially for SpaceX, where despite other startups in the space, nobody else managed to radically reduce the cost per launch in a way that transformed the industry.
Even for Tesla, which seems more pedestrian (heh) now, there were a number of years where they had the only viable car in the market. It was only once they proved it was feasible that everyone else piled in.
Progress in ML looks a lot like, we had a different setup with different data and a tweaked algorithm and did better on this task. If you want to put an asterisk on o3 because it trained in some specific way that’s different from previous contenders, then basically every ML advance is going to have a similar asterisk. Seems like a lot of asterisking.
Hm, I think the main thrust of this post misses something, which is that different conditions, even contradictory conditions, can easily happen locally. Obviously, it can be raining in San Francisco and sunny in LA, and you can have one person wearing a raincoat in SF and the other on the beach in LA with no problem, even if they are part of the same team.
I think this is true of wealth inequality.
Carnegie or Larry Page or Warren Buffett got their money in a non-exploitative way, by being better than others at something that was extremely socially valuable. Part of what enables that is living in a society where capital is allocated by markets and there are clear price signals.
But many places in the world are not like this. Assad and Putin amassed their wealth via exploitative and extractive means. Wealth at the top in their societies is a tool of oppression.
I think this geographic heterogeneity implies that you should have one kind of program in the US (e.g. focused on market failures around goods with potentially very high negative externalities, like advanced AI) and another in e.g. Uganda, where direct cash transfers (if you are careful to ensure they don’t get expropriated by whoever the local oppressors are) could be very high impact.
It seems very strange to me to say that they cheated, when the public training set is intended to be used exactly for training. They did what the test specified! And they didn’t even use all of it.
The whole point of the test is that some training examples aren’t going to unlock the rest of it. What training definitely does is teach the model how to output the JSON in the right format, and likely how to think about what to even do with these visual puzzles.
Do we say that humans aren’t a general intelligence even though for ~all valuable tasks, you have to take some time to practice, or someone has to show you, before you can do it well?
Checking in on Scott’s image composition bet with Imagen 3
Why does RL necessarily mean that AIs are trained to plan ahead?
“Reliable fact recall is valuable, but why would o1 pro be especially good at it? It seems like that would be the opposite of reasoning, or of thinking for a long time?”
Current models were already good at identifying and fixing factual errors when run over a response and asked to critique and fix it. Works maybe 80% of the time to identify whether there’s a mistake, and can fix it at a somewhat lower rate.
So not surprising at all that a reasoning loop can do the same thing. Possibly there’s some other secret sauce in there, but just critiquing and fixing mistakes is probably enough to see the reported gains in o1.
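As a rough illustration, a critique-and-fix loop of this kind is just a few model calls in a row. The sketch below assumes a hypothetical `generate(prompt)` helper standing in for whatever model API you use; it is not a description of how o1 actually works:

```python
# Hypothetical critique-and-revise loop; `generate(prompt)` is a stand-in
# for a call to whatever LLM you have access to, not any specific API.

def generate(prompt: str) -> str:
    raise NotImplementedError("plug in your model call here")

def critique_and_fix(question: str, max_rounds: int = 3) -> str:
    answer = generate(question)
    for _ in range(max_rounds):
        critique = generate(
            f"Question: {question}\nAnswer: {answer}\n"
            "List any factual errors in the answer, or say NONE."
        )
        if critique.strip().upper().startswith("NONE"):
            break  # the critic found no mistakes; stop early
        answer = generate(
            f"Question: {question}\nAnswer: {answer}\n"
            f"Critique: {critique}\nRewrite the answer, fixing these errors."
        )
    return answer
```

Even with an imperfect critic (say, the ~80% detection rate mentioned above), a couple of rounds of this can recover a meaningful fraction of factual errors.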
Aha, thanks, that makes sense.
“There have been some relatively discontinuous jumps already (e.g. GPT-3, 3.5 and 4), at least from the outside perspective.”
These are firmly within our definition of continuity—we intend our approach to handle jumps larger than seen in your examples here.
Possibly a disconnect is that from an end user perspective a new release can look like a big jump, while from a developer perspective it was continuous.
Note also that continuous can still be very fast. And of course we could be wrong about discontinuous jumps.