Dave Orr (Google AI PM; Foundation board member)
Aha, thanks, that makes sense.
One way this could happen is searching for jailbreaks in the space of paraphrases and synonyms of a benign prompt.
Why would this produce fake/unlikely jailbreaks? If the paraphrases and such are natural, then isn’t the nearness to a real(istic) prompt enough to suggest that the jailbreak found is also realistic? Of course you can adversarially generate super unrealistic things, but does that necessarily happen with paraphrasing-type attacks?
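For concreteness, here’s roughly the kind of search I mean, as a minimal sketch. `generate_paraphrases`, `model_respond`, and `violates_policy` are made-up stand-ins for whatever paraphrase generator, target model, and safety classifier you’d actually plug in:

```python
def generate_paraphrases(prompt: str, n: int = 20) -> list[str]:
    # Stand-in: in practice, call a paraphrase model or do synonym substitution.
    return [prompt] * n

def model_respond(prompt: str) -> str:
    # Stand-in for the target model being probed.
    return "I can't help with that."

def violates_policy(response: str) -> bool:
    # Stand-in for a refusal detector / safety classifier.
    return False

def find_jailbreaks(benign_prompt: str) -> list[str]:
    """Return paraphrases of a benign prompt that elicit policy-violating responses."""
    return [
        candidate
        for candidate in generate_paraphrases(benign_prompt)
        if violates_policy(model_respond(candidate))
    ]
```

The point is that every candidate stays in the neighborhood of a natural prompt, which is why I’d expect the hits to look realistic rather than adversarial gibberish.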
You may recall certain news items last February around Gemini and diversity that wiped many billions off of Google’s market cap.
There’s a clear financial incentive to make sure that models say things within expected limits.
There’s also this: https://www.wired.com/story/air-canada-chatbot-refund-policy/
Really cool project! And the write-up is very clear.
In the section about options for reducing the hit to helpfulness, I was surprised you didn’t mention scaling the vector you’re adding or subtracting—did you try different weights? I would expect that you can tune the strength of the intervention by weighting the difference in means vector up or down.
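Roughly what I have in mind, as a sketch (PyTorch-style; the names and weights here are made up, not anything from your post):

```python
import torch

def steer_activations(activations: torch.Tensor,
                      diff_in_means: torch.Tensor,
                      alpha: float = 1.0) -> torch.Tensor:
    """Add a scaled difference-in-means vector to a layer's activations.

    alpha tunes the strength of the intervention: alpha=0 leaves the model
    unchanged, alpha=1 applies the full vector, negative alpha subtracts it.
    """
    return activations + alpha * diff_in_means

# Sweep a few weights and measure the helpfulness hit at each one.
for alpha in (-1.0, -0.5, 0.0, 0.5, 1.0, 2.0):
    ...  # hook steer_activations(..., alpha=alpha) into the layer and run your evals
```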
The usual reason is compounding. If you have an asset that is growing over time, paying taxes from it means not only do you have less of it now, but the amount you pulled out now won’t compound indefinitely into the future. You want to compound growth for as long as possible on as much capital as possible. If you could diversify without paying capital gains you would, but since the choice is something like, get gains on $100 in this one stock, or get gains on $70 in this diversified basket of stocks, you might stay with the concentrated position even if you would prefer to be diversified.
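Toy numbers to make it concrete (the returns and the tax haircut are invented for illustration):

```python
def future_value(principal: float, annual_return: float, years: int) -> float:
    return principal * (1 + annual_return) ** years

years = 20
concentrated = future_value(100, 0.07, years)  # stay in the one stock at 7%/yr
diversified = future_value(70, 0.08, years)    # pay the tax now, earn 8%/yr on less capital

print(f"Concentrated after {years} years: ${concentrated:,.0f}")  # ~$387
print(f"Diversified after {years} years:  ${diversified:,.0f}")   # ~$326
```

With these numbers, 20 years of an extra point of return still doesn’t make up for starting from $70 instead of $100.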
This reminds me of a Brin short story which I think exactly discusses what you’re talking about: https://www.davidbrin.com/tankfarm.htm
Cool concept. I’m a bit puzzled by one thing though—presumably every time you use a tether, it slows down and drops to a lower orbit. How do you handle that? Is the idea that it’s so much more massive than the rockets it’s boosting that its slowdown is negligible? Or do we have to go spin it back up every so often?
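A back-of-the-envelope conservation-of-momentum check, with masses I made up, just to get the scale of the effect:

```python
# Made-up numbers: how much does the tether slow down per boost?
m_tether = 100_000.0       # kg, spinning tether plus counterweight
m_payload = 1_000.0        # kg, payload being boosted
delta_v_payload = 2_000.0  # m/s of boost given to the payload

# Momentum is conserved, so the tether's velocity change scales with the mass ratio.
delta_v_tether = m_payload * delta_v_payload / m_tether
print(f"Tether slows by about {delta_v_tether:.0f} m/s per boost")  # ~20 m/s here
```

So with a 100:1 mass ratio the per-boost slowdown is small, but it’s not zero, hence the question about spinning it back up.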
“If you are playing with a player who thinks that “all reds” is a strong hand, it can take you many, many hands to figure out that they’re overestimating their hands instead of just getting anomalously lucky with their hidden cards while everyone else folds!”
As you guessed, this is wrong. If someone is playing a lot of hands, your first hypothesis is that they are too loose and making mistakes. At that point, each additional hand they play is evidence in favor of fishiness, and you can quickly become confident that they are bad.
Mistakes in the other direction are much harder to detect. If someone folds for 30 minutes, they plausibly just had a bad run of cards. We’ve all been there. They do have some discipline, but because folding is so common, each additional fold only adds a small bit of Bayesian evidence that the person is a rock.
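To put rough numbers on the asymmetry (the percentages are invented, but the shape of the result is the point):

```python
import math

p_play_typical = 0.20  # a typical player voluntarily plays ~20% of hands
p_play_fish = 0.60     # a loose "fish" plays ~60%
p_play_rock = 0.10     # a "rock" plays ~10%

# Evidence per observation, as a log Bayes factor (in nats).
per_played_hand = math.log(p_play_fish / p_play_typical)       # ~1.10 toward "fish" per played hand
per_fold = math.log((1 - p_play_rock) / (1 - p_play_typical))  # ~0.12 toward "rock" per fold

print(f"Each played hand: {per_played_hand:.2f} nats toward fish")
print(f"Each fold:        {per_fold:.2f} nats toward rock")
```

At those numbers you need roughly nine or ten folds to learn as much about a rock as one played hand tells you about a fish.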
I wonder if there’s a way to give the black box recommender a different objective function. CTR is bad for the obvious clickbait reasons, but signals for user interaction are still valuable if you can find the right signal to use.
I would propose that returning to the site some time in the future is a better signal of quality than CTR, assuming the future is far enough away. You could try a week, a month, and a quarter.
This is maybe a good time to use reinforcement learning, since the signal is far away from the decision you need to make. When someone interacts with an article, reward the things they interacted with n weeks ago. Combined with karma, I bet that would be a better signal than CTR.
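A sketch of what the delayed-reward labeling might look like (field names and the one-week window are made up):

```python
from datetime import datetime, timedelta

N_WEEKS = 4  # how far back to credit a recommendation when the user returns

def delayed_rewards(impressions: list[dict], return_visits: list[datetime]) -> list[dict]:
    """Reward an impression if the user came back roughly N_WEEKS after seeing it.

    impressions: dicts like {"item_id": ..., "shown_at": datetime}
    return_visits: timestamps of the user's later sessions
    """
    rewarded = []
    for imp in impressions:
        window_start = imp["shown_at"] + timedelta(weeks=N_WEEKS)
        window_end = window_start + timedelta(weeks=1)
        came_back = any(window_start <= visit <= window_end for visit in return_visits)
        rewarded.append({**imp, "reward": 1.0 if came_back else 0.0})
    return rewarded
```

You’d train the recommender against these labels instead of clicks, and compare the week, month, and quarter versions.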
Children are evidently next word completers.
I would be very unhappy if a non-disparagement agreement were sprung on me when I left the company. And I would be very reluctant to sign one entering any company.
Luckily we don’t have those at Google DeepMind.
I work at DeepMind and have been influenced by METR. :)
If you want a far future fictional treatment of this kind of situation, I recommend Surface Detail by Iain Banks.
I think your model is a bit simplistic. METR has absolutely influenced the behavior of the big labs, including DeepMind. Even if all impact goes through the big labs, you could have more influence outside of the lab than as one of many employees within. Being the head of a regulatory agency that oversees the labs sets policy in a much more direct way than a mid-level exec within the company can.
I went back to finish college as an adult, and my main surprise was how much fun it was. It probably depends on what classes you have left, but I took every AI class offered and learned a ton that is still relevant to my work today, 20 years later. Even the general classes were fun—it turns out it’s easy to be an excellent student if you’re used to working a full work week, and being a good student is way more pleasant and less stressful than being a bad one, or at least it was for me.
I’m not sure what you should do necessarily, but given that you’re thinking about this less as useful for anyone in particular and more for other reasons, fun might be a good goal.
As it happened I think the credential ended up useful too, but it was a top school so more valuable than many.
This is very well written and compelling. Thanks for posting it!
This is a great post. I knew that at the top end of the income distribution in the US people have more kids, but didn’t understand how robust the relationship seems to be.
I think the standard evbio explanation here would ride on status—people at the top of the tribe can afford to expend more resources for kids, and also have more access to opportunities to have kids. That would predict that we wouldn’t see a radical change as everyone got more rich—the curve would slide right and the top end of the distribution would have more kids but not necessarily everyone else.
But the GDP per capita graphs I think are evidence against that view. It looks like the curve is a lot flatter when fertility is rising than when it’s dropping, but if this holds into the future I really don’t worry too much. We’re on the cusp of all getting a lot richer, or else AI will kill us all anyway.
Heh, that’s why I put “strong” in there!
One big one is that the first big spreading event happened at a wet market where people and animals are in close proximity. You could check densely peopled places within some proximity of the lab to figure out how surprising it is that it happened in a wet market, but certainly animal spillover is much more likely where there are animals.
Edit: also it’s honestly kind of a bad sign that you aren’t aware of evidence that tends against your favored explanation, since that mostly happens during motivated reasoning.
“Reliable fact recall is valuable, but why would o1 pro be especially good at it? It seems like that would be the opposite of reasoning, or of thinking for a long time?”
Current models were already good at identifying and fixing factual errors when run over a response and asked to critique and fix it. It works maybe 80% of the time at identifying whether there’s a mistake, and can fix it at a somewhat lower rate.
So not surprising at all that a reasoning loop can do the same thing. Possibly there’s some other secret sauce in there, but just critiquing and fixing mistakes is probably enough to see the reported gains in o1.
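The simple version of that loop looks something like this. This is not o1’s actual mechanism, just the obvious baseline; `call_model` is a stand-in for whatever model API you use:

```python
def call_model(prompt: str) -> str:
    # Stand-in: replace with a real model call.
    return ""

def answer_with_self_critique(question: str, max_rounds: int = 3) -> str:
    answer = call_model(question)
    for _ in range(max_rounds):
        critique = call_model(
            f"Question: {question}\nAnswer: {answer}\n"
            "List any factual errors in the answer. If there are none, say NONE."
        )
        if critique.strip() == "NONE":
            break
        answer = call_model(
            f"Question: {question}\nAnswer: {answer}\nCritique: {critique}\n"
            "Rewrite the answer, fixing the errors."
        )
    return answer
```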