I’m glad you asked this. I think there are many good suggestions by others. A few more:
1: Have explicit, written plans for various scenarios. When it makes sense, have drills for them. Make your plans public or, at least, solicit external feedback on them.
Examples of such scenarios:
Anthropic needs to pause due to RSP commitments
A model is caught executing a full-blown escape attempt
Model weights are stolen
A competing AI company makes credible claims about having AIs that would give it a decisive competitive advantage
2: Have a written list of assumptions you aim to maintain over each model’s lifecycle. Make your lists public or, at least, solicit external feedback on them. Iterate on them regularly. Communicate updates and violations at least internally.
These lists could vary based on ASL levels etc., and could include things like:
During the training of the model, we will continually evaluate the model on our evaluation suite for sabotage, sandbagging and scheming at [frequency].
We will solicit third-party evaluations of the model before internal deployment.
Throughout the model’s external deployment, we will have such-and-such monitoring schemes in place.
They could also include conditional statements (e.g. “if the model is [surprisingly capable] on [metric], we will do further investigation / take counter-measures ABC / …”); see the sketch below for what a machine-checkable version of such a list could look like. Cf. safety cases. I intend this to be less binding and formal than Anthropic’s RSP.
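To make the conditional-statement idea concrete, here is a minimal sketch of how a lifecycle-assumptions list could be kept in a machine-checkable form. This is purely illustrative: the `LifecycleAssumption` class, the metric names, thresholds, and responses are all hypothetical, not a description of Anthropic’s actual evaluation suite or RSP.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class LifecycleAssumption:
    """One assumption we aim to maintain over a model's lifecycle."""
    description: str                           # human-readable statement of the assumption
    check: Callable[[Dict[str, float]], bool]  # returns True while the assumption holds
    on_violation: str                          # pre-agreed response if the check fails

# Hypothetical example entries; metric names and thresholds are made up.
assumptions = [
    LifecycleAssumption(
        description="Sabotage/sandbagging/scheming evals run at least every 5 checkpoints.",
        check=lambda m: m.get("checkpoints_since_last_eval", 0.0) <= 5,
        on_violation="Pause training until the evaluation suite has been run.",
    ),
    LifecycleAssumption(
        description="The model is not surprisingly capable on the autonomy metric.",
        check=lambda m: m.get("autonomy_eval_score", 0.0) < 0.8,
        on_violation="Do further investigation and notify the safety team.",
    ),
]

def review(metrics: Dict[str, float]) -> None:
    """Report which assumptions currently hold and which call for escalation."""
    for a in assumptions:
        status = "OK" if a.check(metrics) else f"VIOLATED -> {a.on_violation}"
        print(f"{a.description}: {status}")

review({"checkpoints_since_last_eval": 7, "autonomy_eval_score": 0.6})
```

The point is not the code itself, but that writing the list in this form forces frequencies, thresholds, and responses to be pinned down in advance rather than decided ad hoc when a violation occurs.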
3: Keep external actors up to speed. At present, I expect that in many cases there are months of delay between when the first employees discover something and when it becomes publicly known (this applies to research, but also to more informal observations about model capabilities and properties). Months of delay are relatively long during a fast acceleration of AI R&D, and they shrink the number of actors who can effectively contribute.
This effect strengthens over time, so practicing and planning ahead seems prudent. Some ideas in that direction:
Provide regular updates about internal events and changes (via blog posts, streamed panel conversations, open Q&A sessions or similar)
Interviews, incident reporting and hotlines with external parties (as recommended here: https://arxiv.org/pdf/2407.17347)
Plan ahead for how to aggregate and communicate large amounts of output (once AI R&D has been considerably accelerated)
4: Invest in technical governance. As I understand it, there are various unsolved problems in technical governance (e.g. hardware-based verification methods for training runs), and progress on them would make international coordination easier. This seems like a particularly valuable R&D area to automate, and frontier AI companies like Anthropic are uniquely positioned to advance it. Consider working with technical governance experts on how to go about this.
Thank you for this post. I agree this is important, and I’d like to see improved plans.
Three comments on such plans.
1: Technical research and work.
(I broadly agree that the technical directions listed deserve priority.)
I’d want these plans to explicitly consider the effects of AI R&D acceleration, as those are significant. The speedups vary based on how constrained projects are on labor vs. compute; those that are mostly bottlenecked on labor could be massively sped up. (For instance, evaluations seem primarily labor-constrained to me.)
Lower labor costs have other implications as well, likely including for security (see also here) and for technical governance (making better verification methods technically feasible).
2: The high-level strategy
If I were to now write a plan for two-to-three-year timelines, the high-level strategy I’d choose is:
Don’t build generally vastly superhuman AIs. Use whatever technical methods we have now to control and align AIs which are less capable than that. Drastically speed up (technical) governance work with the AIs we have.[1] Push for governments and companies to enforce the no-vastly-superhuman-AIs rule.
Others might have different strategies; I’d like these plans to discuss what the high-level strategy or aims are.
3: Organizational competence
Reasoning transparency and a safety-first culture are mentioned in the post (in Layer 2), but I’d further prioritize and plan for organizational aspects, even when aiming for “the bare minimum”. Besides the general importance of organizational competence, there are two specific reasons for this:
If and when AI R&D acceleration is very fast, delays in information propagating to outsiders become more costly. That is: insofar as you want external actors to stay “in the loop” and be able to contribute, you need to put more effort into communicating what is happening internally.
Organizational competence and technical work are not fully at odds, as there are employees specialized in different things anyway.
(I think the responses to Evan Hubinger’s request for takes on what Anthropic should do differently have useful ideas for planning here.)
Note: I’m not technically knowledgeable about the field.