I have signed no contracts or agreements whose existence I cannot mention.
plex
Agree that takeoff speeds are more important, and expect that FrontierMath has much less effect on takeoff speed. I still think timelines matter enough that the amount of relevantly informing people you buy from this is likely not worth the cost, especially if the org is avoiding talking about risks in public and leadership isn’t focused on agentic takeover, so the information isn’t packaged with the context it needs to have the effects that would help.
Evaluating the final model tells you where you got to. Evaluating many small models and checkpoints helps you get further faster.
Even outside of its argument against the Control paradigm, this post (esp. The Model, The Problem, and The Median Doom-Path: Slop, not Scheming) covers some really important ideas, which I think people working on many empirical alignment agendas would benefit from being aware of.
One neat thing I’ve explored is learning about new therapeutic techniques by dropping a whole book into context and asking for guiding phrases. Most therapy books spend a lot of time covering general principles of minds and how to work with them, with the technique’s unique aspects buried in a way that is not very efficient for someone who already has the universal ideas. Getting guiding phrases gives a good starting point for the specific shape of a technique, and means you can start using it pretty quickly. My project system prompt is:
Given the name of, and potentially documentation on, an introspective or therapeutic practice, generate a set of guiding phrases for facilitators. These phrases should help practitioners guide participants through deep exploration, self-reflection, and potential transformation. If you don’t know much about the technique or the documentation is insufficient, feel free to ask for more information. Please explain what you know about the technique, especially the core principles and things relevant to generating guiding phrases, first.
Consider the following:
Understand the practice’s core principles, goals, and methods.
Create open-ended prompts that invite reflection and avoid simple yes/no answers.
Incorporate awareness of physical sensations, emotions, and thought patterns.
Develop phrases to navigate unexpected discoveries or resistances.
Craft language that promotes non-judgmental observation of experiences.
Generate prompts that explore contradictions or conflicting beliefs.
Encourage looking beyond surface-level responses to deeper insights.
Help participants relate insights to their everyday lives and future actions.
Include questions that foster meta-reflection on the process itself.
Use metaphorical language when appropriate to conceptualize abstract experiences.
Ensure phrases align with the specific terminology and concepts of the practice.
Balance providing guidance with allowing space for unexpected insights.
Consider ethical implications and respect appropriate boundaries.
Aim for a diverse set of phrases that can be used flexibly throughout the process. The goal is to provide facilitators with versatile tools that enhance the participant’s journey of self-discovery and growth.
Example (adapt based on the specific practice):
“As you consider [topic], what do you notice in your body?”
“If that feeling had a voice, what might it say?”
“How does holding this belief serve you?”
“What’s alive for you in this moment?”
“How might this insight change your approach to [relevant aspect of life]?”
Remember, the essence is to create inviting, open-ended phrases that align with the practice’s core principles and facilitate deep, transformative exploration.
Please store your produced phrases in an artefact.
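(If you want to run the same workflow through the API rather than a Claude Project, a minimal sketch is below. The model name, file path, and truncated prompt string are placeholders, and the artefact instruction only does anything in the Claude UI.)

```python
import anthropic

# SYSTEM_PROMPT is the project prompt quoted above (truncated here);
# the book path and model name are placeholders.
SYSTEM_PROMPT = "Given the name of, and potentially documentation on, an introspective or therapeutic practice, ..."

with open("focusing_book.txt") as f:   # the whole book dropped into context
    book_text = f.read()

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=2000,
    system=SYSTEM_PROMPT,
    messages=[{
        "role": "user",
        "content": f"Technique: Focusing\n\nDocumentation:\n{book_text}\n\n"
                   "Please generate the guiding phrases.",
    }],
)
print(response.content[0].text)
```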
I’m guessing you view having a better understanding of what’s coming as very high value, enough that burning some runway is acceptable? I could see that model (though I put <15% on it), but I think it is at least not good, integrity-wise, to have put on the appearance of doing just the good-for-x-risk part and not sharing it as an optimizable benchmark, while being funded by and giving the data to people who will use it for capability advancements.
Being able to run evaluations on demand, as intensively as they like, lets them test small models for architecture improvements. This is where the vast majority of the capability gain is.
Getting an evaluation of each final model is going to be way less useful for the research cycle, as it only gives a final score, not a metric which is part of the feedback loop.
However, we have a verbal agreement that these materials will not be used in model training.
If by this you mean “OpenAI will not train on this data”, that doesn’t address the vast majority of the concern. If OpenAI is evaluating models against the data, they will be able to more effectively optimize for capabilities advancement, and that’s a betrayal of the trust of the people who worked on this with the understanding that it would be used only outside of the research loop, to check for dangerous advancements, and particularly not to make those dangerous advancements come sooner by giving OpenAI another number to optimize for.
If you mean OpenAI will not be internally evaluating models on this to improve and test the training process, please state this clearly in writing (and maybe explain why they got privileged access to the data despite being prohibited from the obvious use of that data).
Really high-quality, high-difficulty benchmarks are much scarcer and more important for advancing capabilities than training data alone. Having an apparently x-risk-focused org build a benchmark, implying it’s for evaluating danger from highly capable models in a way the capabilities orgs can’t use to test their models, then having it turn out it was secretly funded by OpenAI with OpenAI getting access to most of the data, is very sketchy.
Some people who contributed questions likely thought they would be reducing x-risk by helping build bright line warning signs. Their work being available to OpenAI will mostly have increased x-risk by giving the capabilities people an unusually important number-goes-up to optimize for, bringing timelines to dangerous systems closer. That’s a betrayal of trust, and Epoch should do some serious soul searching about taking money to do harmful things.
This is usually a good idea, but it’s critically important when using skills like those described in Listening to Wisdom, in a therapeutic relationship (including many forms of coaching), or while under the influence of substances that increase your rate of cognitive change and lower barriers to information inflow (such as psychedelics).
If you’re opening yourself up to receive the content of those vibes in an emotional/embodied/deep way, and those vibes are bad, this can be toxic to an extent you will not be expecting (even if you try to account for this warning).
Do not do mind-meld-like techniques/drugs/therapy with people your system is throwing unexplained warnings about. Instead, step out of the situation and investigate any such warnings at a safe distance, with the possibility of a “nope” and disengaging if the warning is still flashing (even if you don’t get clarity on its source).
Maybe exact evaluations not being trivial is not entirely a bug; it might make the game more interesting (though maybe more annoying)?
I recommend most readers skip this subsection on a first read; it’s not very central to explaining the alignment problem.
Suggest either putting this kind of aside in a footnote, or giving the reader a handy link to the next section for convenience?
Nice!
(I wrote the bit about not having to tell people your favourite suit or what cards you have because it leaves things open for some sharp or clever negotiation, but looking back I think it’s mostly a trap. I haven’t seen anyone get things to go better for them by hiding their suit.)
To add another layer to this strategy: giving each person one specific card in their suit that they want much more strongly might be fun, as the other players can ransom that card if they know about it (but might be happy trading it anyway). Also, having each of the four suits carry a different multiplier might be fun?
On one side: humanoid robots have a much higher density of parts requiring machine-time than cars, which probably slows things down a bunch.
On the other, you mention assuming no speed-up from the robots building robot factories, but this seems like the dominant factor in the growth. Your numbers are going to be underestimating things pretty quickly without that. I’d be interested in what those numbers look like under reasonable guesses about the robot workforce being part of a feedback cycle.
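To gesture at why that feedback term matters, here’s a toy growth model (every number is invented for illustration, not an estimate): some fraction of the deployed fleet is reinvested into expanding production, and the result is compared against the linear no-feedback baseline.

```python
# Toy model of the feedback cycle; every number here is made up.
BASE_OUTPUT = 1_000_000    # robots/year from existing human-run factories
ROBOT_OUTPUT = 0.5         # extra robots/year of capacity per reinvested robot
REINVEST_FRACTION = 0.2    # share of the fleet working on expanding production

fleet = 0
for year in range(1, 11):
    capacity = BASE_OUTPUT + ROBOT_OUTPUT * REINVEST_FRACTION * fleet
    fleet += capacity
    linear_baseline = BASE_OUTPUT * year   # the "no speed-up" assumption
    print(f"year {year:2d}: with feedback {fleet:>13,.0f}   linear {linear_baseline:>13,.0f}")
```

Even with a small reinvestment share the gap to the linear baseline compounds every year, which is why leaving the feedback out looks like the dominant source of error.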
Or, worse, if most directions are net negative and you have to try quite hard to find one which is positive, almost everyone optimizing for magnitude will end up doing harm in proportion to how hard they optimize for magnitude.
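As a toy illustration of that claim (the distribution and numbers below are invented, not an empirical estimate): if, say, ~80% of directions are mildly harmful per unit of effort, the average impact of people who pick directions more or less at random is negative, and it scales linearly with how much magnitude they add.

```python
import random

random.seed(0)

# Invented distribution: ~80% of directions are mildly harmful per unit of
# effort, ~20% are helpful.
def direction_value():
    if random.random() < 0.8:
        return random.uniform(-1.0, 0.0)
    return random.uniform(0.0, 1.5)

def average_impact(magnitude, people=100_000):
    return sum(magnitude * direction_value() for _ in range(people)) / people

for magnitude in (1, 10, 100):
    print(f"effort {magnitude:3d}: average impact {average_impact(magnitude):9.2f}")
```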
Yeah, this seems like probably a good idea, though some of these would be best on existing resource pages, like the funders list.
This is, imo, one of the most impactful software roles in the AI safety ecosystem, if not the most. I think the AI safety funding ecosystem’s significant challenges are most likely to be addressed by scaling up the s-process.
I’ve passed this on to some of the best SWEs I know, some of whom also manage a lot of other great devs.
Added something to the TL;DR footnote covering this.
It’s a pretty straightforward modification of the Caplan thruster. You scoop up bits of the sun with very strong magnetic fields, but rather than fusing that material and using it to move the star, you cool most of it (firing some back at very high velocity to balance things momentum-wise) and keep the matter you extract (or fuse some if you need quick energy). There’s even a video on it! Skip to 4:20 for the relevant bit.
This feels very related to a section I didn’t write for the post (because it was getting too long) about how to “quote” claims about the other person’s self-model in a way which defuses conflict while leaving you with a wider range of conversational motion. Basically, by saying e.g.
“I have a story that you’re angry with me”
rather than
“You’re angry with me”
the other person can accept your statement into their conversational stack safely, even if they’re not angry. Another person thinking you’re angry while you’re not angry is totally compatible as a model, but you being angry while you’re not angry is not. So if you try to include their mental object directly, it fires a crapton of error messages for colliding predictive models.
This seems worth taking a position on; the relevant people need to hear from the experts an unfiltered stance of “this is a real and perhaps very likely risk”.