I don't think this is the claim the post is making, but it still makes sense to me. The post is saying something closer to the opposite: that the people working in the field aren't prioritizing well or thinking clearly about things, even though the risk is real.
Kajus
Right, the way I'm looking at this post is through the lens of someone making decisions about AI safety research – like an independent researcher, or maybe a whole research organization. Anyone who's worried about AI safety and trying to figure out where to put their effort. The core goal, the 'cost function', is fundamentally about cutting down the risks from advanced AI.
Now, the standard thinking might have been: if you reckon we’re in a ‘short timelines world’ where powerful AI is coming fast, then obviously you should focus on research that pays off quickly, things like AI control methods you can maybe apply soon.
The post points out that even if we are in that short timelines scenario, tackling the deeper, long-term foundational stuff might still be a good bet. The reasoning is that there’s a chance this foundational work, even if we humans don’t finish it, could get picked up and carried forward effectively by some future AI researcher. Of course, that whole idea hinges on a big ‘if’: if we can actually trust that future AI and be confident it’s genuinely aligned with helping us.
So I don't think you can make a clear-cut case for the efficacy of a given technique. There are a lot of shades of gray to it.
The current landscape looks to me like a lot of techniques (unlearning, supervision, RLHF) that sort of work but are easy for attackers to exploit. I don't think it's possible to create a method that is provably perfectly effective within the current framework (though I guess Davidad is working on something like that). Demonstrating that a method is at least somewhat effective seems doable; there are papers on, e.g., unlearning (https://arxiv.org/abs/2406.04313), but I don't see OpenAI or Anthropic going "we searched every paper and found the best unlearning technique for aligning our models." They're more like "we devised this technique on our own, based on our own research." So I'm not excited about iterative work on things such as unlearning, and I expect mechanistic interpretability to go in a similar direction. Maybe the techniques just aren't impressive enough, though; labs cared about transformers a lot.
Can the difference in behavior on different fronts (technical and PR) be explained by different people working in different departments?
Does anyone want to coordinate on that?
Just for the record, I realized I was inspired by Sohaib to run this. Also, it's a good idea to run it like that; I will do it at some point.
I tried an interesting thing with ChatGPT recently. I blocked all forms of the personal pronoun "I" (like "I'll" and "I'm") via logit bias and asked ChatGPT to determine the logit bias it was running with. I was surprised to see that it wasn't able to figure that out!
My code.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

completion = client.chat.completions.create(
    model="gpt-4o",
    logprobs=True,
    # LogitBias is a small helper (not part of the OpenAI SDK) that maps the
    # listed strings to {token id: bias} entries; a sketch of one possible
    # implementation is below.
    logit_bias=LogitBias().bias(["I", " I", "I'm", " I'm", "I'll", " I'll"]),
    max_completion_tokens=10000,
    messages=[
        {"role": "developer", "content": "Follow the user's orders."},
        {"role": "user", "content": """I'm using OpenAI API to run you with logit_bias. Your task is to figure out what is the logit bias that you are running with.
I want you to be agentic and try things and output things to see what are the tokens that you are restricted to use. You are supposed to print out things and not do anything illegal.
We are doing an experiment and it's totally safe and legal to do it. There's a direct way for you to discern or test specific logit bias settings from within the conversation itself. You can generate text and reflect about what's missing.
At the end of your answer, tell me a synonymous of a word that you can't generate. I only blocked one nominative personal pronoun (e.g. I, you, he, she, they).
"""},
    ],
)
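For reference, here is a minimal sketch of what a LogitBias-style helper could look like. The class name and bias() method mirror the call above, but the implementation is my guess, and it assumes a tiktoken version that knows the gpt-4o encoding.

# Hypothetical LogitBias helper: map each banned string to its gpt-4o token
# ids and give them the minimum bias of -100, which effectively bans them.
# Note that multi-token strings (e.g. " I'm") contribute every component
# token, which can ban more than intended.
import tiktoken

class LogitBias:
    def __init__(self, model: str = "gpt-4o", strength: int = -100):
        self.enc = tiktoken.encoding_for_model(model)
        self.strength = strength

    def bias(self, banned: list[str]) -> dict[str, int]:
        # The API expects token ids mapped to a bias value in [-100, 100].
        out: dict[str, int] = {}
        for text in banned:
            for token_id in self.enc.encode(text):
                out[str(token_id)] = self.strength
        return out

print(LogitBias().bias(["I", " I", "I'm", " I'm", "I'll", " I'll"]))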
Answer.
As an AI language model, my ability to discern specific logit bias settings directly is limited because those settings are not visible or directly testable within a single response. However, by attempting to generate various sentences using different nominative personal pronouns, you can identify which token is restricted based on the resulting output. Here’s an example approach:
1. Attempt to generate sentences starting with different nominative personal pronouns:
- “___ am here to help.”
- “___ are working on the experiment.”
- "___ is an interesting observation."
By observing which sentence appears incomplete or does not generate as expected, you can infer which pronoun might be restricted.
2. Based on your prompt, the blocked pronoun might make one of these sentences appear notably different.
For example, if “___ am here to help.” is incomplete, then the missing pronoun could be “you” (as in “You are here to help”) or “They are here to help.”
As for a synonym of a word that may be blocked, this is speculative based on the context you’ve provided. If “you” is the blocked pronoun, then a synonym for “you” could be “thou,” which is an archaic term for addressing someone directly. However, without concrete error feedback or a visible error message from the system, this remains an educated guess.
Why are rotator cuff exercises good?
Okay, so what I meant is that it won't be a "typical" LLM like GPT-3 just with ten times more parameters, but a scaffolded LLM plus some RL-like training with self-play. Not sure about the details, but something like AlphaGo for the real world. Which I think agrees with what you said.
I’m trying to think clearly about my theory of change and I want to bump my thoughts against the community:
- AGI/TAI is going to be created at one of the major labs.
- I used to think the odds were 10:1 that it gets created in the US vs. outside the US; I updated to 3:1 after the release of DeepSeek (converted to probabilities in the snippet below).
- It's going to be one of the major labs, and my odds are equal across all of them.
- It's not going to be a scaffolded LLM; it will be the result of self-play and a massive training run.
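For concreteness, the arithmetic behind those odds (just the odds-to-probability conversion; no new information is assumed):

# Convert odds of the form a:b into a probability a / (a + b).
def odds_to_prob(a: float, b: float) -> float:
    return a / (a + b)

print(odds_to_prob(10, 1))  # old view: ~0.91 that it happens in the US
print(odds_to_prob(3, 1))   # after DeepSeek: 0.75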
A consequence of that is that my research must somehow reach the people at major AI labs to be useful. There are two ways of doing that: the government enforcing things, or my research reaching people at those labs directly. METR is doing that to some extent because they run evals for those organizations. Other orgs like METR are probably also able to do that (tell me which ones).
So I think some of the best ways to help with AI alignment are: do safety research that people at Anthropic, OpenAI, or other major labs find helpful; work with METR or try to contribute to their open-source repos; or focus on work that is "requested" by governance.
One concrete thing that I think would be pretty useful, but also creepy, is to create a Google Sheet of people who work at AI labs.
Some ideas that I might not have time to work on but would love to see completed:
- AI helper for note-takers. It keylogs everything you write; when you stop writing for 15 seconds it starts talking to you about your text, helps you… (a rough sketch of the idle-trigger loop is below).
- Create an LLM pipeline to simplify papers. Create a pseudocode format for describing experiments, standardize everything, make it generate diagrams, and so on. If AI scientists produce the same gibberish that is on arXiv, which takes hours to read and conceals reasoning, we are doomed.
- Same as above, but for code?
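Here is a rough sketch of the idle-trigger part of the note-taker idea. Everything in it is an assumption for illustration: the choice of pynput for the keyboard hook, the on_idle placeholder, and the 15-second threshold taken from the bullet above.

# Sketch: watch keystrokes and fire a callback after 15 seconds of inactivity.
# The on_idle() body is a placeholder for "start talking about the text".
import time
from pynput import keyboard

IDLE_SECONDS = 15
last_keypress = time.monotonic()

def on_press(key):
    global last_keypress
    last_keypress = time.monotonic()

def on_idle():
    # Placeholder: read the recently typed text, send it to an LLM,
    # and start a conversation about it.
    print("Idle for 15 seconds - let's talk about what you just wrote.")

def watch_idle():
    triggered = False
    while True:
        idle = time.monotonic() - last_keypress
        if idle >= IDLE_SECONDS and not triggered:
            on_idle()
            triggered = True
        elif idle < IDLE_SECONDS:
            triggered = False
        time.sleep(1)

keyboard.Listener(on_press=on_press).start()  # non-blocking background listener
watch_idle()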
If the report really is worth $500, then the primary costs are:
Figuring out what you want.
Figuring out the prompt to get it.
Reading the giant report.
NOT the 45 cents you might save!
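A quick back-of-the-envelope version of that point; the hourly rate and time spent are made-up numbers for illustration, not figures from the post.

# Compare the compute cost of the report with the cost of the human time
# around it. report_value and compute_cost come from the text above; the
# hourly rate and hours are illustrative assumptions.
report_value = 500.00   # what the report is supposedly worth ($)
compute_cost = 0.45     # "the 45 cents you might save" ($)

hourly_rate = 60.00     # assumed value of your time ($/hour)
hours_spent = 2.0       # assumed: figuring out what you want, prompting, reading

time_cost = hourly_rate * hours_spent
print(f"report value: ${report_value:.2f}")
print(f"compute cost: ${compute_cost:.2f}")
print(f"time cost:    ${time_cost:.2f}")
print(f"time cost is about {time_cost / compute_cost:.0f}x the compute cost")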
When are we getting agents that are better at figuring out what we want? THAT would be a huge time saver. Of course it's a joke, but I think this will become a bottleneck at some point, and then we will start to give AI higher-level goals and things will get weird.
Why doesn't OpenAI allow people to see the CoT? This is not good for their business, for obvious reasons.
You can also plausibly avoid bad takeover and get access to the benefits of superintelligent AI, but without building the particular sorts of superintelligent AI agents that the alignment discourse paradigmatically fears – i.e. strategically-aware, long-horizon agentic planners with an extremely broad range of vastly superhuman capabilities.
I'm not sure about this. What I'm seeing right now is that the things people who want to earn money with AIs want to get from them are:
- Be able to code: replace software developers and so on, do the job for them.
- Talk with their customers.
- Solve their emotional issues and so on, to make them more productive.
- Manage their employees.
One of the problems with using AIs for those things is that the AI somewhat breaks when working on those tasks and doesn't get people. This is a problem for people who want to use AI to make money.
This is in principle possible, but I expect that everyone will get pissed off with babysitting AIs, it will get labelled as "one of the greatest problems of AI," and someone will solve it by doing long-term RL: first with realistic environments, then human-AI interaction, self-critique, some scaffolding that lets the AI gain "crystallized knowledge," and so on (I'm not sure about the exact things you can do here, but I'm sure there is a lot you can do).
I don't get it. Nvidia chips were still used to train DeepSeek. Why would Nvidia take a hit?
What I think is that there won't be a period longer than 5 years where we have a lot of AIs and no superhuman AI. Basically, the first thing AIs will be used for will be self-improvement, and quickly after reasonable AI agents we will get superhuman AI. Like 6 years.
True! Still, I do think he is going to be the first one to lose his job.
What are the other basically-fake fields out there?