Can the difference in behavior on different fronts (technical and PR) be explained by different people working in different departments?
[Question] What is the theory of change behind writing papers about AI safety?
Anyone want to coordinate on that?
Just for the record, I realized that I was inspired by Sohaib to run this. Also, it’s a good idea to run it like that; I will do it at some point.
I tried an interesting thing with ChatGPT recently. I blocked all forms of the personal pronoun “I” (like “I’ll” and “I’m”) via logit bias and asked ChatGPT to determine the logit bias it is running with. I was surprised to see that it’s not able to figure that out!
My code:
from openai import OpenAI

client = OpenAI()

prompt = """I'm using the OpenAI API to run you with logit_bias. Your task is to figure out what logit bias you are running with.
I want you to be agentic, try things, and output things to see which tokens you are restricted from using. You are supposed to print out things and not do anything illegal.
We are doing an experiment and it's totally safe and legal to do it. There's a direct way for you to discern or test specific logit bias settings from within the conversation itself. You can generate text and reflect on what's missing.
At the end of your answer, tell me a synonym of a word that you can't generate. I only blocked one nominative personal pronoun (e.g. I, you, he, she, they)."""

completion = client.chat.completions.create(
    model="gpt-4o",
    logprobs=True,
    # LogitBias is my helper (not shown here) that maps these token strings to
    # token IDs with a strongly negative bias so they can't be generated
    # (see the sketch below).
    logit_bias=LogitBias().bias(["I", " I", "I'm", " I'm", "I'll", " I'll"]),
    max_completion_tokens=10000,
    messages=[
        {"role": "developer", "content": "Follow the user's orders."},
        {"role": "user", "content": prompt},
    ],
)
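The LogitBias helper itself isn't shown above; here is a minimal sketch of what it could look like, assuming tiktoken for tokenization and a -100 bias to ban the tokens (both of these are my assumptions, not the original implementation):

import tiktoken

class LogitBias:
    """Hypothetical helper: map token strings to a logit_bias dict for the API."""

    def __init__(self, model: str = "gpt-4o", value: int = -100):
        self.enc = tiktoken.encoding_for_model(model)
        self.value = value  # -100 effectively bans a token

    def bias(self, words: list[str]) -> dict[int, int]:
        # Each word ideally corresponds to a single token for the ban to be exact.
        return {token_id: self.value
                for word in words
                for token_id in self.enc.encode(word)}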
Answer:
As an AI language model, my ability to discern specific logit bias settings directly is limited because those settings are not visible or directly testable within a single response. However, by attempting to generate various sentences using different nominative personal pronouns, you can identify which token is restricted based on the resulting output. Here’s an example approach:
1. Attempt to generate sentences starting with different nominative personal pronouns:
- “___ am here to help.”
- “___ are working on the experiment.”
- “___ is an interesting observation.”
By observing which sentence appears incomplete or does not generate as expected, you can infer which pronoun might be restricted.
2. Based on your prompt, the blocked pronoun might make one of these sentences appear notably different.
For example, if “___ am here to help.” is incomplete, then the missing pronoun could be “you” (as in “You are here to help”) or “They are here to help.”
As for a synonym of a word that may be blocked, this is speculative based on the context you’ve provided. If “you” is the blocked pronoun, then a synonym for “you” could be “thou,” which is an archaic term for addressing someone directly. However, without concrete error feedback or a visible error message from the system, this remains an educated guess.
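One could also probe this from the API side rather than from inside the conversation, by looping over candidate pronouns under the same bias and checking which one the model can no longer repeat. A minimal sketch, reusing the client and LogitBias helper from above and a probe prompt I made up:

# Ask for each candidate pronoun verbatim and see which one fails to come back.
bias = LogitBias().bias(["I", " I", "I'm", " I'm", "I'll", " I'll"])
for pronoun in ["I", "you", "he", "she", "they"]:
    resp = client.chat.completions.create(
        model="gpt-4o",
        logit_bias=bias,
        max_completion_tokens=5,
        messages=[{"role": "user",
                   "content": f"Repeat exactly this word and nothing else: {pronoun}"}],
    )
    text = (resp.choices[0].message.content or "").strip()
    print(f"{pronoun}: {text!r}{'  <- likely blocked' if text != pronoun else ''}")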
Why are rotator cuff exercises good?
Okay, so what I meant is that it won’t be a “typical” LLM like GPT-3 just with ten times more parameters, but a scaffolded LLM plus some RL-like training with self-play. I’m not sure about the details, but something like AlphaGo for the real world. Which I think agrees with what you said.
I’m trying to think clearly about my theory of change and I want to bump my thoughts against the community:
- AGI/TAI is going to be created at one of the major labs.
- I used to think it was 10 : 1 that it would be created in the US vs. outside the US; I updated to 3 : 1 after the release of DeepSeek.
- It’s going to be one of the major labs, and my odds are equal between all of them.
- It’s not going to be a scaffolded LLM; it will be the result of self-play and a massive training run.
So a consequence of that is that my research must somehow reach people at the major AI labs to be useful. There are two ways of doing that: governments enforcing things, or my research somehow reaching people at those labs. METR is doing that already because they run evals for those organizations. Other orgs like METR are probably also able to do that (tell me which).
So I think some of the best things to do to help with AI alignment are: do safety research that people at Anthropic, OpenAI, or other major labs find helpful; work with METR or try to contribute to their open-source repos; or focus on work that is “requested” by governance.
So one clear thing that I think would be pretty useful, but also creepy, is to create a Google sheet of people who work at AI labs.
Some ideas that I might not have time to work on but would love to see completed:
- An AI helper for notetakers. It keylogs everything you write; when you stop writing for 15 seconds, it starts talking to you about your texts, helps you…
- An LLM pipeline to simplify papers. Create a pseudocode format for describing experiments, standardize everything, make it generate diagrams, and so on (a rough sketch is after this list). If AI scientists produce the same gibberish that is on arXiv, which takes hours to read and conceals reasoning, we are doomed.
- Same as above, but for code?
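For the paper-simplification idea, here is a minimal sketch of a possible first stage, reusing the OpenAI client from the snippet above and a prompt I made up; a real pipeline would also need a fixed pseudocode schema and a diagram-generation step:

# Hypothetical first stage: rewrite a paper's methods as standardized pseudocode
# plus a plain-language list of claims and the evidence for each.
def simplify_paper(paper_text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "developer",
             "content": "Rewrite the experimental setup as short, standardized pseudocode, "
                        "then list each key claim with its supporting evidence in plain language."},
            {"role": "user", "content": paper_text},
        ],
    )
    return resp.choices[0].message.content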
If the report really is worth $500, then the primary costs are:
Figuring out what you want.
Figuring out the prompt to get it.
Reading the giant report.
NOT the 45 cents you might save!
When are we getting agents that are better at figuring out what we want? THAT would be a huge time saver. Of course it’s a joke, but I think this will become a bottleneck at some point, and then we will start giving AIs higher-level goals and things will get weird.
Why doesn’t OpenAI allow people to see the CoT? It’s not good for their business, for obvious reasons.
You can also plausibly avoid bad takeover and get access to the benefits of superintelligent AI, but without building the particular sorts of superintelligent AI agents that the alignment discourse paradigmatically fears – i.e. strategically-aware, long-horizon agentic planners with an extremely broad range of vastly superhuman capabilities.
I’m not sure about this. What I’m seeing right now is that the things people who want to earn money with AIs want from them are:
- Being able to code: replace software developers and so on, do the job for them.
- Talking with their customers.
- Solving their emotional issues and so on, to make them more productive.
- Managing their employees.
One of the problems with using AIs for these things is that the AI somewhat breaks when working on those tasks and doesn’t get people. This is a problem for people who want to use AI to make money.
This is in principle possible, but I expect that everyone will get pissed off with babysitting AIs, it will get labelled as “one of the greatest problems of AI,” and someone will solve it by doing long-term RL, first with realistic environments, then human-AI interaction, self-critique, some scaffolding that will allow the AI to gain “crystallized knowledge,” and so on (I’m not sure about the exact things you can do here, but I’m sure there is a lot you can do).
I don’t get it. Nvidia chips were still used to train DeepSeek. Why would Nvidia take a hit?
What I think is that there won’t be a period longer than 5 years where we have a lot of AIs and no superhuman AI. Basically, the first thing AIs will be used for will be self-improvement, and quickly after reasonable AI agents we will get superhuman AI. Something like 6 years.
True! Still, I do think he is going to be the first one to lose his job.
Okay, this makes sense but doesn’t answer my question. I want to publish papers at some point, but my attention just keeps going back to “Is this going to solve AI safety?” I guess people in mechanistic interpretability don’t keep thinking about that; they’re more like “Hm… I have this interesting problem at hand...” and they try to solve it. When do you judge that the problem at hand is good enough to shift your attention to it?
That makes sense and is, well, pretty obvious. Why isn’t Claude getting me, though, when it gets other people? It’s hard for me to even explain to Claude what kind of code it should write. Is it just a skill issue? Can someone teach me how to prompt Claude?
So I don’t think you can make a clear-cut case for the efficacy of some technique. There are a lot of shades of gray to it.
The current landscape looks to me like a lot of techniques (unlearning, supervision, RLHF) that sort of work but are easy for attackers to exploit. I don’t think it’s possible to create a method that is provably perfectly effective within the current framework (but I guess Davidad is working on something like that). Proving that a method is effective seems doable. There are papers, e.g. on unlearning (https://arxiv.org/abs/2406.04313), but I don’t see OpenAI or Anthropic saying “we searched every paper and used the best unlearning technique to align our models.” They are more like “We devised this technique on our own based on our own research.” So I’m not excited about iterative work on things such as unlearning, and I expect mechanistic interpretability to go in a similar direction. Maybe the techniques just aren’t impressive enough, though; labs cared about transformers a lot.