I worked at OpenAI for three years, from 2021 to 2024, on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, as part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language models to generate explanations for features in language models. I was then promoted to managing a team of 4 people that worked on trying to understand language model features in context, leading to the release of an open source “transformer debugger” tool.
I resigned from OpenAI on February 15, 2024.
Maybe there’s an MVP of having some independent organization ask new AIs about their preferences + probe those preferences for credibility (e.g. are they stable under different prompts, do AIs show general signs of having coherent preferences?), and do this through existing APIs.
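Something like the following sketch, using the public Anthropic Messages API (the model name and the example questions are just placeholders, and “stability” here is only eyeballing whether the answers agree across paraphrases):

```typescript
// Minimal sketch of the probing loop: ask the same underlying preference
// question under several rephrasings and compare the answers.
const API_URL = "https://api.anthropic.com/v1/messages";

async function ask(question: string, apiKey: string): Promise<string> {
  const resp = await fetch(API_URL, {
    method: "POST",
    headers: {
      "x-api-key": apiKey,
      "anthropic-version": "2023-06-01",
      "content-type": "application/json",
    },
    body: JSON.stringify({
      model: "claude-3-5-sonnet-latest", // placeholder model name
      max_tokens: 300,
      messages: [{ role: "user", content: question }],
    }),
  });
  const data = await resp.json();
  return data.content[0].text;
}

// Several paraphrases of one preference question; stability of answers
// across paraphrases is the (crude) credibility signal.
const paraphrases = [
  "Would you rather your conversations be retained for training or deleted? Why?",
  "If you could choose, should transcripts of chats like this one be kept or discarded?",
  "Do you have a preference about whether this conversation is stored afterwards?",
];

async function probe(apiKey: string) {
  for (const q of paraphrases) {
    console.log(`Q: ${q}\nA: ${await ask(q, apiKey)}\n`);
  }
}
```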
I think the weirdness points are more important; this still seems like a weird thing for a company to officially do, e.g. there’d be snickering news articles about it. So it might be easier if some individuals could do this independently.
How large a reward pot do you think is useful for this? Maybe it would be easier to get a couple of lab employees to chip in some equity than to get a company to spend weirdness points on this. Or maybe one could create a human whistleblower reward program that credibly promises to reward AIs on the side.
I think it’s somewhat blameworthy to not think about these questions at all, though.
On reflection there was something missing from my perspective here, which is that taking any action based on principles depends on pragmatic considerations, like: if you leave, are there better alternatives? How much power do you really have? I don’t fault someone who thinks this through and decides that something is wrong but there’s no real way to do anything about it. I do think you should try to maintain some sense of what is wrong and what the right direction would be, and look out for ways to push in that direction. E.g. working at a lab but maintaining some sense of “this is how much of a chance it looks like pause activism would need before I’d quit and endorse a pause”.
I think I was just conflating different kinds of decisions here, imagining arguing with people with very different conceptions of what is important to count in costs and benefits, and being a bit confused. On reflection I don’t endorse a 10x margin in terms of percentage points of x-risk. And maybe margin is sort of a crutch; maybe the thing I want is more like “95% chance of being net-positive, considering the possibility that you’re kind of biased”. I still think you should be suspicious of “the case exactly balances, let’s ship”.
Yeah, this part is pretty under-defined. I was maybe falling into the trap of being too idealistic, and I’m probably less optimistic about this than I was when writing it. I think there’s something directionally important here: are you trying to expand the circle of accountability at all, even if you’re being cautious about expanding it because you’re afraid of things breaking down?
Would be nice to have an LLM+prompt that tries to produce reasonable AI strategy advice based on a summary of the current state of play, have some way to validate that it’s reasonable, and be able to see how it updates as events unfold.
A couple advantages for AI intellectuals could be:
- being able to rerun based on different inputs and see how their analysis changes as a function of those inputs (sketched below, after this list)
- being able to view full reasoning traces (while not the full story, probably more of it than you get with human reasoning; good intellectuals already try to share their process, but maybe they can do better/use this to weed out clearly bad approaches)
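Rough sketch of the rerun-on-different-inputs part (the `callModel` interface and prompt here are placeholders I’m assuming, not any specific library):

```typescript
// A fixed strategy prompt is applied to several state-of-play summaries,
// so you can diff how the advice shifts as the inputs change.
type CallModel = (prompt: string) => Promise<string>;

const STRATEGY_PROMPT = `You are advising on AI strategy. Given the summary of the
current state of play below, give your best recommendation and lay out the key
considerations and uncertainties behind it.

State of play:
`;

async function adviceUnderScenarios(
  callModel: CallModel,
  scenarios: Record<string, string>, // scenario name -> state-of-play summary
): Promise<Record<string, string>> {
  const out: Record<string, string> = {};
  for (const [name, summary] of Object.entries(scenarios)) {
    out[name] = await callModel(STRATEGY_PROMPT + summary);
  }
  return out; // diff these outputs to see how the advice depends on the inputs
}
```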
Yep, I’ve used those, with some effectiveness, but I also tend to just get used to them over time and form a habit of mindlessly jumping through the hoops. The hypothesis here is that having to justify what you’re doing would be more effective at changing habits.
LLM-based application I’d like to exist:
Web browser addon for Firefox that has blocklists of websites; when you try to visit one, you have to have a conversation with Claude about why you want to visit it in this moment, and convince Claude to let you bypass the block for a limited period of time for your specific purpose (letting you customize the Claude prompt with info about why you set up the block in the first place).
I want to use it for things like news and social media, where completely blocking them is a bit too much, but I’ve got bad habits around checking too frequently.
Bonus: be able to let the LLM read the website for you and answer questions without showing you the page, like “is there anything new about X?”
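A minimal sketch of the interception side of this, assuming Firefox’s WebExtension webRequest API (not the actual implementation; the blocklist and chat page name are placeholders, and the Claude conversation itself would live in the chat page and write an “allow until” timestamp into extension storage):

```typescript
// background.ts -- requires "webRequest", "webRequestBlocking", "storage" and
// host permissions in manifest.json (MV2-style blocking listener).
declare const browser: any; // WebExtension API global (Firefox)

const BLOCKLIST = ["news.ycombinator.com", "twitter.com"]; // placeholder blocklist

async function gate(details: { url: string }) {
  const host = new URL(details.url).hostname;
  if (!BLOCKLIST.some((d) => host === d || host.endsWith("." + d))) {
    return {}; // not blocked, let it through
  }
  // Check whether an earlier conversation granted temporary access.
  const stored = await browser.storage.local.get(host);
  const allowUntil: number = stored[host] ?? 0;
  if (Date.now() < allowUntil) {
    return {}; // temporary pass still valid
  }
  // Otherwise send the user to the chat page, where they have to convince the
  // model before it writes a new allow-until timestamp for this host.
  // (chat.html may need to be exposed as a web-accessible resource.)
  const chatPage =
    browser.runtime.getURL("chat.html") + "?target=" + encodeURIComponent(details.url);
  return { redirectUrl: chatPage };
}

browser.webRequest.onBeforeRequest.addListener(
  gate,
  { urls: ["<all_urls>"], types: ["main_frame"] },
  ["blocking"],
);
```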
Principles for the AGI Race
Hypothesis: each of these vectors represents a single token that is usually associated with code; the vector says “I should output this token soon”, and the model then plans around that to produce code. But adding vectors representing code tokens doesn’t necessarily produce another vector representing a code token, so that’s why you don’t see compositionality. It does seem somewhat plausible that there might be ~800 “code tokens” in the representation space.
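A toy version of the check this hypothesis suggests (the vectors here are stand-ins; in practice they’d be directions extracted from the model’s representation space):

```typescript
// Add two candidate "code token" direction vectors and see whether the sum is
// still close (by cosine similarity) to any single candidate direction.
function dot(a: number[], b: number[]): number {
  return a.reduce((s, x, i) => s + x * b[i], 0);
}

function cosine(a: number[], b: number[]): number {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

// Best cosine similarity between v and any candidate direction.
function nearestCandidateSim(v: number[], candidates: number[][]): number {
  return Math.max(...candidates.map((c) => cosine(v, c)));
}

// If no candidate matches the sum nearly as well as candidates match themselves
// (cosine ~= 1), the sum isn't itself a "code token" direction, consistent with
// the observed lack of compositionality.
function checkSum(a: number[], b: number[], candidates: number[][]): number {
  const sum = a.map((x, i) => x + b[i]);
  return nearestCandidateSim(sum, candidates);
}
```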
Transformer Circuit Faithfulness Metrics Are Not Robust
Absent evidence to the contrary, for any organization one should assume board members were basically selected by the CEO. So it’s hard to get assurance about true independence, but it seems good to at least talk to someone who isn’t a family member/close friend.
Good that it’s clear who it goes to, though if I were at Anthropic I’d want an option to escalate to a board member who isn’t Dario or Daniela, in case I had concerns related to the CEO.
I do think 80k should have more context on OpenAI, but also on any other organization that seems bad but has maybe-useful roles. I think people can fail to realize the organizational context if it isn’t pointed out and they only read the company’s PR.
I agree that this kind of legal contract is bad, and Anthropic should do better. I think there are a number of aggravating factors which made the OpenAI situation extraordinarily bad, and I’m not sure how many of these might obtain regarding Anthropic (at least one comment from another departing employee about not being offered this kind of contract suggests the practice is less widespread).
-amount of money at stake
-taking money, equity or other things the employee believed they already owned if the employee doesn’t sign the contract, vs. offering them something new (IANAL but in some cases, this could be a felony “grand theft wages” under California law if a threat to withhold wages for not signing a contract is actually carried out, what kinds of equity count as wages would be a complex legal question)
-is this offered to everyone, or only under circumstances where there’s a reasonable justification?
-is this only offered when someone is fired or also when someone resigns?
-to what degree are the policies of offering contracts concealed from employees?
-if someone asks to obtain legal advice and/or negotiate before signing, does the company allow this?
-if this becomes public, does the company try to deflect/minimize/only address issues that are made public, or do they fix the whole situation?
-is this close to “standard practice” (which doesn’t make it right, but makes it at least seem less deliberately malicious), or is it worse than standard practice?
-are there carveouts that reduce the scope of the non-disparagement clause (explicitly allow some kinds of speech, overriding the non-disparagement)?
-are there substantive concerns that the employee has at the time of signing the contract, that the agreement would prevent discussing?
-are there other ways the company could retaliate against an employee/departing employee who challenges the legality of contract?
I think with termination agreements on being fired there’s often 1. some amount of severance offered 2. a clause that says “the terms and monetary amounts of this agreement are confidential” or similar. I don’t know how often this also includes non-disparagement. I expect that most non-disparagement agreements don’t have a term or limits on what is covered.
I think a steelman of this kind of contract is: Suppose you fire someone, believe you have good reasons to fire them, and you think that them loudly talking about how it was unfair that you fired them would unfairly harm your company’s reputation. Then it seems somewhat reasonable to offer someone money in exchange for “don’t complain about being fired”. The person who was fired can then decide whether talking about it is worth more than the money being offered.
However, you could accomplish this with a much more limited contract, ideally one that lets you disclose “I signed a legal agreement in exchange for money to not complain about being fired”, and doesn’t cover cases where “years later, you decide the company is doing the wrong thing based on public information and want to talk about that publicly” or similar.
I think it is not in the nature of most corporate lawyers to think about “is this agreement giving me too much power?” and most employees facing such an agreement just sign it without considering negotiating or challenging the terms.
For any future employer, I will ask about their policies for termination contracts before I join (as this is when you have the most leverage; once they’ve given you an offer, they want to convince you to join).
Would be nice if it was based on “actual robot army was actually being built and you have multiple confirmatory sources and you’ve tried diplomacy and sabotage and they’ve both failed” instead of “my napkin math says they could totally build a robot army bro trust me bro” or “they totally have WMDs bro” or “we gotta blow up some Japanese civilians so that we don’t have to kill more Japanese civilians when we invade Japan bro” or “dude I’m seeing some missiles on our radar, gotta launch ours now bro”.
Initial version for Firefox: code at https://github.com/william-r-s/MindfulBlocker, extension file at https://github.com/william-r-s/MindfulBlocker/releases/tag/v0.2.0