AI grantmaking at Open Philanthropy.
I used to give careers advice for 80,000 hours.
AI grantmaking at Open Philanthropy.
I used to give careers advice for 80,000 hours.
Thanks for writing this up! I’ve found this frame to be a really useful way of thinking about GPT-like models since first discussing it.
In terms of future work, I was surprised to see the apparent low priority of discussing pre-trained simulators that were then modified by RLHF (buried in the ‘other methods’ section of ‘Novel methods of process/agent specification’). Please consider this comment a vote for you to write more on this! Discussion seems especially important given e.g. OpenAI’s current plans. My understanding is that Conjecture is overall very negative on RLHF, but that makes it seem more useful to discuss how to model the results of the approach, not less, to the extent that you expect this framing to help shed light what might go wrong.
It feels like there are a few different ways you could sketch out how you might expect this kind of training to go. Quick, clearly non-exhaustive thoughts below:
Something that seems relatively benign/unexciting—fine tuning increases the likelihood that particular simulacra are instantiated for a variety of different prompts, but doesn’t really change which simulacra are accessible to the simulator.
More worrying things—particular simulacra becoming more capable/agentic, simulacra unifying/trading, the simulator framing breaking down in some way.
Things which could go either way and seem very high stakes—the first example that comes to mind is fine-tuning causing an explicit representation of the reward signal to appear in the simulator, meaning that both corrigibly aligned and deceptively aligned simulacra are possible, and working out how to instantiate only the former becomes kind of the whole game.
The intuition driving this is that one model of power/intelligence I put a lot of weight on is increasing the set of actions available to you.
If I want to win at chess, one way of winning is by being great at chess, but other ways involve blackmailing my opponent to play badly, cheating, punching them in the face whenever they try to think about a move, drugging them etc.
The moment at which I become aware of these other options seems critical.
It seems possible to write a chess* environment where you also have the option to modify the rules of the game.
My first idea for how to allow this is have it be the case that specific illegal moves trigger rule changes in some circumstances.
I think this provides a pretty great analogy to expanding the scope of your action set.
There’s also some relevance to training/deployment mismatches.
If you’re teaching a language model to play the game, the specific ‘changing the rules’ actions could be included in the ‘instruction set’ for the game.
This might provide insight/the opportunity to experiment on (to flesh out in depth):
Myopia
Deception (if we select away from agents who make these illegal moves)
useful bounds on consequentialism
More specific things like, in the language models example above, whether saying ‘don’t do these things, they’re not allowed’, works better or worse than not mentioning them at all.
Interested.
AI safety level: don’t typically struggle to follow technical conversations with full time researchers, though am not a full time researcher.
Bio: last studied it 14 years ago. Vaguely aware miosis and mitosis are different but couldn’t define either without Google.
Founders Pledge’s research is the best in the game here. If you want to make a recommendation that’s for a specific charity rather than a fund, Clean Air Task Force seemed sensible every time I spoke to them, and have been around for a while.
AXRP—Excellent interviews with a variety of researchers. Daniel’s substantial own knowledge means that the questions he asks are often excellent, and the technical depth is far better than anything else that’s available in audio, given the difficulty of autoreaders on papers or the alignment forum finding it difficult to handle actual maths.
This talk from Joe Carlsmith. - Hits at several of the key ideas really directly given the time and technical background constraints. Like Rob’s videos, implies an obvious next step for people interested in learning more, or who are suspicious of one of the claims (reading Joe’s actual report, maybe even the extensive discussion of it on here).
The Alignment Problem—Easily accessible, well written and full of interesting facts about the development of ML. Unfortunately somewhat light on actual AI x-risk, but in many cases is enough to encourage people to learn more.
Edit: Someone strong-downvoted this, I’d find it pretty useful to know why. To be clear, by ‘why’ I mean ‘why does this rec seem bad’, rather than ‘why downvote’. If it’s the lightness on x-risk stuff I mentioned, this is useful to know, if my description seems inaccurate, this is very useful for me to know, given that I am in a position to recommend books relatively often. Happy for the reasoning to be via DM if that’s easier for any reason.
Rob Miles’s youtube channel, see this intro. Also his video on the stop button problem for Computerphile.
- Easily accessible, entertaining, videos are low cost for many people to watch, and they often end up watching several.
Both 80,000hours and AI Safety Support are keen to offer personalised advice to people facing a career decision and interested in working on alignment (and in 80k’s case, also many other problems).
Noting a conflict of interest—I work for 80,000 hours and know of but haven’t used AISS. This post is in a personal capacity, I’m just flagging publicly available information rather than giving an insider take.
I love this idea mostly because it would hugely improve screen reader options for alignment research.
Some initial investigation here, along with a response from the author of the original claim.
This recent discovery about DALLE-2 seems like it might provide interesting ideas for experiments in this vein.
Yes, https://metaculusextras.com/points_per_question
It has its own problems in terms of judging ability. But it does exist.
I think both of those would probably help but expect that the concept graph is very big, especially if you want people to be able to use the process recursively.
There’s also value in the workflow being smooth, and this task is sandwiched between two things which seem very useful (and quite straightforward) to automate with an LLM:
concept extraction
search for and summarise explainer papers/articles
I can however imagine a good wiki with great graph style UX navigation and expandable definitions/paper links solving the last two problems, with then only concept extraction being automated by Elicit, though even in this case initially populating the graph/wiki might be best done using automation of the type described above. It’s much easier to maintain something which already exists.
Make it as easy as possible to generate alignment forum posts and comments.
The rough idea here is that it’s much easier to explain an idea out loud, especially to someone who occasionally asks for clarification or for you to repeat an idea, than it is to write a clear, concise post on it. Most of the design of this would be small bits of frontend engineering, but language model capability would be useful, and several of the capabilities are things that Ought is already working on. Ideally, interacting with the tool looks like:
Researcher talks through the thing they’re thinking about. Model transcribes ideas[1], suggests splits into paragraphs[2], suggests section headings [3], generates a high level summary/abstract [4]. If researcher says “[name of model] I’m stuck”, the response is “What are you stuck on?”, and simple replies/suggestions are generated by something like this[5].
Once the researcher has talked through the ideas, they are presented with a piece which contains an abstract at the top, then a series of headed sections, each with paragraphs which rather than containing what they said at that point verbatim, contain clear and concise summaries[6] of what was actually said. Clicking on any generated heading allows the user to select from a list of generated alternatives, or write their own[7], while clicking on any paragraph allows the user to see and select from a list of other generated summaries, the verbatim transcription, and to write their own version of this paragraph.
1 can probably be achieved by just buying an off the shelf transcription bot (though you could train one if you wanted), with the most important criterion being speed. 2-4 can have data trivially generated by scraping the entire alignment forum and removing headings/summaries/abstracts/paragraph breaks. 5 I’ve generated data for below. An MVP for generating data for 6 is using the transcription software from 1 to autotranscribe AXRP and then comparing to the human-edited summary, though I think suggesting clear rephrasings (which I’ll call 6.5) might require a seperate task. 7 is just frontend design, which I suspect is doable in-house by Ought.
There’s a (set of) experiments I’d be keen to see done in this vein, which I think might produce interesting insights.
How well do capabilites generalise across languages?
Stuart Armstrong recently posted this example of GPT-3 failing to generalise to reversed text. The most natural interpretation, at least in my mind, and which is pointed at by a couple of comments, is that the problem here is that there’s just been very little training data which contains things like:
m’I gnitirw ffuts ni esrever tub siht si yllautca ytterp erar
(I’m writing stuff in reverse but this is actually pretty rare)
Especially with translations underneath. In particular, there hasn’t been enough data to relate the ~token ‘ffuts’ to ‘stuff’. These are just two different things which have been encoded somewhere, one of them tends to appear near english words, the other tends to appear near other rare things like ‘ekil’.
It seems that how much of a capability hit language models take when trying to work in ‘backwards writing’, as well as other edited systems like pig latin or simple cyphers, and how much fine tuning they would take to restore the same capabilities as English, may provide a few interesting insights into model ‘theory of mind’.
The central thing I’m interested in here is trying to identify differences between cases where a LLM is both modelling a situation in some sort of abstract way, and then translating from that situation to language output, and cases where the model is ‘only’ doing language output.
Models which have some sort of world model, and use that world model to output language, should find it much easier to capably generalise from one situation to another. They also seem meaningfully closer to agentic reasoners. There’s also an interesting question about how different models look when fine tuned here. If it is the case that there’s a ~separate ‘world model’ and ‘language model’, training the model to perform well in a different language should, if done well, only change the second. This may even shed light on which parts of the model are doing what, though again I just don’t know if we have any ways of representing the internals which would allow us to catch this yet.
Ideas for specific experiments:
How much does grammar matter?
Pig latin, reversed english, and simple substitution cyphers all use identical grammar to standard english. This means that generalising to these tasks can be done just by substituting each english word for a different one, without any concept mapping taking place.
Capably generalising to French, however, is substantially harder to do without a concept map. You can’t just substitute word-by-word.
How well preserved is fine tuning across ‘languages’?
Pick some task, fine tune the model until it does well on the task, then fine tune the model to use a different language (using a method that has worked in earlier experiments). How badly is performance on the task (now presented in the new language) hit?
What happens if you change the ordering—you do the same fine-tuning, but only after you’ve trained the model to speak the new language. How badly is performance hit? How much does it matter whether you do the task-specific fine tuning in english or the ‘new’ language?
In all of these cases (and variations of them), how big a difference does it make if the new language is a 1-1 substitution, compared to a full language with different grammar.
If anyone does test some of these, I’d be interested to hear the results!
I actually happen to already have taught elicit to give helpful/obvious advice (not alignment specific, but close enough given the examples were inspired by thinking that lots of the advice I give in my day job as an advisor is obvious)! You can play with it here if you have an elicit account.
Edit:
Here’s the training data
Life problem
I need to think of a research topic but I’ve only had two ideas and both of them aren’t that great.
Obvious but useful advice.
Have you tried setting a 5-minute timer and, by-the-clock, just trying to write down as many ideas as possible? This can work surprisingly well!
Life problem
I’ve been putting off writing this email, and now every time I think about it I feel bad and don’t want to do anything at all, especially write the email!
Obvious but useful advice.
This seems like a pretty urgent problem to solve, as it’s one that will just keep getting worse otherwise as you get negative reinforcement when you think about it. I have a few ideas for things to try: Can you get a friend to sit with you while you write it, or even to write it for you? If you make it your number one priority, can you get it done right now? Is there a way you can not send it, for example by just emailing to say ‘sorry, can’t reply now, will explain later’?
Life problem
I’m thinking about quitting my job in finance in order to self-study ML and switch to working on alignment. How can I make the final decision?
Obvious but useful advice.
That’s an exciting decision to be making! It might be worth writing up the pros and cons of both options in a googledoc, and sharing it with some friends with comment access enabled. Getting your thoughts sorted in a way which is clear to others might be helpful itself, and then also your friends might have useful suggestions or additional considerations!
Life problem
I’m giving a talk tomorrow, but I’m worried that I’m going to be asked a question that’s difficult to answer and then say something which seems stupid if it gets taken out of context.
Obvious but useful advice.
Remember that it’s fine to say “I don’t know”, or some variant like “I’m not sure I can give a clear enough answer to that without thinking about it for longer”, when people ask you questions! You can even write this on your hand if you need to.
Life problem
I’m a bit worried about the next week, there’s a few pretty important things happening and I’ve tried to plan as best I can for them but I’m worried it’s all going to go horribly wrong.
Obvious but useful advice.
It’s ok to be worried before important things! One option you have, as long as it wouldn’t be too stressful, is writing a premortem. This is where you think about a specific thing that could go wrong, then decide whether there’s any action you can take to prevent or mitigate that scenario, then repeat. At the end of this process, you can then say to yourself something like “I’m pretty sure I’ve taken all of the actions it’s reasonable to take”, which might be reassuring. Even if it’s not immediately emotionally reassuring, if the actions were sensible, it’s now less likely things will go wrong!
Life problem
I don’t know how to go about planning my career.
Obvious but useful advice.
My suggestion would be to write down a few of the most important things you’d like your career to look like in two or three years, and for each of them write down concrete steps which you think will make progress towards achieving that. Then you can start seeing what sorts of jobs / collaborations might be feasible. If one thing that’s important to you for your career is impact, you could check out the resources on 80000hours.org, where there’s lots of helpful guidance, and you can even book a call with an advisor.
Life problem
I’m trying to solve a difficult technical problem at work, but my mind is just blank.
Obvious but useful advice.
Do you have to solve the problem on your own? Asking a colleague or friend can be really helpful. You could even write up a description of the problem in a googledoc and share it with a few people asking them to comment with ideas. If you have to solve the problem on your own, you could try imagining a specific person who you think would be able to solve the problem, and asking yourself what they would try.
Life problem
I’ve taken on too many commitments and now I’m feeling really overwhelmed by them
Obvious but useful advice.
Remember that saying “no” is fine, there are a lot of good reasons to drop commitments, including simply not having time or being too busy. You don’t need to feel guilty about making this decision now, but when you’re feeling better it might be worth working out whether you could have noticed dropped some things earlier. If you’re finding yourself saying yes too much, it might be worth setting yourself a ‘saying no to things’ goal which you share with a friend.
Life problem
I’ve got a test next week and I don’t know how best to prepare.
Obvious but useful advice.
Do you have any practice tests you can sit and then mark to work out where you went wrong? The best way to prepare for a thing is usually just trying to do that thing!
Life problem
I’m struggling to maintain focus as I work. I find myself staring into space, and not really resting, but I feel too guilty to stop and take a break.
Obvious but useful advice.
If you’re too stressed or tired to do useful work, you should stop and rest! It’s better to fully rest and gain back some energy than keep struggling when you aren’t being productive. You could also try using the pomodoro technique of working for set periods of time and taking breaks in between.
The ideal version of the task is decomposable into:
find the high level concepts in a paper (high level here meaning ‘high level background required’)
From a concept, generate the highest level prerequisite concepts
For a given concept, generate a definition/explanation (either by finding and summarising a paper/article, or just directly producing one)
The last of these tasks seems very similar to a few things Elicit is already doing or at least trying to do, so I’ll generate instances of the other two.
Identify some high-level concepts in a paper
Example 1
Input: This post by Nuno Sempere
Output: Suggestions for high level concepts
Counterfactual impact
Shapley Value
Funging
Leverage
Computability
Notes: In one sense the ‘obvious’ best suggestion for the above post is ‘Shapley value’, given that’s what the post is about, and it’s therefore the most central concept one might want to generate background on. I think I’d be fine with probably prefer the output above though, where there’s some list of <10 concepts. In a model which had some internal representation of the entirety of human knowledge, and purely selected the single thing with the most precursors, my (very uncertain) guess is that computability might be the single output produced, even though it’s non-central to the post and only appears in a footnote. That’s part of the reason why I’d be relatively happy for the output of this first task to roughly be ‘complicated vocabulary which gets used in the paper’
Example 2
Input: Eliciting Latent Knowledge by Mark Xu and Paul Christiano
Output: Suggestions for high level concepts
Latent Knowledge
Ontology
Bayesian Network
Imitative Generalisation
Regularisation
Indirect Normativity
Notes: This is actually a list of terms I noted down as I was reading the paper, so rather than ‘highest level’ it’s just ‘what Alex happened to think it was worth looking up’, but for illustrative purposes I think it’s fine.
Having been given a high-level concept, generate prerequisite concepts
Notes: I noticed when trying to generate background concepts here that in order to do so it was most useful to have the context of the post. This pushed me in the direction of thinking these concepts were harder to fully decompose than I had thought, and suggested that the input might need to be ‘[concept], as used in [paper]‘, rather than just [concept]. All of the examples below come from the examples above. In some cases, I’ve indicated what I expect a second layer of recursion might produce, though it seems possible that one might just want the model to recurse one or more times by default.
I found the process of generating examples really difficult, and am not happy with them. I notice that what I kept wanting to do was write down ‘high-level’ concepts. Understanding the entirety of a few high-level concepts is often close to sufficient to understand an idea, but it’s not usually necessary. With a smooth recursion UX (maybe by clicking), I think the ideal output almost invariably generates low-mid level concepts with the first few clicks. The advantages of this are that if the user recognises a concept they know they are done with that branch, and narrower concepts are easier to generate definitions for without recursing. Unfortunately, sometimes there are high level prerequisites which aren’t obviously going to be generated by recursing on the lower level ones. I don’ have a good solution to this yet.
Input: Shapley Value
Output:
Expected value
Weighted average
Elementary probability
Utility
Marginal contribution
Payoff
Agent
Fixed cost
Variable cost
Utility
Input: Computability
Output:
Computational problem
Computable function
Turing Machine
Computational complexity
Notes: I started recursing, quickly confirmed my hypothesis from earlier about this being by miles the thing with the most prerequisites, and deleted everything except what I had for ‘level 1’, which I also left unfinished before I got completely lost down a rabbithole.
Input: Bayesian Network
Output:
Probabilistic inference
Bayes’ Theorem
Probability distribution
Directed Acyclic Graph
Directed Graph
Graph (Discrete Mathematics)
Vertex
Edge
Cycle
Trail
Graph (Discrete Mathematics)
Vertex
Edge
Notes: Added a few more layers of recursion to demonstrate both that you probably want some kind of dynamic tree structure, and also also that not every prerequisite is equally ‘high level’.
Conclusions from trying to generate examples
This is a much harder, but much more interesting, problem than I’d originally expected. Which prerequisites seem most important, how narrowly to define them, and how much to second guess myself, all ended up feeling pretty intractable. I may try with some (much) simpler examples later, rather than trying to generate them from papers I legitimately found interesting. If a LLM is able to generalise the idea of ‘necessary prerequisites’ from easier concepts to harder ones, this itself seems extremely interesting and valuable.
Task: Identify key background knowledge required to understand a concept
Context: Many people are currently self-directing their learning in order to eventually be able to useful contribute to alignment research. Even among experienced researchers, people will sometimes come across concepts that require background they don’t have in order to understand. By ‘key’ background content, I’m imagining that the things which get identified are ‘one step back’ in the chain, or something like ‘the required background concepts which themselves require the most background’. This seems like the best way of making the tool useful, if the background concepts generated are themselves not understood by the user, they can just use the tool again on those concepts.
Input type: A paper (with the idea that part of the task is to identify the highest level concepts in the paper). It would also be reasonable to just have the name of a concept, with a separate task of ‘generate the highest level concept’.
Output type: At minimum, a list of concepts which are key background. Better would be a list of these concepts plus summaries of papers/textbooks/wikipedia entries which explain them.
Info considerations: This system is not biased towards alignment over capabilities, though I think it will in practice help alignment work more than capabilities work, due to the former being less well-served by mainstream educational material and courses. This does mean that having scraped LW and the alignment forum, alignment-relevant things on ArXiv, MIRI’s site etc. would be particularly useful
I don’t have capacity today to generate instances, though I plan to come back and do so. I’m happy to share credit if someone else jumps in first and does so though!
I’m confused about how valuable Language models are multiverse generators is as a framing. On the one hand, I find thinking in this way very natural, and did end up having what I thought were useful ideas to pursue further as I was reading it. I also think loom is really interesting, and it’s clearly a product of the same thought process (and mind(s?)).
On the other hand, I worry that the framing is so compelling mostly just because of our ability to read into text. Lots of things have high branching factor, and I think there’s a very real sense in which we could replace the post with Stockfish is a multiverse generator, Alphazero is a multiverse generator, or Piosolver is a multiverse generator, and the post would look basically the same, except it would seem much less beautiful/insightful, and instead just provoke a response of ‘yes, when you can choose a bunch of options at each step in some multistep process, the goodness of different options is labelled with some real, and you can softmax those reals to turn them into probabilities, your process looks like a massive tree getting split into finer and finer structure.’
There’s a slight subtlety here in that in the chess and go cases, the structure won’t strictly be a tree because some positions can repeat, and in the poker case the number of times the tree can branch is limited (unless you consider multiple hands, but in that case you also have possible loops because of split pots). I don’t know how much this changes things.