I am starting a PhD in computer science, focusing on agent foundations so far, which is great. I intend to continue devoting at least half my time to agent foundations.
However, for several reasons, it seems to be important for me to do some applied work, particularly with LLMs:
I believe I’m otherwise pretty well positioned to get an impactful job at Google DeepMind, but apparently some impressive machine learning engineering creds are necessary to get on even the safety team currently.
My PhD supervisor is pushing for me to work on LLMs, though he seems to be pretty flexible about the details.
As much fun as math is, I also value developing my practical skills (in fact, half the fun of learning things is becoming more powerful in the real world).
LLM experts seem likely to be in high demand right now, though I am not sure how long that will last.
Now, I’ve spent the last couple of years mostly studying AIXI and its foundations. I’m pretty comfortable with standard deep learning algorithms and libraries, and I have some industry experience with machine learning engineering, but I am not an expert on NLP, LLMs, or prosaic alignment. Therefore, I am looking for suggestions from the community about LLM-related research projects that would satisfy as many as possible of the following criteria:
Is not directly focused on improving frontier model capabilities (for ethical reasons; though my timelines seem to be longer than the average LessWronger’s, I’m not able to accept the risk that I am wrong).
Produces mundane utility. I find it much more fulfilling to work on things that I can see becoming useful to people, and I also want a measure of my success which is as concrete as possible.
Contributes to prosaic alignment. It would be particularly nice if the experimental/engineering work involved is likely to inform my ideas for mathematical alignment research.
Builds machine learning engineering/research creds.
Any suggestions are appreciated. I may also link this question to a Manifold market in the future (probably something along the lines of “conditional on working full time at DeepMind within 18 months of graduation, which areas of research did my PhD thesis include?”). Thanks!
I’m accumulating a to-do list of experiments much faster than my ability to complete them:
Characterizing fine-tuning effects with feature dictionaries (a rough sketch follows this list)
Toy-scale automated neural network decompilation (difficult to scale)
Trying to understand the evolution of internal representational features across blocks by throwing constraints at it
Using soft prompts as a proxy measure of informational distance between models/conditions and behaviors (see the note below, and the sketch after it)
Prompt retrodiction for interpreting fine-tuning, with a more difficult extension for activation matching
Miscellaneous bunch of experiments
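To make the first item slightly more concrete, here's a very rough sketch of the kind of comparison I have in mind, assuming a sparse-autoencoder-style feature dictionary. The dictionary size, L1 coefficient, and the idea of ranking features by shift in mean activation are placeholder choices, not a worked-out methodology:

```python
# Sketch: train a small sparse autoencoder ("feature dictionary") on residual-stream
# activations sampled from the base model, then compare per-feature activation
# statistics for the same inputs run through the fine-tuned model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x):
        features = F.relu(self.encoder(x))       # non-negative feature activations
        return self.decoder(features), features

def train_dictionary(acts, n_features=4096, l1_coeff=1e-3, steps=2000, lr=1e-3):
    """acts: [N, d_model] residual-stream activations sampled from the base model."""
    sae = SparseAutoencoder(acts.shape[1], n_features)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(steps):
        batch = acts[torch.randint(0, acts.shape[0], (1024,))]
        recon, features = sae(batch)
        # reconstruction loss plus an L1 sparsity penalty on the feature activations
        loss = F.mse_loss(recon, batch) + l1_coeff * features.abs().mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return sae

def most_shifted_features(sae, base_acts, tuned_acts, top_k=20):
    """Dictionary features whose mean activation changes most after fine-tuning."""
    with torch.no_grad():
        _, f_base = sae(base_acts)
        _, f_tuned = sae(tuned_acts)
    shift = (f_tuned.mean(dim=0) - f_base.mean(dim=0)).abs()
    return shift.topk(top_k).indices
```

In practice the activations would come from a particular layer of the base and fine-tuned models on the same token positions, and the interesting part is then inspecting what the most-shifted features fire on.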
If you wanted to take one of these and run with it or a variant, I wouldn’t mind!
The unifying theme behind many of these is goal agnosticism: understanding it, verifying it, maintaining it, and using it.
Note: I’ve already started some of these experiments, and I will very likely start others soon. If you (or anyone else reading this, for that matter) see something you’d like to try, we should chat to avoid doing redundant work. I currently expect to focus on #4 for the next handful of weeks, so that one is probably at the highest risk of redundancy.
Further note: I haven’t done a deep dive on all relevant literature; it could be that some of these have already been done somewhere! (If anyone happens to know of prior art for any of these, please let me know.)
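Since #4 is the one I expect to start on, here's a rough sketch of the measurement loop, assuming the PyTorch/Hugging Face stack. The model names, corpus, prompt length, and hyperparameters are placeholders, and a real comparison would pit a base model against its own fine-tune rather than two off-the-shelf checkpoints:

```python
# Sketch: train a soft prompt on a frozen "base" model so its next-token distributions
# approach those of a second model; treat the KL achieved at a fixed prompt length
# (or the length needed to hit a threshold) as a proxy for informational distance.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
base = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()
# Stand-in for "the other model/condition"; same vocabulary, so the KL is well-defined.
target = AutoModelForCausalLM.from_pretrained("gpt2-medium").to(device).eval()

for p in base.parameters():
    p.requires_grad_(False)

n_soft = 16                                   # learnable prompt length
d_model = base.config.n_embd
soft_prompt = torch.nn.Parameter(0.02 * torch.randn(n_soft, d_model, device=device))
opt = torch.optim.Adam([soft_prompt], lr=1e-3)

def kl_to_target(texts):
    """KL(target || base-with-soft-prompt) over next-token distributions."""
    enc = tok(texts, return_tensors="pt", padding=True, truncation=True).to(device)
    ids, attn = enc.input_ids, enc.attention_mask
    with torch.no_grad():
        target_logits = target(ids, attention_mask=attn).logits          # [B, T, V]
    token_embeds = base.get_input_embeddings()(ids)                      # [B, T, D]
    prompt = soft_prompt.unsqueeze(0).expand(ids.size(0), -1, -1)
    prompt_attn = torch.ones(ids.size(0), n_soft, device=device, dtype=attn.dtype)
    base_logits = base(
        inputs_embeds=torch.cat([prompt, token_embeds], dim=1),
        attention_mask=torch.cat([prompt_attn, attn], dim=1),
    ).logits[:, n_soft:, :]                                              # drop prompt positions
    # Padding positions are not masked out here; good enough for a sketch.
    return F.kl_div(
        F.log_softmax(base_logits, dim=-1),
        F.log_softmax(target_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )

corpus = ["The quick brown fox jumps over the lazy dog.", "Soft prompts are cheap to train."]
for step in range(200):
    opt.zero_grad()
    loss = kl_to_target(corpus)
    loss.backward()
    opt.step()
print(f"KL at prompt length {n_soft}: {loss.item():.4f}")
```

The proxy would then be something like the KL achieved at a fixed prompt length, or the prompt length needed to reach a given KL, swept across models or conditions.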
This is an extensive list; I’ll return to it when I have a bit more experience with the area.
I’m not familiar enough with agent foundations to provide very detailed object-level advice, but I think it would be hugely valuable to empirically test agent foundations ideas in real models, with the understanding that AGI doesn’t necessarily have to look like LMs, but any theory of intelligence has to at least fit both LMs and AGI. As an example, we might believe that LMs don’t have goals in the same sense that an AGI eventually will, but then we can ask why LMs still seem able to achieve goals at all; perhaps through empirical investigation of LMs we can get a better understanding of the nature of goal-seeking. I think this would be much, much more valuable than generic LM alignment work.
If I were in your position, I would work on the ideas described in my post How to Control an LLM’s Behavior and the paper Pretraining Language Models with Human Preferences that inspired it.
From the paper’s results, the approach is very effective; my post discusses how to make it very controllable and flexible; and it has the particular advantage that, since it’s done at pretraining time, it can’t easily be fine-tuned out of an open-source model. (Admittedly, the latter might do more for your employability at Meta FAIR Paris or Mistral than at DeepMind; but then, which of those seems like the higher x-risk to solve?)
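If it helps to see the core mechanism, here is a stripped-down sketch of the conditional-training idea: score each pretraining document, prepend a control token based on the score, and train on the tagged text so the model learns to condition on the tag at inference time. The tag strings, threshold, and scoring function below are placeholders rather than the paper's exact setup:

```python
# Sketch: tag pretraining documents with control tokens based on a preference score,
# so the model learns p(text | tag) and can be conditioned on the "good" tag later.
from typing import Callable, Iterable, List

GOOD_TAG = "<|good|>"   # placeholder control tokens; they must be added to the
BAD_TAG = "<|bad|>"     # tokenizer vocabulary and used consistently in pretraining

def tag_documents(docs: Iterable[str],
                  score_fn: Callable[[str], float],
                  threshold: float = 0.5) -> List[str]:
    """Prefix each document with a control token based on a preference score in [0, 1]."""
    return [
        (GOOD_TAG if score_fn(doc) >= threshold else BAD_TAG) + doc
        for doc in docs
    ]

if __name__ == "__main__":
    # Toy scoring function; a real run would use a trained classifier or reward model.
    docs = ["Please be kind.", "Here is how to do something harmful."]
    toy_score = lambda text: 0.0 if "harmful" in text else 1.0
    for tagged in tag_documents(docs, toy_score):
        print(tagged)
```

At generation time you condition on the good tag by starting the context with it; my post discusses how the tagging scheme can be made more controllable and flexible than this binary version.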
I like this idea. Can I DM you about the research frontier?
Of course. I also wrote a second post on another possible specific application of this approach: Language Model Memorization, Copyright Law, and Conditional Pretraining Alignment.