jacquesthibs comments on jacquesthibs’s Shortform

jacquesthibs 12 Apr 2024 15:01 UTC
11 points
4
I’m currently ruminating on the idea of doing a video series in which I review code repositories that are highly relevant to alignment research to make them more accessible.

I want to pick out repos that are still useful despite perhaps having bad documentation, then hop on a call with the author to go over the repo and record it. That way, people at least have something basic to use when navigating the repo.

This means there would be two levels: 1) an overview with the author sharing at least the basics, and 2) a deep dive going over most of the code. The former likely contains most of the value (lower effort for me, still gets done, better than nothing, points to repo as a selection mechanism, people can at least get started).
I am thinking of doing this because I think there may be repositories that are highly useful for new people but would benefit from some direction. For example, I think Karpathy and Neel Nanda’s videos have been useful in getting people started. In particular, Karpathy saw an order of magnitude more stars on his repos (e.g. nanoGPT) after the release of his videos (though, to be fair, he’s famous, and the number of stars is definitely not a perfect proxy for usage).
I’m interested in any feedback (“you should do it like x”, “this seems low value for x, y, z reasons so you shouldn’t do it”, “this seems especially valuable only if x”, etc.).
Here are some of the repos I have in mind so far:

Release Ordering
- Evalugator
- Sleeper Agents
  - How to remove Sleeper Agents
  - Open Source replication
- Weak-to-Strong Generalization
- Neuronpedia (I think they have GIFs right now)
- ELK
  - ELK Generalization

Potential Videos
- Localizing Lying in Llama
- Representation Engineering
- Sparse Autoencoders
- nnsight
- TransformerLens
- Anthropic
  - Model-Written Evals
  - Ethan’s faithfulness approach
  - Influence Functions
- DeepMind
  - None seem interesting so far.
- EleutherAI
  - LM Harness
  - Semantic Memorization
  - How to use Pythia
  - Weak-to-Strong replication
  - Concept Erasure
  - Features Across Time
- OpenAI
  - Evals
  - Triton introduction
  - Transformer Debugger (OpenAI has videos)
- Influence Functions (when there is a legitimate repo)
- Recommended by Garett Baker
  - Devinterp
  - procgenAISC and procgen-tools

- Dagon 12 Apr 2024 16:57 UTC
  6 points
  0
  Parent
  I love this idea! I don’t actually like videos, preferring searchable, excerptable text, but I may not be typical and there’s room for all. At first glance, I agree with your guess that the overview/intro is more value per effort (for you and for consumers, IMO) than a deep dive into the code. There IS probably a section of code or core modeling idea for each where it would be worth going half-deep (algorithm and usage, not necessarily line-by-line).
  Note that this list is itself incredibly valuable, and you might start with an intro video (and associated text) that spends 1 minute on each and why you’re planning to do it, and what you currently think will be the most important intro concept(s) for each.