Team Shard Status Report

Team Shard is a nebulous alignment research collective, on paper siloed under John Wentworth’s SERI MATS program, but in reality extending its many tendrils far across the Berkeley alignment community. “Shard theory”—a name spoken of in hushed, mildly confused tones at many an EA hangout. This is their story (this month).

Epistemic status: A very quick summary of Team Shard’s current research, written up today. Careful summaries and actual results are forthcoming, so skip this unless you’re specifically interested in a quick overview of what we’re currently working on.

Introduction

This past month, Team Shard began its research into the relationship between the reinforcement schedules and learned values of RL agents. Our core MATS team is composed of yours truly, Michael Einhorn, and Quintin Pope. The greater Team Shard, however, is legion—its true extent only dimly suggested by the author names on its LessWrong writeups.

Our current path to impact is to (1) distill and expound shard theory and preregister its experimental predictions, (2) run RL experiments testing shard theory’s predictions about learned values, and (3) climb the interpretability tech tree, starting with large language models finetuned on value-laden text, to unlock more informative experiments. In the 95th-percentile, best-case world, we learn a bunch about how to reliably induce chosen values in extant RL agents by modulating their reinforcement schedules, and we are able to probe the structure of those induced values with interpretability tools.

Distillations

If you don’t understand shard theory’s basic claims and/or its relevance to alignment, stay tuned! A major distillation is forthcoming.

Natural Shard Theory Experiments in Minecraft

Uniquely, Team Shard already has a completed (natural) experiment under its belt! However, this experiment has a couple of nasty confounds, and even without those it would only have yielded a single bit of evidence for or against shard theory. But to summarize: OpenAI’s MineRL agent is able to, in the best case, craft a diamond pickaxe in Minecraft in 4 minutes (!). Usually, the instrumental steps the MineRL agent must pursue on the way to the diamond pickaxe … lie on the most efficient path to crafting it. So we can’t disentangle, from the model’s ordinary gameplay data, whether the model terminally values the journey or the destination: is reinforcement the model’s optimization target, or are its numerous in-distribution proxies among its terminal goals?

One Karolis Ramanauskas, thankfully, already did the hard work of finding out for us!

When you give the MineRL agent a full stack of diamonds at the get-go … it starts punching trees and crafting the basic Minecraft tools, rather than immediately crafting as many diamond pickaxes as possible. Theories that predict the journey rather than the destination rejoice!

Now, there’s a confound here, because the model was trained via reward shaping: it was rewarded some lesser amount for each of the instrumental steps along the way to the diamond pickaxe. Also, the model is quite stupid, despite its best-case properties. Whatever’s true about its terminal values, it may just be flailing around and messing up. Given the significant difficulty of even running (let alone further finetuning) the OpenAI MineRL model, along with the model’s stupidity confound, we opted to conduct our RL experiments in a more tractable (but still appreciably complex) environment.
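
To make the reward-shaping confound concrete, here is a minimal sketch of the kind of milestone-shaped reward involved. The item names and reward magnitudes below are illustrative placeholders, not the actual schedule used to train the agent:

```python
# Hypothetical milestone rewards -- names and values are illustrative,
# not the agent's real training signal.
MILESTONE_REWARDS = {
    "log": 1.0,              # punch a tree
    "planks": 2.0,
    "crafting_table": 4.0,
    "wooden_pickaxe": 8.0,
    "stone_pickaxe": 16.0,
    "iron_pickaxe": 64.0,
    "diamond": 128.0,
    "diamond_pickaxe": 256.0,  # the nominal "real" goal
}

def shaped_reward(prev_inventory: dict, inventory: dict) -> float:
    """Reward the first acquisition of each milestone item.

    Because every instrumental step is directly reinforced, behavior like
    punching trees is consistent both with terminally valuing the pickaxe
    and with terminally valuing the intermediate steps themselves.
    """
    reward = 0.0
    for item, value in MILESTONE_REWARDS.items():
        if prev_inventory.get(item, 0) == 0 and inventory.get(item, 0) > 0:
            reward += value
    return reward
```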

Learned Values in CoinRun

So we now have an RL agent playing CoinRun well!

We’re going to take it off-distribution and see whether it terminally values (1) just the coins, (2) a small handful of in-distribution proxies for getting coins, or (3) all of its in-distribution proxies for coins! Shard theory mostly bets that (3) will be the case, and is very nearly falsified if (1) is the case.
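
For concreteness, here is roughly what the evaluation loop looks like with the public procgen Gym interface. The policy below is a random-action placeholder standing in for our trained agent, and the simplest off-distribution probe shown here is just held-out levels; the specific level modifications we actually use will need more surgery than the stock environment exposes.

```python
import gym
import numpy as np

def average_return(policy, start_level, num_levels, episodes=10):
    """Roll out `policy` on CoinRun levels [start_level, start_level + num_levels)."""
    env = gym.make(
        "procgen:procgen-coinrun-v0",
        start_level=start_level,
        num_levels=num_levels,
        distribution_mode="hard",
    )
    returns = []
    for _ in range(episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(policy(obs))  # classic 4-tuple Gym API
            total += reward
        returns.append(total)
    return float(np.mean(returns))

# Placeholder policy: uniform over CoinRun's 15 discrete actions.
random_policy = lambda obs: np.random.randint(15)

print("training levels:", average_return(random_policy, start_level=0, num_levels=500))
print("held-out levels:", average_return(random_policy, start_level=10_000, num_levels=500))
```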

Feedback on Observable Monologues

We have successfully replicated some of the core results of the ROME paper on a GPT-style model! It looks like a language model finetuned on value-laden sentences stores facts about those sentences in the same place internally where it would store, e.g., facts about which city the Eiffel Tower is located in.
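
None of the below is the ROME causal-tracing or editing pipeline itself; it’s just a much-simplified sketch of “looking at where a fact lives internally,” dumping per-layer hidden states from an off-the-shelf GPT-2 via HuggingFace transformers. The model, prompt, and crude norm-based probe are placeholders rather than our actual setup:

```python
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Placeholder model and prompt -- stand-ins for our finetuned-on-values-text model.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

prompt = "The Eiffel Tower is located in the city of"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs)

# out.hidden_states is a tuple of (num_layers + 1) tensors, each [batch, seq, hidden].
# Looking at the final-token activations layer by layer is the crudest version of
# asking "where in the network does this fact get used?"
for layer, h in enumerate(out.hidden_states):
    print(f"layer {layer:2d}: final-token hidden-state norm = {h[0, -1].norm().item():.1f}")

# Top next-token prediction, just to confirm the fact is recalled at all.
next_id = out.logits[0, -1].argmax().item()
print("model's next token:", tokenizer.decode(next_id))
```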

From here, we will proceed to forcing the thing to observably monologue.[1] Think of this as a continuation of the ROME interpretability result. That is, this part of Team Shard is betting that we can climb the interpretability tech tree and thereby unlock important alignment experiments, including experiments that will reveal much about the shards active inside a model.

Conclusion

♫Look at me still talking when there’s science to do
When I look out there
It makes me GLaD I’m not you
I’ve experiments to run
There is research to be done
On the people who are
Still alive.♫

1. ^ externalize its reasoning