See also: Your posts should be on Arxiv
I do agree we’re leaving lots of value on the table and even causing active harm by not writing things up well, at least for Arxiv, for a bunch of reasons including some of the ones listed here.
It’s good to see some informed critical reflection on MI as there hasn’t been much AFAIK. It would be good to see reactions from people who are more optimistic about MI!
I see. In that case, what do you think of my suggestion of inverting the LM? By default, it maps human reward functions to behavior. But when you invert it, it maps behavior to reward functions (possibly a one-to-many mapping, but that ambiguity is a problem you can solve with more diverse behavior data). Then you could use it for IRL (with some of the caveats I mentioned).
Which may be necessary since this:
The LM itself is directly mapping human behaviour (as described in the prompt) to human rewards/goals (described in the output of the LM).
...seems like an unreliable mapping, since training data of the form “person did X, therefore their goal must be Y” is firstly rare and, more importantly, inaccurate/incomplete, because it’s hard to describe human goals in language. On the other hand, human behavior seems easier to describe in language.
Do I read right that the suggestion is as follows:
Overall we want to do inverse RL (like in our paper) but we need an invertible model that maps human reward functions to human behavior.
You use an LM as this model. It needs to take some useful representation of reward functions as input (it could do so if those reward functions are a subset of natural language)
You observe a human’s behavior and invert the LM to infer the reward function that produced the behavior (or the set of compatible reward functions)
Then you train a new model using this reward function (or functions) to outperform humans
This sounds pretty interesting! Although I see some challenges:
How can you represent the reward function? On the one hand, an LM (or another behaviorally cloned model) should use it as an input so it should be represented as natural language. On the other hand some algorithm should maximize it in the final step so it would ideally be a function that maps inputs to rewards.
Can the LM generalize OOD far enough? It’s trained on human language which may contain some natural language descriptions of reward functions, but probably not the ‘true’ reward function which is complex and hard to describe, meaning it’s OOD.
How can you practically invert an LM? (One rough approach is sketched after this list.)
What to do if multiple reward functions explain the same behavior? (probably out of scope for this post)
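On the practical-inversion question (the sketch referenced above): rather than literally inverting the LM, one cheap approximation is to score a set of candidate reward descriptions by how likely the LM finds the observed behavior given each one, and keep the best-scoring candidates. A minimal sketch, assuming a Hugging Face causal LM; the prompt format, candidates, and helper names are placeholders, not a worked-out method:

```python
# Approximate "LM inversion" by scoring candidate reward descriptions:
# rank them by log P(observed behavior | reward description) under the LM.
# Model, prompt format, and candidates are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def log_likelihood(prompt: str, continuation: str) -> float:
    """Sum of log P(continuation token_i | prompt, earlier continuation tokens)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    # Token at position i is predicted by the logits at position i - 1.
    # (Assumes the prompt/continuation tokenization boundary lines up, which
    # holds here because the continuation starts with a space.)
    for i in range(prompt_ids.shape[1], full_ids.shape[1]):
        total += log_probs[0, i - 1, full_ids[0, i]].item()
    return total

observed_behavior = " they spent the weekend practising scales on the piano."
candidate_rewards = [
    "This person wants to become a skilled musician.",
    "This person wants to maximize their income.",
    "This person wants to avoid social contact.",
]

scores = {
    reward: log_likelihood(f"Goal: {reward}\nBehavior:", observed_behavior)
    for reward in candidate_rewards
}
print(max(scores, key=scores.get))
```

A more serious version would put a prior over reward descriptions (i.e. do the Bayesian update rather than pure likelihood ranking) and search or sample candidates instead of enumerating them by hand, which is where the one-to-many ambiguity above would show up.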
Great to see this studied systematically—it updated me in some ways.
Given that the study measures how likeable, agreeable, and informative people found each article, regardless of the topic, could it be that the study measures something different from “how effective was this article at convincing the reader to take AI risk seriously”? In fact, it seems like the contest could have been won by an article that isn’t about AI risk at all. The top-rated article (Steinhardt’s blog series) spends little time explaining AI risk: Mostly just (part of) the last of four posts. The main point of this series seems to be that ‘More Is Different for AI’, which is presumably less controversial than focusing on AI risk, but not necessarily effective at explaining AI risk.
Not sure if any of these qualify but: Military equipment, ingredients for making drugs, ingredients for explosives, refugees and travelers (being transferred between countries), stocks and certificates of ownership (used to be physical), big amounts of cash. Also I bet there was lots of registration of goods in planned economies.
Another advantage of Chinese leadership in AI: while right now they have less alignment research than the West, they may be better at scaling it up at crunch time: they have more control over what companies and people work on, a bigger government, and a better track record at pulling off major projects like controlling COVID and, well, large-scale ‘social engineering’.
One way to convert: measure how accurate the LM is at word-level prediction by measuring its likelihood of each possible word. For example, the LM’s likelihood of the word “[token A][token B]” could be P(token A | context) × P(token B | context, token A).
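A minimal sketch of that conversion, assuming a Hugging Face causal LM (the model and example strings are placeholders): the word-level log-probability is just the chain rule summed over the word’s tokens.

```python
# Word-level probability from token-level predictions via the chain rule:
# log P(word | context) = sum_i log P(token_i | context, earlier word tokens).
# Model and example strings are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def word_log_prob(context: str, word: str) -> float:
    context_ids = tokenizer(context, return_tensors="pt").input_ids
    word_ids = tokenizer(word, return_tensors="pt").input_ids
    ids = torch.cat([context_ids, word_ids], dim=1)
    with torch.no_grad():
        log_probs = torch.log_softmax(model(ids).logits, dim=-1)
    # Token at position i is predicted by the logits at position i - 1.
    return sum(log_probs[0, i - 1, ids[0, i]].item()
               for i in range(context_ids.shape[1], ids.shape[1]))

print(word_log_prob("The capital of France is", " Paris"))
```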
Playing this game made me realize that humans aren’t trained to predict at the token level. I don’t know the token-level vocabulary and made lots of mistakes by missing spaces and punctuation. Is it possible to convert the token-level prediction into word-level prediction? This may get you a better picture of human ability.
Relevant: Last Layer Re-Training is Sufficient for Robustness to Spurious Correlations.
They argue that the pre-trained network already learns some non-confused features but doesn’t use them. And you just need to fine-tune the last layer to utilize them.
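For intuition, the recipe is roughly: freeze the pretrained backbone, replace the final linear layer, and refit only that layer on data chosen to break the spurious correlation. A minimal PyTorch sketch with a placeholder backbone and stand-in data, not the paper’s actual setup:

```python
# Last-layer re-training sketch: keep pretrained features fixed, refit only
# the final linear head. Backbone choice and data are stand-ins.
import torch
import torch.nn as nn
import torchvision

num_classes = 2
model = torchvision.models.resnet50(weights="DEFAULT")   # pretrained backbone
for param in model.parameters():
    param.requires_grad = False                           # freeze learned features
model.fc = nn.Linear(model.fc.in_features, num_classes)   # fresh last layer
model.eval()  # also keeps BatchNorm statistics fixed; only model.fc will train

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Stand-in for a small re-training set chosen to break the spurious correlation
# (e.g. group-balanced data); random tensors here just so the sketch runs.
retraining_loader = [(torch.randn(8, 3, 224, 224), torch.randint(0, num_classes, (8,)))
                     for _ in range(10)]

for images, labels in retraining_loader:
    loss = loss_fn(model(images), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```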
We’ll be able to fine-tune in the test environment so won’t experience OOD at deployment, and while changes will happen, continual fine-tuning will be good enough to stop the model from ever being truly OOD. We think this may apply in settings where we’re using the model for prediction, but it’s unclear whether continual fine-tuning will be able to help models learn and adapt to the rapid OOD shifts that could occur when the models are transferred from offline learning to online interaction at deployment.
Couldn’t the model just fail at the start of fine-tuning (because it’s causally confused), then learn in a decision setting to avoid causal confusion, and then no longer be causally confused?
If no—I’m guessing you expect that the model only unlearns some of its causal confusion. And there’s always enough left so that after the next distribution shift the model again performs poorly. If so, I’d be curious why you believe that the model won’t unlearn all or most of its causal confusion.
This distillation was useful for me, thanks for making it! As feedback, I got stuck at the bullet-point explanation of imitative generalization. There was not enough detail to understand it, so I had to read Beth’s post first and try to connect it to your explanation. For example, what kind of changes are we considering? To what model? How do you evaluate whether a change lets the human make better predictions?
A large amount of math describes the relations between agents at the same level of analysis: this is almost all of game theory. [...] our focus is on “vertical” relations, between composite agents and their parts.
This seems to be what is studied in the fields of organizational economics and to some extent in industrial organization / vertical integration. These fields have a great deal of game theory on vertical relationships, particularly relationships between the firm and its employees, managers, and contractors. Some of this can probably be ported to your interfaces. These fields are unsolved though, which means there’s work left to do, but also that it’s been difficult to find simple solutions, perhaps because you’re modeling complex phenomena.
I like your section on self-unaligned agents btw. Curious what comes out of your centre.
My point is that, while PCIe bandwidths aren’t increasing very quickly, it’s easy to increase the number of machines you use. So you can distribute each NN layer (width-wise) across many machines, each of which adds to the total bandwidth you have.
(As noted in the previous comment, you can do this with <<300GB of total GPU memory for GPT-3 with something like ZeRO-infinity)
Beware bandwidth bottlenecks, as I mentioned in my original post.
Presumably bandwidth requirements can be reduced a lot through width-wise parallelism. Each GPU only has to load one slice of the model then. Of course you’ll need more GPUs then but still not a crazy number as long as you use something like ZeRO-infinity.
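A rough back-of-envelope along those lines (every number below is an assumption, not a measurement): with the ~350 GB of fp16 GPT-3 weights sliced width-wise across N machines, each machine only streams 1/N of the weights per forward pass, so the per-machine PCIe requirement falls linearly.

```python
# Back-of-envelope: per-machine weight-streaming load under width-wise sharding.
# All numbers are rough assumptions.
params = 175e9                      # GPT-3 parameter count
total_weight_bytes = params * 2     # fp16 -> ~350 GB
pcie_bw = 32e9                      # ~32 GB/s usable per machine (assumed)

for n_machines in (1, 8, 64):
    per_machine = total_weight_bytes / n_machines
    print(f"{n_machines:3d} machines: {per_machine / 1e9:6.1f} GB each, "
          f"~{per_machine / pcie_bw:5.2f} s to stream the weights once")
```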
(Yes, 8x gpu->gpu communications will hurt overall latency… but not by all that much I don’t think. 1 second is an eternity.)
Width-wise communication, if you mean that, can be quite a latency bottleneck for training. And it gets worse when you make the model wider or the batch bigger, which of course people are constantly doing. But for inference I guess you can reduce the latency if you’re willing to use a small batch size.
Thanks for elaborating, I think I know what you mean now. I missed this:
I am talking about pipelining loading the NN weights into the GPU. Which is not dependent on the result of the previous layer’s computation.
My original claim was that ZeRO-infinity has higher latency compared to pipelining across many layers of GPUs so that you don’t have to repeatedly load weights from RAM. But as you pointed out, ZeRO-infinity may avoid the additional latency by loading the next layer’s weights from RAM at the same time as computing the previous layer’s output. This helps IF loading the weights is at least as fast as computing the outputs. If this works, we may be able to deploy massive future neural nets on clusters no bigger than the ones we have today.
My original claim was therefore misconceived. I’ll revise it to a different claim: bigger neural nets ought to have higher inference latency in general—regardless of whether we use ZeRO-infinity or not. As I think we both agree, pipelining, in the sense of using different GPUs to compute different layers, doesn’t reduce latency. However, adding more layers increases latency, and it’s hard to compensate with other forms of parallelism. (Width-wise parallelism could help, but its communication cost scales unfavorably: it grows as we grow the NN’s width, and then again when we try to reduce latency by reducing the number of neurons per GPU [edit: it’s not quadratic, I was thinking of the parameter count].) Does that seem right to you?
The consequence then would be that inference latency (if not inference cost) becomes a constraint as we grow NNs, at least for applications where latency matters.
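To make the shape of that argument explicit, here is a toy per-request latency model (every constant is a made-up assumption): latency scales with depth, weight loading can only hide behind compute when it is at least as fast, and width-wise communication adds a per-layer term that pipelining can’t remove.

```python
# Toy per-request inference latency model (all constants are made-up).
def per_request_latency_s(n_layers, compute_per_layer_s, load_per_layer_s,
                          comm_per_layer_s=0.0):
    # With perfect overlap each layer costs max(compute, weight loading),
    # plus any cross-GPU communication from width-wise sharding.
    return n_layers * (max(compute_per_layer_s, load_per_layer_s) + comm_per_layer_s)

# Pipelining across more GPUs doesn't change this number; it only overlaps
# *different* requests. Depth hits per-request latency directly:
for n_layers in (96, 192, 384):
    print(n_layers, per_request_latency_s(n_layers,
                                          compute_per_layer_s=2e-3,
                                          load_per_layer_s=1e-3,
                                          comm_per_layer_s=0.5e-3))
```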
The key is: pipelining doesn’t help with latency of individual requests. But that’s not what we care about here. What we care about is the latency from starting request 1 to finishing request N
Thanks for the examples. Your point seems to be about throughput, not latency (which to my knowledge is defined on a per-request basis). The latency per request may not matter for training but it does matter for inference if you want your model to be fast enough to interact with the world in real time or faster.
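For a concrete version of that distinction (stage count and durations are arbitrary): with L pipeline stages of duration t each, request i finishes at roughly (L + i - 1)·t, so throughput approaches one request per t while any single request still takes L·t.

```python
# Pipelining improves throughput, not per-request latency (illustrative numbers).
L = 8        # pipeline stages, e.g. groups of layers on different GPUs
t = 0.05     # seconds per stage
N = 100      # requests

per_request_latency = L * t               # what a single user experiences
time_to_finish_all = (L + N - 1) * t      # start of request 1 to end of request N
throughput = N / time_to_finish_all       # ~1/t requests per second for large N

print(per_request_latency, time_to_finish_all, throughput)
```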
Interesting point. Though on this view, “Deceptive alignment preserves goals” would still become true once the goal has drifted to some random maximally simple goal for the first time.
To be even more speculative: Goals represented in terms of existing concepts could be simple and therefore stable by default. Pretrained models represent all kinds of high-level states, and weight-regularization doesn’t seem to change this in practice. Given this, all kinds of goals could be “simple” as they piggyback on existing representations, requiring little additional description length.