A math and computer science graduate interested in machine and animal cognition, philosophy of language, interdisciplinary ideas, etc.
Ben Amitay
I probably don’t understand the shortform format, but it seems like others can’t create top-level comments. So you can comment here :)
I had an idea for fighting goal misgeneralization. It doesn’t seem very promising to me, but it does feel close to something interesting. I’d like to read your thoughts:
Use IRL to learn which values are consistent with the actor’s behavior.
When training the model to maximize the actual reward, regularize it to get lower scores according to the values learned by IRL. That way, the agent is incentivized to signal not having any other values (and is somewhat incentivized against power seeking). A rough sketch is below.
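Here is a minimal sketch of how that could look in a one-step toy setting. It assumes a max-entropy-style stand-in for the IRL step and reads "other values" as the component of the inferred reward not explained by the true reward; `infer_reward_irl`, the orthogonal projection, and `LAMBDA` are illustrative assumptions, not part of the original idea.

```python
import numpy as np

# Toy one-step setting: K actions, a known true reward vector, and a policy
# given by a softmax over learnable logits. This is only a sketch of the idea
# above, not a full IRL pipeline.

rng = np.random.default_rng(0)
K = 5
r_true = rng.normal(size=K)   # the reward we actually want maximized
LAMBDA = 0.5                  # strength of the "no other values" penalty
lr = 0.5

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def infer_reward_irl(policy):
    """Stand-in for IRL: in a one-step max-entropy model, any reward of the
    form log(policy) + const is consistent with the observed behavior."""
    return np.log(policy + 1e-8)

logits = np.zeros(K)
for step in range(200):
    pi = softmax(logits)
    r_irl = infer_reward_irl(pi)               # values consistent with current behavior
    r_hat = r_true / np.linalg.norm(r_true)
    r_other = r_irl - (r_irl @ r_hat) * r_hat  # part not explained by the true reward
    v = r_true - LAMBDA * r_other              # reward minus penalty (held fixed this step)
    grad = pi * (v - pi @ v)                   # gradient of E_pi[v] w.r.t. the logits
    logits += lr * grad

print("final policy:", np.round(softmax(logits), 3))
print("greedy action under the true reward:", int(np.argmax(r_true)))
```

In a sequential setting the same structure would apply with returns in place of one-step rewards; the IRL step would be the expensive part.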
Ben Amitay’s Shortform
Writing this post as a rationality case study
It is beautiful to see that many of our greatest minds are willing to Say Oops, even about their most famous works. It may not score that many winning-points, but it does restore quite a lot of dignity-points, I think.
Learning without Gradient Descent: it is now much easier to imagine learning without gradient descent. An LLM can add knowledge, meta-cognitive strategies, code, etc. to its context, or even save them to a database.
It is very similar to value change due to inner misalignment or self-improvement, except that it happens not literally inside the model but inside its extended cognition.
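A minimal sketch of what that "extended cognition" loop could look like, assuming a plain JSON file as the database and a hypothetical `call_llm` stand-in for whatever completion API is used:

```python
import json
from pathlib import Path

MEMORY_PATH = Path("memory.json")   # illustrative location for the "database"

def load_memory() -> list:
    return json.loads(MEMORY_PATH.read_text()) if MEMORY_PATH.exists() else []

def save_memory(notes: list) -> None:
    MEMORY_PATH.write_text(json.dumps(notes, indent=2))

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; replace with a real API."""
    return "(model output for: " + prompt[:40] + "...)"

def solve(task: str) -> str:
    notes = load_memory()
    # The weights never change; "learning" happens by re-injecting stored notes.
    prompt = "Lessons learned so far:\n" + "\n".join(notes) + "\n\nTask: " + task
    answer = call_llm(prompt)
    # Ask the model to extract a reusable lesson and store it for next time.
    lesson = call_llm("Task: " + task + "\nAnswer: " + answer + "\nState one reusable lesson:")
    save_memory(notes + [lesson])
    return answer

print(solve("example task"))
```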
In another comment on this post I suggested an alternative entropy-inspired expression, which I took from RL. To the best of my knowledge, it came into the RL context from FEP or active inference, or is at least acknowledged to be related.
I don’t know about the specific Friston reference, though.
I agree with all of it. I think I threw the N in there because average utilitarianism is super counterintuitive to me, so I tried to make it total utility.
And also about the weights: to value equality is basically to weight the marginal happiness of the unhappy more than that of the already-happy. Or, when behind the veil of ignorance, to consider yourself unlucky and therefore more likely to be born as the unhappy. Or what you wrote.
I think that the thing you want is probably to maximize N*sum(u_i exp(-u_i/T))/sum(exp(-u_i/T)) or -log(sum(exp(-u_i/T))), where u_i is the utility of the i-th person and N is the number of people; I'm not sure which. That way, in one limit you get the veil of ignorance for expected-utility maximizers, and in the other limit you get the Rawlsian veil of ignorance (extreme risk aversion).
That way you also don’t have to treat the mean utility separately.
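A quick numeric check of the two limits of the first expression, with an arbitrary toy utility vector (the temperatures below are just illustrative):

```python
import numpy as np

def softmin_weighted_total(u, T):
    """N * sum(u_i * exp(-u_i / T)) / sum(exp(-u_i / T))"""
    w = np.exp(-(u - u.min()) / T)   # subtracting the min only rescales the weights, but avoids overflow
    return len(u) * (u @ w) / w.sum()

u = np.array([1.0, 2.0, 5.0, 10.0])
N = len(u)

# Large T: the weights become uniform, so the expression tends to N * mean(u) = total utility.
print(softmin_weighted_total(u, 1e6), "vs", N * u.mean())

# Small T: the weights concentrate on the least happy person, so it tends to N * min(u) (maximin-like).
print(softmin_weighted_total(u, 1e-3), "vs", N * u.min())
```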
It’s not a full answer, but: to the degree that the quantities really do align with the standard basis, it must somehow be a result of an asymmetry of the activation function. For example, ReLU trivially depends on the choice of basis.
If you focus on the ReLU example, it sort of makes sense: if multiple unrelated concepts are expressed in the same neuron, and one of them pushes the neuron in the negative direction, it may make the ReLU destroy information about the other concepts.
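A minimal numeric illustration of that basis dependence, using a random rotation and a random activation vector (nothing specific to the post being discussed):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# A random rotation (orthogonal matrix) and a random activation vector.
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
x = rng.normal(size=4)

# Rotating and then applying ReLU differs from applying ReLU and then rotating,
# so ReLU singles out the standard basis; a purely linear layer would commute with Q.
print("relu(Q @ x):", np.round(relu(Q @ x), 3))
print("Q @ relu(x):", np.round(Q @ relu(x), 3))
```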
Sorry for going off-topic. I will not consider it rude if you stop reading here and reply with “just shut up”, but I do think that it is important:
A) I do agree that the first problem to address should probably be misalignment of the reward with our values, and that some of the proposed problems are not likely in practice, including some versions of the planning-inside-the-world-model example.
B) I do not think that planning inside the critic or evaluating inside the actor is an example of that, because the functions that those two models are optimized to approximate reference each other explicitly in their definitions. That doesn’t mean the critic is likely to one day kill us, just that we should take it into account when we try to understand what is going on.
C) Specifically, it implies two additional non-exotic alignment failures:
The critic itself did not converge to be a good approximation of the value function.
The actor did not converge to be a thing that maximizes the output of the critic, and instead maximizes something else.
I see. I didn’t fully adapt to the fact that not all alignment is about RL.
Beside the point: I think those labels on the data structures are very confusing. Both the actor and the critic are very likely to have their own specialized world models (projected from the labeled world model) and planning abilities. The values of the actor need not be the same as the output of the critic. And value-related and planning-related things may easily leak into the world model if you don’t actively try to prevent it. So I suspect that we should ignore the labels and focus on architecture and training methods.
Yes, I think that was it; and that I did not (and still don’t) understand what about that possible AGI architecture is non-trivial and has non-trivial implications for alignment, even if not ones that make it easier. It seems like not only the same problems carefully hidden, but the same flavor of the same problems in plain sight.
I haven’t read the original paper yet, but from what you describe, I don’t understand how the remaining technical problem is not basically the whole of the alignment problem. My understanding of what you say is that he is vague about the values we want to give the agent, and not knowing how to specify human values is kind of the point (that, and inner alignment, which I don’t see addressed either).
I didn’t think much about the mathematical problem, but I think that the conjecture is at least wrong in spirit, and that LLMs are a good counterexample to that spirit. An LLM on its own is not very good at being an assistant, but you need a pretty small amount of optimization to steer the existing capabilities toward being a good assistant. I think about it as “the assistant was already there, with very small but not negligible probability”, so in a sense “the optimization was already there”, but not in a sense that is easy to capture mathematically.
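One hedged way to put a number on “the optimization was already there” (an illustrative framing, not something from the original conjecture): if assistant-like behavior $A$ has small but non-negligible probability $p$ under the base model $\pi_0$, then a fine-tuned policy $\pi$ that exhibits $A$ almost always needs only about $\log(1/p)$ nats of KL divergence from the base model. The inequality below (from the data-processing inequality applied to the indicator of $A$) gives the lower bound, and conditioning $\pi_0$ on $A$ achieves it:

$$
D_{\mathrm{KL}}(\pi \,\|\, \pi_0) \;\ge\; \Pr_{\pi}[A]\log\frac{\Pr_{\pi}[A]}{\Pr_{\pi_0}[A]} + \big(1-\Pr_{\pi}[A]\big)\log\frac{1-\Pr_{\pi}[A]}{1-\Pr_{\pi_0}[A]} \;\approx\; \log\frac{1}{p} \quad \text{when } \Pr_{\pi}[A]\approx 1 .
$$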
[Question] Semantics, Syntax and Pragmatics of the Mind?
Hi, sorry for commenting on an ancient comment, but I just read it again and found that I’m not convinced that the mesa-optimizer problem is relevant here. My understanding is that if you switch goals often enough, every mesa-optimizer that isn’t corrigible should be trained away, as it hurts the utility as defined.
To be honest, I do not expect RLHF to do that. “Do the thing that makes people press like” doesn’t seem to me like an ambitious enough problem to unlock much (though the buried Predictor may extrapolate from short-term competence to an ambitious mask). But if that is true, someone will eventually be tempted to be more… creative about the utility function. I don’t think you can train it on “maximize Microsoft share value” yet, but I expect it to be possible in a decade or two, and maybe less for some other dangerous utility.
I think that is a good post, and I strongly agree with most of it. I do think, though, that the role of RLHF, or RL fine-tuning in general, is underemphasized. My fear isn’t that the Predictor by itself will spawn a super agent, even due to a very special prompt.
My fear is that it may learn good enough biases that RL can push it significantly beyond human level. That it may take the strengths of different humans and combine them. That it wouldn’t be like imitating the smartest person, but a cognitive version of “create a super-human by choosing genes carefully from just the existing genetic variation”, which I strongly suspect is possible. Imagine that there are 20 core cognitive problems that you learn to solve in childhood, and that you may learn better or worse algorithms to solve them. Imagine Feynman got 12 of them right, and Hitler got his power because he got another 5 of them right. If RL can therefore produce something like a human who got 17 of them right, that’s a big problem. If it can extrapolate what a human would be like if their working memory were just twice as large: a big problem.
I seem to be the only one who read the post that way, so I probably read my own opinions into it, but my main takeaway was pretty much that people with your (and my) values are often shamed into pretending to have other values and inventing excuses for how their values are consistent with their actions, while it would be more honest and productive if we took a more pragmatic approach to cooperating around our altruistic goals.