Wait, I thought the first property was just independence, not also identically distributed.
In principle I could have e.g. two biased coins with their biases different but deterministically dependent.
I think:
Finding principles for AI “behavioural engineering” that reduce people’s desire to engage in risky races (e.g. because they find the principles acceptable) seems highly valuable
To the extent permitted by 1, pursuing something CEV-like (“we’re happier with the outcome in hindsight than we would’ve been with other outcomes”) also seems desirable
I sort of see the former as potentially encouraging diversity (because different groups want different things, and are most likely to agree to “everyone gets what they want”), but the latter may in fact suggest convergence (because, perhaps, there are fairly universal answers to “what makes people happy with the benefit of hindsight?”).
You stress the importance of having robust feedback procedures, but having overall goals like this can help to judge which procedures are actually doing what we want.
Your natural latents seem to be quite related to the common construction IID variables conditional on a latent—in fact, all of your examples are IID variables (or “bundles” of IID variables) conditional on that latent. Can you give me an interesting example of a natural latent that is not basically the conditionally IID case?
(I was wondering if the extensive literature on the correspondence between De Finetti type symmetries and conditional IID representations is of any help to your problem. I’m not entirely sure if it is, given that it mostly addresses the issue of getting from a symmetry to a conditional independence, whereas you want to get from one conditional independence to another, but it’s plausible some of the methods are applicable.)
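To be concrete about the construction I have in mind when I say “conditionally IID” (a minimal sketch; the uniform prior over the bias and the numbers are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Latent: an unknown coin bias theta.
# Observables: flips that are IID Bernoulli(theta) *conditional on* theta.
def sample(n_flips=2, n_trials=100_000):
    theta = rng.uniform(size=n_trials)                          # latent
    flips = rng.random((n_trials, n_flips)) < theta[:, None]    # IID given theta
    return theta, flips.astype(float)

theta, flips = sample()

# Marginally the flips are dependent (mixing over theta induces correlation)...
print("marginal corr:", np.corrcoef(flips[:, 0], flips[:, 1])[0, 1])

# ...but conditional on (a narrow slice of) theta they are ~independent.
mask = (theta > 0.69) & (theta < 0.71)
print("conditional corr:", np.corrcoef(flips[mask, 0], flips[mask, 1])[0, 1])
```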
If you’re in a situation where you can reasonably extrapolate from past rewards to future reward, you can probably extrapolate previously seen “normal behaviour” to normal behaviour in your situation. Reinforcement learning is limited—you can’t always extrapolate past reward—but it’s not obvious that imitative regularisation is fundamentally more limited.
(normal does not imply safe, of course)
Their empirical result rhymes with adversarial robustness issues—we can train adversaries to maximise ~arbitrary functions subject to a small-perturbation-from-ground-truth constraint. Here the maximised function is a faulty reward model and the constraint is KL to a base model instead of distance to a ground-truth image.
I wonder if multiscale aggregation could help here too, as it does with image adversarial robustness. We want the KL penalty to ensure that generations look normal at any “scale”, whether we look at them token by token or read a high-level summary of them. However, I suspect their “weird, low-KL” generations will have weird high-level summaries, whereas more desired policies would look more normal in summary (though it’s not immediately obvious whether this translates to low and high probability summaries respectively—one would need to test). I think a KL penalty to the “true base policy” should operate this way automatically, but as the authors note we can’t actually implement that.
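For what it’s worth, here’s a toy rendering of the failure mode, using the closed form for the optimum of a KL-regularised objective (the candidate outputs, rewards and beta are invented; this isn’t their setup, just the shape of it):

```python
import numpy as np

# Toy "generation" space: four candidate outputs.  The last one is a weird,
# low-base-probability output that a faulty reward model happens to love.
p_base   = np.array([0.49, 0.30, 0.20, 0.01])  # base policy probabilities
r_faulty = np.array([1.0, 0.9, 0.8, 5.0])      # faulty reward model scores
beta = 1.0                                      # KL penalty strength

# The maximiser of  E_pi[r] - beta * KL(pi || p_base)  has the closed form
#   pi*(x) proportional to p_base(x) * exp(r(x) / beta).
pi_star = p_base * np.exp(r_faulty / beta)
pi_star /= pi_star.sum()

kl = float(np.sum(pi_star * np.log(pi_star / p_base)))
print("pi*:", np.round(pi_star, 3))     # the weird output's mass jumps from 1% to ~37%
print("KL(pi*||base):", round(kl, 2))   # ...at only ~1 nat of KL cost
```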
Is your view closer to:
there are two hard steps (instruction following, value alignment), and of the two instruction following is much more pressing
instruction following is the only hard step; if you get that, value alignment is almost certain to follow
Mathematical reasoning might be specifically conducive to language invention because our ability to automatically verify reasoning means that we can potentially get lots of training data. The reason I expect the invented language to be “intelligible” is that it is coupled (albeit with some slack) to automatic verification.
There’s a regularization problem to solve for 3.9 and 4, and it’s not obvious to me that glee will be enough to solve it (3.9 = “unintelligible CoT”).
I’m not sure how o1 works in detail, but for example, backtracking (which o1 seems to use) makes heavy use of the pretrained distribution to decide on best next moves. So, at the very least, it’s not easy to do away with the native understanding of language. While it’s true that there is some amount of data that will enable large divergences from the pretrained distribution—and I could imagine mathematical proof generation eventually reaching this point, for example—more ambitious goals inherently come with less data, and it’s not obvious to me that there will be enough data in alignment-critical applications to cause such a large divergence.
There’s an alternative version of language invention where the model invents a better language for (e.g.) maths, then uses that for more ambitious projects, but that language is probably quite intelligible!
For what it’s worth, one idea I had as a result of our discussion was this:
We form lots of beliefs as a result of motivated reasoning
These beliefs are amenable to revision due to evidence, reason or (maybe) social pressure
Those beliefs that are largely resilient to these challenges are “moral foundations”
So philosophers like “pain is bad” as a moral foundation because we want to believe it + it is hard to challenge with evidence or reason. Laypeople probably have lots of foundational moral beliefs that don’t stand up as well to evidence or reason, but (perhaps) are equally attributable to motivated reasoning.
Social pressure is a bit iffy to include because I think lots of people relate to beliefs that they adopted because of social pressure as moral foundations, and believing something because you’re under pressure to do so is an instance of motivated reasoning.
I don’t think this is a response to your objections, but I’m leaving it here in case it interests you.
I can explain why I believe bachelors are unmarried: I learned that this is what the word bachelor means; I learned this because it is what bachelor means; and the fact that there’s a word “bachelor” that means “unmarried man” is contingent on some unimportant accidents in the evolution of language. A) it is certainly not the result of an axiomatic game, and B) if moral beliefs were also contingent on accidents in the evolution of language (I think most are not), that would have profound implications for metaethics.
Motivated belief can explain non-purely-selfish beliefs. I might believe pain is bad because I am motivated to believe it, but the belief still concerns other people. This is even more true when we go about constructing higher order beliefs and trying to enforce consistency among beliefs. Undesirable moral beliefs could be a mark against this theory, but you need more than not-purely-selfish moral beliefs.
I’m going to bow out at this point because I think we’re getting stuck covering the same ground.
Thanks for your continued engagement.
I’m interested in explaining foundational moral beliefs like “suffering is bad”, not beliefs like “animals do/don’t suffer”, which are about badness only because we accept the foundational assumption that suffering is bad. Is that clear in the updated text?
Now, I don’t think these beliefs come from playing axiomatic games like “define good as that which increases welfare”. There are many lines of evidence for this. First: “define bad as that which increases suffering” and “define good as that which increases suffering” are not equally plausible. We have pre-existing beliefs about this.
Second: you talk about philosophers analysing welfare. However, the method that philosophers use to do this usually involves analysing a bunch of fundamental moral assumptions. For example, from the Stanford Encyclopedia of Philosophy:
“Correspondingly, no amount of empirical investigation seems by itself, without some moral assumption(s) in play, sufficient to settle a moral question” (https://plato.stanford.edu/entries/metaethics/)
I am suggesting that the source of these fundamental moral assumptions may not be mysterious—we have a known ability to form beliefs based on what we want, and fundamental moral beliefs often align with what we want.
I think precisely defining “good” and “bad” is a bit beside the point—it’s a theory about how people come to believe things are good and bad, and we’re perfectly capable of having vague beliefs about goodness and badness. That said, the theory is lacking a precise account of what kind of beliefs it is meant to explain.
The LLM section isn’t meant as support for the theory, but speculation about what it would say about the status of “experiences” that language models can have. Compared to my pre-existing notions, the theory seems quite willing to accommodate LLMs having good and bad experiences on par with those that people have.
I have a pedantic and a non-pedantic answer to this. Pedantic: you say X is “usually considered good” if it increases welfare. Perhaps you mean to imply that if X is usually considered good then it is good. In this case, I refer you to the rest of the paragraph you quote.
Non-pedantic: yes, it’s true that once you accept some fundamental assumptions about goodness and badness you can go about theorising and looking for evidence. I’m suggesting that motivated reasoning is the mechanism that makes those fundamental assumptions believable.
I added a paragraph mentioning this, because I think your reaction is probably common.
Here’s a basic model of policy collapse: suppose there exist pathological policies of low prior probability (/high algorithmic complexity) such that they play the training game when it is strategically wise to do so, and when they get a good opportunity they defect in order to pursue some unknown aim.
Because they play the training game, a wide variety of training objectives will collapse to one of these policies if the system in training starts exploring policies of sufficiently high algorithmic complexity. So, according to this crude model, there’s a complexity bound: stay under it and you’re fine, go over it and you get pathological behaviour. Roughly, whatever desired behaviour requires the most algorithmically complex policy is the one that is most pertinent for assessing policy collapse risk (because that’s the one that contributes most of the algorithmic complexity, and so it gives your first-order estimate of whether or not you’re crossing the collapse threshold). So, which desired behaviour requires the most complex policy: is it, for example, respecting commonsense moral constraints, or is it inventing molecular nanotechnology?
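To make the threshold picture concrete, here’s a crude numerical version of that model (all constants are invented; it’s only meant to show the shape of the bound):

```python
# Crude model: among policies that fit the training objective perfectly,
# assume the learner effectively selects by prior weight 2**(-complexity).
C_PATH = 60   # complexity of the simplest "play the training game" policy (invented)
N_PATH = 8    # number of distinct pathological policies at roughly that level (invented)

def p_pathological(c_task):
    """P(selected policy is pathological), given the desired behaviour needs a
    policy of complexity c_task to fit the training objective perfectly."""
    w_benign = 2.0 ** (-c_task)
    w_path = N_PATH * 2.0 ** (-C_PATH)
    return w_path / (w_benign + w_path)

for c_task in (40, 55, 58, 60, 65, 80):
    print(c_task, round(p_pathological(c_task), 4))
# Well under the threshold you're fine; past it, collapse probability -> 1.
```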
Tangentially, the policy collapse theory does not predict outcomes that look anything like malicious compliance. It predicts that, if you’re in a position of power over the AI system, your mother is saved exactly as you want her to be. If you are not in such a position then your mother is not saved at all and you get a nanobot war instead or something. That is, if you do run afoul of policy collapse, it doesn’t matter if you want your system to pursue simple or complex goals, you’re up shit creek either way.
Algorithmic complexity is precisely analogous to difficulty-of-learning-to-predict, so saying “it’s not about learning to predict, it’s about algorithmic complexity” doesn’t make sense. One read of the original is: learning to respect common sense moral side constraints is tricky[1], but AI systems will learn how to do it in the end. I’d be happy to call this read correct, and it is consistent with the observation that today’s AI systems do respect common sense moral side constraints given straightforward requests, and that it took a few years to figure out how to do it. That read doesn’t really jibe with your commentary.
Your commentary seems to situate this post within a larger argument: teaching a system to “act” is different to teaching it to “predict” because in the former case a sufficiently capable learner’s behaviour can collapse to a pathological policy, whereas teaching a capable learner to predict does not risk such collapse. Thus “prediction” is distinguished from “algorithmic complexity”. Furthermore, commonsense moral side constraints are complex enough to risk such collapse when we train an “actor” but not a “predictor”. This seems confused.
First, all we need to turn a language model prediction into an action is a means of turning text into action, and we have many such means. So the distinction between text predictor and actor is suspect. We could consider an alternative knows/cares distinction: does a system act properly when properly incentivised (“knows”) vs does it act properly when presented with whatever context we are practically able to give it (“”“cares”””)? Language models usually act properly given simple prompts, so in this sense they “care”. So rejecting evidence from language models does not seem well justified.
Second, there’s no need to claim that commonsense moral side constraints in particular are so hard that trying to develop AI systems that respect them leads to policy collapse. It need only be the case that one of the things we try to teach them to do leads to policy collapse. Teaching values is not particularly notable among all the things we might want AI systems to do; it certainly does not seem to be among the hardest. Focussing on values makes the argument unnecessarily weak.
Third, algorithmic complexity is measured with respect to a prior. The post invokes (but does not justify) an “English speaking evil genie” prior. I don’t think anyone thinks this is a serious prior for reasoning about advanced AI system behaviour. The post is (according to your commentary, if not the post itself) making a quantitative point—values are sufficiently complex to induce policy collapse—but it’s measuring this quantity using a nonsense prior. If the quantitative argument was indeed the original point, it is mystifying why a nonsense prior was chosen to make it, and also why no effort was made to justify the prior.
the text proposes full value alignment as a solution to the commonsense side constraints problem, but this turned out to be stronger than necessary.
When do you think is the right time to work on these issues? Monitoring, trust displacement and fine-grained permission management all look liable to raise issues that weren’t anticipated and haven’t already been solved, because they’re not the way things have been done historically. My gut sense is that GPT-4 performance is much lower when you’re asking it to do novel things. Maybe it’s also possible to make substantial gains with engineering and experimentation, but you’ll need a certain level of performance in order to experiment.
Some wild guesses: maybe the right time to start work is one generation before it’s feasible, and that might mean starting now for fine-grained permissions, GPT-4.5 for monitoring, and GPT-5 for trust displacement.
The AI system builders’ time horizon seems to be a reasonable starting point
I’ve thought about it a bit and I have a line of attack for a proof, but there’s too much work involved in following it through to an actual proof, so I’m going to leave it here in case it helps anyone.
I’m assuming everything is discrete so I can work with regular Shannon entropy.
Consider the range $R_1$ of the function $g_1:\lambda\mapsto P(X_1\mid\Lambda=\lambda)$, and $R_2$ defined similarly. Discretize $R_1$ and $R_2$ (chop them up into little balls). I’m not sure which metric to use; maybe total variation.
Define $\Lambda'_1(\lambda)$ to be the index of the ball into which $P(X_1\mid\Lambda=\lambda)$ falls, and $\Lambda'_2$ similarly. So if $d(P(X_1\mid\Lambda=a),P(X_1\mid\Lambda=b))$ is sufficiently small, then $\Lambda'_1(a)=\Lambda'_1(b)$.
By the data processing inequality, conditions 2 and 3 still hold for $\Lambda'=(\Lambda'_1,\Lambda'_2)$. Condition 1 should hold with some extra slack depending on the coarseness of the discretization.
It takes a few steps, but I think you might be able to argue that, with high probability, for each $X_2=x_2$ the random variable $Q_1:=P(X_1\mid\Lambda'_1)$ will be highly concentrated (n.b. I’ve only worked this through fully in the exact case; I think it can be translated to the approximate case, but I haven’t checked). We then invoke the discretization to argue that $H(\Lambda'_1\mid X_1)$ is bounded. The intuition is that the discretization forces nearby probabilities to coincide, so if $Q_1$ is concentrated then it actually has to “collapse” most of its mass onto a few discrete values.
We can then make a similar argument, switching the indices, to get $H(\Lambda'_2\mid X_2)$ bounded. Finally, maybe applying conditions 2 and 3, we can get $H(\Lambda'_1\mid X_2)$ (and symmetrically $H(\Lambda'_2\mid X_1)$) bounded as well, which then gives a bound on $H(\Lambda'\mid X_i)$.
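Schematically, the chain of bounds I’m hoping for looks like this (writing $C(\varepsilon)$ for whatever slack the scale-$\varepsilon$ discretization introduces; I haven’t verified the constants, this is just the shape of the argument):

$$H(\Lambda'_1\mid X_1),\; H(\Lambda'_2\mid X_2),\; H(\Lambda'_1\mid X_2),\; H(\Lambda'_2\mid X_1) \;\lesssim\; C(\varepsilon)$$

$$\Rightarrow\quad H(\Lambda'\mid X_i)\;\le\; H(\Lambda'_1\mid X_i)+H(\Lambda'_2\mid X_i)\;\lesssim\; 2\,C(\varepsilon)\qquad (i=1,2).$$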
I did try feeding this to Gemini but it wasn’t able to produce a proof.