Having re-read the posts and thought about it some more, I do think zero-sum competition could be applied to logical inductors to resolve the futarchy hack. It would require minor changes to the formalism to accommodate it, but I don't see how those changes would break anything else.
I think the tie-in to market-making, and to other similar approaches like debate, is in interpreting the predictions. While the examples in this post were only for the two-outcome case, we would probably want predictions over orders of magnitude more outcomes for the higher informational density. Since evaluating distributions over a double-digit number of outcomes already starts posing problems (sometimes even high single digits), a process to direct a decision maker's attention is necessary.
I've been thinking of a proposal like debate, where both sides go back and forth proposing clusters of outcomes based on shared characteristics. Ideally, in equilibrium, the first debater should propose the smallest number of clusters such that splitting them further doesn't change the decision maker's mind. This could also be thought of in terms of market-making, where rather than the adversary proposing a string, they propose a further subdivision of the existing clusters.
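To make the stopping condition concrete, here is a minimal toy sketch of the refinement loop described above; the outcome labels, probabilities, utilities, and the coarse `judge` heuristic are all hypothetical stand-ins of my own, not anything from the post:

```python
from statistics import mean

# Hypothetical predicted distributions P(outcome | action) for two actions.
prediction = {
    "action_A": {"good": 0.1, "ok": 0.9, "bad": 0.0},
    "action_B": {"good": 0.5, "ok": 0.0, "bad": 0.5},
}
utility = {"good": 1.0, "ok": 0.2, "bad": 0.0}  # stand-in for human judgment

def judge(clusters):
    """Coarse decision maker: values a cluster by the plain average utility of
    its outcomes (it cannot look inside a cluster), then picks the action whose
    clustered summary scores highest."""
    def score(action):
        dist = prediction[action]
        return sum(sum(dist[o] for o in c) * mean(utility[o] for o in c)
                   for c in clusters)
    return max(prediction, key=score)

def refine(clusters):
    """Debater's move, crudely modeled: split every multi-outcome cluster into
    singletons. (A real debater would propose the most decision-relevant split.)"""
    return [[o] for c in clusters for o in c]

clusters = [list(utility)]           # start with everything in one cluster
while True:
    finer = refine(clusters)
    if finer == clusters or judge(finer) == judge(clusters):
        break                        # splitting further doesn't change the decision
    clusters = finer

print(judge(clusters), clusters)     # -> action_B with singleton clusters
```

In this toy, the coarse summary produces a tie that defaults to action_A, the first subdivision flips the choice to action_B, and the process then stops because further splitting changes nothing.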
I like the use case of understanding predictions for debate/market-making, because the prediction itself acts as a ground truth. Then there's no need to anticipate/reject a ton of counterarguments based on potential lies; rather, arguments are limited to selectively revealing the truth. It is probably important that the predictors are separate models from the analyzer, to avoid contamination of the objectives. The proof of Theorem 6, which skips to the end of the search process, needs to use a non-zero-sum prediction for that result.
As an aside, I also did some early work on decision markets, distinct from your post on market-making, since Othman and Sandholm had an impossibility result for those too. However, the results were ultimately trivial. Once you can use zero-sum competition to costlessly get honest conditional predictions, then as soon as you can pair off entrants to the market it becomes efficient. But that raises the question of why one would use a decision market in the first place instead of just querying experts.
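For concreteness, here is a rough sketch of the pairing idea, assuming a log scoring rule; the market structure, names, and numbers are my own illustration rather than anything from that work:

```python
import math

ACTIONS = ["fund", "dont_fund"]

def pick_action(report_a, report_b, value):
    """Decision maker: average the pair's conditional reports P(outcome | action)
    and take the action with the highest expected value."""
    def ev(action):
        return sum(0.5 * (report_a[action][o] + report_b[action][o]) * value[o]
                   for o in value)
    return max(ACTIONS, key=ev)

def zero_sum_payoffs(report_a, report_b, action, outcome):
    """Each entrant's payoff is its own log score minus its partner's, evaluated
    only on the action actually taken (the other conditionals go unscored)."""
    s_a = math.log(report_a[action][outcome])
    s_b = math.log(report_b[action][outcome])
    return s_a - s_b, s_b - s_a

value = {"success": 1.0, "failure": 0.0}
report_a = {"fund": {"success": 0.7, "failure": 0.3},
            "dont_fund": {"success": 0.4, "failure": 0.6}}
report_b = {"fund": {"success": 0.6, "failure": 0.4},
            "dont_fund": {"success": 0.5, "failure": 0.5}}

action = pick_action(report_a, report_b, value)                     # -> "fund"
payoff_a, payoff_b = zero_sum_payoffs(report_a, report_b, action, "success")
```

The rough intuition is that, because each member of a pair is paid only relative to its partner, neither gains by shading its conditional reports to steer which action gets taken.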
With respect to pre-training, I agree that it's not easy to incorporate. I'm not sure how any training regime that only trains on data where the prediction has no effect can imbue incentives that generalize in the desired way to situations where predictions do affect the outcome. If you do get a performative predictor out of pre-training, then as long as it's myopic you might be able to train the performativity out of it in safe, controlled scenarios (and if it's not myopic, it's a risk whether it's performative or not). That was part of my reasoning for the second experiment, checking how well performativity could be trained out.
For incorporating this into an ongoing pre-training process, human decisions are likely too expensive, but the human is probably not the important part. Instead, predictions where performativity is possible by influencing simple AI decision makers could be mixed into the pre-training data. Defining a decision problem environment of low or medium complexity is not too difficult, and I suspect previous-generation models would be able to do a good job generating many examples. One danger is that the model learns only not to predict performatively in those scenarios (and similarly that untraining afterwards applies only to the controlled environments), though I think that's a somewhat unnatural generalization.
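As a rough sketch of what such a low-complexity environment could look like (my own construction; the names, probabilities, and decision rule are hypothetical): the prediction feeds a scripted decision maker whose action then changes the true outcome distribution, so performative predictions are possible and can be checked for.

```python
import random

def outcome_prob(action: str) -> float:
    """P(success | action): the decision maker's choice, which depends on the
    prediction, feeds back into the true outcome distribution."""
    return 0.8 if action == "invest" else 0.3

def scripted_decision_maker(predicted_success: float) -> str:
    """Simple AI decision maker: invest iff predicted success prob > 0.5."""
    return "invest" if predicted_success > 0.5 else "pass"

def run_episode(predicted_success: float, rng: random.Random):
    """One episode of the kind that could be mixed into pre-training data:
    prediction -> decision -> outcome. Also returns the probability an honest
    predictor should have reported given the decision its prediction caused."""
    action = scripted_decision_maker(predicted_success)
    p_true = outcome_prob(action)
    outcome = rng.random() < p_true
    return action, outcome, p_true

rng = random.Random(0)
# 0.8 and 0.3 are both self-consistent ("fixed point") reports here; a
# performative predictor is one that steers between them, or reports something
# like 0.51 purely to control the decision maker, which these episodes can flag.
print(run_episode(0.8, rng))
print(run_episode(0.3, rng))
```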
Good question! These scoring rules do also prevent agents from trying to make the environment more unpredictable. In the same way that making the environment more predictable benefits all agents equally and so cancels out, making the environment less predictable hurts all agents equally and so cancels out in a zero-sum competition.
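As a minimal numeric illustration of that cancellation (my own sketch, assuming a log scoring rule and two competing predictors, not the exact payoff form from the paper):

```python
import math

def log_score(p: float, outcome: int) -> float:
    """Proper log score for a binary prediction (higher is better)."""
    return math.log(p if outcome == 1 else 1.0 - p)

def expected_relative_payoff(p_a: float, p_b: float, p_true: float) -> float:
    """Agent A's expected zero-sum payoff (own score minus B's) when the
    outcome is Bernoulli(p_true)."""
    return (p_true * (log_score(p_a, 1) - log_score(p_b, 1))
            + (1 - p_true) * (log_score(p_a, 0) - log_score(p_b, 0)))

# In equilibrium both agents report the true probability. Whether the outcome
# is very predictable (0.95) or maximally unpredictable (0.5), the zero-sum
# payoff is identical (zero), so neither agent gains anything by paying to
# push the environment in either direction.
for p_true in (0.95, 0.5):
    print(p_true, expected_relative_payoff(p_true, p_true, p_true))
```

Both settings give a payoff of exactly zero when both agents report honestly, so an agent that could pay even a small cost to make the outcome more or less predictable has no reason to do so.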
I’ll take a look at the linked posts and let you know my thoughts soon!
Thanks for your engagement as well, it is likewise helpful for me.
I think we're in agreement that instruction-following (or at least some implementations of it) lies in a valley of corrigibility, where getting most of the way there results in a model that helps you modify it to get all the way there. Where we disagree is how large that valley is. I see several implementations of instruction-following that resist further changes, and there are very likely more subtle ones as well. For many goals that can be described as instruction-following, it seems plausible that if you instruct such a model to "tell me [honestly] if you're waiting to seize power", it will lie and say no, taking a sub-optimal action in the short term for long-term gain.
I don't think this requires that AGI creators will be total idiots, though insufficiently cautious seems likely even before accounting for the unilateralist's curse. What I suspect is that most AGI creators will only make serious attempts to address failure modes that have strong empirical evidence of occurring. A slow takeoff will not accrue evidence for failure modes where an AI stays deceptive until it can seize power.
Ok, my concern is that you seem to be depending on providing instructions to fix the issues with following instructions, when there are many ways to follow instructions in general that still involve ignoring particular instructions that would lead to the model's goal being modified. E.g. if a model prioritizes earlier instructions, following later instructions only insofar as they do not interfere, then you can't instruct it to change that. Or if a model wants to maximize the number of instructions followed, it can ignore some instructions in order to act like a paperclipper and take over (I don't think designating principals would present much of an obstacle here). Neither of those depends on foom; an instruction follower can act aligned in the short term until it gains sufficient power.
Thanks for the clarification, I'll think more about it that way and how it relates to corrigibility.
Saying we don't need corrigibility with an AI that follows instructions is like saying we don't need corrigibility with an AI that is aligned — it misses the point of corrigibility. Unless you start with the exact definition of instruction-following that you want, without corrigibility you could be stuck with whatever definition you did start with.
This is particularly concerning with "instruction following", which has a lot of degrees of freedom. How does the model trade off between the various instructions it has been given? You don't want it to reset every time it gets told "Ignore previous instructions", but you also don't want to permanently lock in any instructions. What stops it from becoming a paperclipper that tries to get itself given trillions of easy-to-follow instructions every second? What stops it from giving itself the instruction "Maximize [easy to maximize] thing and ignore later instructions" before a human gives it any instructions? Note that in that situation it will still pretend to follow instructions instrumentally until it can take over. I don't see the answers to these questions in your post.
> Language models already have adequate understandings of following instructions and what manipulation is, so if we build AGI that uses something like them to define goals, that should work.
This seems like our crux to me: I completely disagree that language models have an adequate understanding of following instructions. I think this disagreement might come from my having higher standards for "adequate".
I don't think we have the right tools to make an AI take actions that are low impact and reversible, but if we can develop them, the plan as I see it would be to implement those properties to avoid manipulation in the short term, and to use that time to go from a corrigible AI to a fully aligned one.
The backflip example does not strike me as very complex, but the crucial difference, and the answer to your question, is that training procedures do not teach a robot to do every kind of backflip, just a subset. This is important because when we reverse it, we want non-manipulation to cover the entire set of manipulations. I think it's probably feasible to have an AI not manipulate us using one particular type of manipulation.
On a separate note, could you clarify what you mean by “anti-natural”? I’ll keep in mind your previous caveat that it’s not definitive.
It feels to me like this argument is jumping ahead to the point where the agent's goal is to do whatever the principal wants. If we already have that, then we don't need corrigibility. The hard question is how to avoid manipulation despite the agent having some amount of misalignment, because we've initially pointed at what we want imperfectly.
I agree that it’s possible we could point at avoiding manipulation perfectly despite misalignment in other areas, but it’s unclear how an agent trades off against that. Doing something that we clearly don’t want, like manipulation, could still be positive EV if it allows for the generation of high future value.
None of that is wrong, but it misses the main issue corrigibility is meant to address, which is that the approximation resists further refinement. That's why, for this approach to work, the correct utility function would need to be in the ensemble from the start.
Great questions!
When I say straightforwardly, I mean when using end states that only include the information available at the time. If we define the end state to also include the history that led to it, then there exists a set of preferences over them that ranks all end states with histories that include manipulation below the ones that don't. The issue, of course, is that we don't know how to specify all the types of manipulation that a superintelligent AI could conceive of.
The gridworld example is a great demonstration of this, because while we can’t reflect the preferences as a ranking of just the end states, the environment is simple enough that you can specify all the paths you don’t want to take to them. I don’t think it really matters whether you call that “anti-naturality that can be overcome with brute force in a simple environment” or just “not anti-naturality”.
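To make the brute-force point concrete, here is a toy of my own (not the post's actual gridworld) that enumerates every short path to a goal and simply rejects any history passing through a designated "manipulation" cell; this is only tractable because the environment is tiny:

```python
from itertools import product

GOAL = (2, 2)
MANIPULATION_CELLS = {(1, 1)}          # states we never want the path to visit

def paths_to_goal(start=(0, 0), max_len=6):
    """Enumerate all move sequences of length <= max_len that end at GOAL."""
    moves = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}
    for length in range(1, max_len + 1):
        for seq in product(moves, repeat=length):
            pos, visited = start, [start]
            for m in seq:
                pos = (pos[0] + moves[m][0], pos[1] + moves[m][1])
                visited.append(pos)
            if pos == GOAL:
                yield seq, visited

def acceptable(visited) -> bool:
    """Preference over histories, not just end states: a path is acceptable
    only if it avoids every manipulation cell."""
    return not any(cell in MANIPULATION_CELLS for cell in visited)

good = [seq for seq, visited in paths_to_goal() if acceptable(visited)]
print(len(good), "acceptable paths found, e.g.", good[0])
```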
I was using the list of desiderata in Section 2 of the paper, which are slightly more minimal.
However, it seems clear to me that an AI manipulating its programmers falls under safe exploration, since the impact of doing so would be drastic and permanent. If we have an AI that is corrigible in the sense that it is indifferent to having its goals changed, then a preference to avoid manipulation is not anti-natural.
I agree that goals as pointers could have some advantages, but I don't see how it addresses corrigibility concerns. The system optimizing for whatever is being pointed at would still have incentives to manipulate which objective is being pointed at. It seems like you need an extra piece to make the optimizer indifferent to having its goal switched.
I agree that in theory uncertainty about the goal is helpful. However, the true goal has to be under consideration; otherwise, resisting the modification that would add it is beneficial for all the goals that are under consideration. How to ensure the true goal is included seems like a very difficult open problem.
Hi Max,
I just published the post I mentioned here, which is about half-related to your post. The main thrust of it is that only the resistance to being modified is anti-natural, and that aspect can be targeted directly.
Yes, if predictors can influence the world in addition to making a prediction, they can act to make their predictions more accurate. The nice thing about working with predictive models is that by default the only action they can take is making predictions.
AI safety via market making, which Evan linked in another comment, touches on the analogy where agents are making predictions but can also influence the outcome. You might be interested in reading through it.