The Constitutional AI paper, in a sense, shows that a smart alien with access to an RLHFed helpful language model can figure out how to write text according to a set of human-defined rules. It scares me a bit that this works well, and I worry that this sort of self-improvement is going to be a major source of capabilities progress going forward.
Tom Shlomi
Talking about what a language model “knows” feels confused. There’s a big distinction between what a language model can tell you if you ask it directly, what it can tell you if you ask it with some clever prompting, and what a smart alien could tell you after only interacting with that model. A moderately smart alien that could interact with GPT-3 could correctly answer far more questions than GPT-3 can even with any amount of clever prompting.
As a sort-of normative realist wagerer (I used to describe myself that way, and still have mostly the same views, but no longer consider it a good way to describe myself), I really enjoyed this post, but I think it misses the reasons the wager seems attractive to me.
To start, I don’t think of the wager as being “if normative realism is true, things matter more, so I should act as if I’m a normative realist”, but as being “unless normative realism is true, I don’t see how I could possibly determine what matters, and so I should act as if I’m a normative realist”.
I’m also strongly opposed to using Martha-type dilemmas to reason about meta-ethics. To start, this assumes a sort of meta-normative realism where there is an objective truth to meta-ethics, which is highly non-obvious to me. Secondly, I don’t buy the validity of thought experiments that go “assume X is true” rather than “assume you observe X”, especially when the metaphysical possibility of X is questionable. Finally, I think that my actual answer to any deal of the form “if X: I get $100; else: a hundred children burned alive” is to reject it, even if X is “2 + 2 = 4″, and so my response to such deals has little bearing on my evaluation of X.
I’ll admit that I don’t think I understand ethical nihilists. The analogy of the Galumphians was very helpful, but I still expect that nihilists have something in their ontology which I would label as a should. I’ll note that I don’t associate ethical nihilism with any sort of gloomy connotations; I’m just confused by it.
I really love this idea! Thanks for sharing this, I’m excited to try Calibrate.
How can list sorting be O(n)? There are n! possible orderings of a list, which means that no comparison-based sorting algorithm can be faster than O(log(n!)) = O(n*log(n)).
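For what it’s worth, here is the standard back-of-the-envelope bound behind that last step (nothing here is specific to this thread, just Stirling-style bounds on n!):

```latex
% Sandwich bound on log(n!): the top n/2 factors of n! are each at least n/2,
% and all n factors are at most n.
\[
  \left(\tfrac{n}{2}\right)^{n/2} \;\le\; n! \;\le\; n^{n}
  \quad\Longrightarrow\quad
  \tfrac{n}{2}\log\tfrac{n}{2} \;\le\; \log(n!) \;\le\; n\log n,
\]
\[
  \text{so } \log(n!) = \Theta(n\log n).
\]
```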
Thanks for linking the post! I quite liked it, and I agree that computational complexity doesn’t pose a challenge to general intelligence. I do want to dispute your notion that “if you hear that a problem is in a certain complexity class, that is approximately zero evidence of any conclusion drawn from it”. The world is filled with evidence, and it’s unlikely that closely related concepts give approximately zero evidence for each other unless they are uncorrelated or there are adversarial processes present. Hearing that list-sorting is O(n*log(n)) is pretty strong evidence that it’s easy to do, and hearing that simulating quantum mechanics is not in P is pretty strong evidence that it’s hard to do. Sure, there are lots of exceptions, but computational complexity is in fact a decent heuristic, especially if you go with average-case complexity of an approximation, rather than worst-case complexity of an exact answer.
I’m definitely only talking about probabilities in the range of >90%. >50% is justifiable without a strong argument for the disjunctivity of doom.
I like the self-driving car analogy, and I do think the probability in 2015 that a self-driving car would ever kill someone was between 50% and 95% (mostly because of a >5% chance that AGI comes before self-driving cars).
There’s still the problem of successor agents and self-modifying agents, where you need to set up incentives to create successor agents with the same utility functions and to not strategically self-modify, and I think a solution to that would probably also work as a solution to normal dishonesty.
I do expect that in a case where agents can also see each other’s histories, we can make bargaining go well with the bargaining theory we know, given that the agents try to bargain well (there are of course possible agents which don’t try to cooperate well).
I’m really glad that this post is addressing the disjunctivity of AI doom, as my impression is that it is more of a crux than any of the reasons in https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities.
Still, I feel like this post doesn’t give a good argument for disjunctivity. To show that a scenario with no outside view is likely, it takes more than just describing a model which is internally disjunctive. There needs to be some reason why we should strongly expect that there are no external variables that could cause the model not to apply.
Some examples of these, in addition to the competence of humanity, are that deep learning could hit a wall for decades, Moore’s Law could come to a halt, some anti-tech regulation could cripple AI research, or alignment could turn out to be easy (which itself contains several disjunctive possibilities). I haven’t thought about these, and don’t claim that any of them are likely, but the possibility of these or other unknown factors invalidating the model prevents me from updating to a very high P(doom). Some of this comes from it just being a toy model, but adding more detail to the model isn’t enough to notably reduce the possibility of the model being wrong from unconsidered factors.
A statement I’m very confident in is that no perpetual motion machines will be developed in the next century. I could make some disjunctive list of potential failure modes a perpetual motion machine could encounter, and thus conclude that their development is unlikely, but this wouldn’t describe the actual reason a perpetual motion machine is unlikely. The actual reason is that I’m aware of certain laws of physics which prevent any perpetual motion machines from working, including ones with mechanisms wildly beyond my imagination. The outside view is another tool I can use to be very confident: I’m very confident that the next flight I take won’t crash, not because of my model of planes, but because any crash scenario with non-negligible probability would have caused some of the millions of commercial flights flown every year to crash, and that hasn’t happened. Avoiding AGI doom is not physically impossible and there is no outside view against it, and without some similarly compelling reason I can’t see how very high P(doom) can be justified.
You’re welcome!
The main one that comes to mind for me is that there are many possible solutions/equilibria/policy-sets that get to Pareto-optimal outcomes, but they differ in how good they are for different players. So it’s not enough that players be aware of a solution, and it’s not even enough that there be one solution which stands out as extra salient, because players will be hoping to achieve a solution that is more favorable to them and might do various crazy things to try to achieve that.
This seems like it’s solved by just not letting your opponent get more utility than they would under the bargaining system you think is fair, no matter what crazy things they do? If there is a bargaining solution which stands out, then agents which strategize over which solution they propose will choose the salient one, since they expect their partner to do the same. I might be misunderstanding what you’re getting at, though.
What would you say are the remaining problems that need to be solved, if we assume everyone has a way to accurately estimate everyone else’s utility function?
Finding something like a universally canonical bargaining solution would be great, as it would allow agents with knowledge of each other’s utility functions to achieve Pareto optimality. I think it’s not fully disentangleable from the question of incentivizing honesty, as I could imagine that there is some otherwise great bargaining solution that turns out to be unfixably vulnerable to dishonesty. That said, I do have an intuition that probably most reasonable bargaining solutions thought up in good faith are similar enough to each other that agents using different ones wouldn’t end up too far from the Pareto frontier, and so I’m not sure how important it is.
I think my answer is probably figuring out how to deal with strategic successor agents and dealing with counterfactuals. The successor agent problem is similar to the problem of lying about utility functions: if you’re dealing with a successor agent (or an agent which has modified its utility function), you need to bargain with it as though it had the utility function of its creator (or its original utility function), and figuring out how to deal with uncertainty over how an agent’s values have been modified by other agents or its past self seems important.
Bargaining solutions usually have the property that you can’t naively apply them to subgames, as different agents might value the subgames more or less, and an agent might be happy to accept a locally unfair deal for a better deal in another subgame. This is fine for sequential or simultaneous subgames, but some subgames only happen with some probability. Determining what would happen in counterfactual subgames is important for determining the fair solution in the subgames that do occur, but verifying what would counterfactually happen is often quite difficult.
In some sense, these are just subproblems of incentivizing honesty generally. I think the problem of incentivizing honesty is the overwhelming importance-weighted bulk of the remaining open problems in bargaining theory (relative to what I know), and it’s hard for me to think of an important problem that isn’t entangled with that in some way.
It’s a combination of evidential reasoning and norm-setting. If you’re playing the ultimatum game over $10 with a similarly-reasoning opponent, then deciding to only accept an (8, 2) split mostly won’t increase the chance they give in, it will increase the chance that they also only accept an (8, 2) split, and so you’ll end up with $2 in expectation. The point of an idea of fairness is that, at least so long as there’s common knowledge of no hidden information, both players should agree on the fair split. So if, while bargaining with a similarly-reasoning opponent, you decide to only accept fair offers, this increases the chance that your opponent only accepts fair offers, and so you should end up agreeing, modulo factors which cause disagreement on what is fair.
Similarly, fair bargaining is a good norm to set, as, once it is a norm, it allows people to trade on/close to the Pareto frontier while disincentivizing any attempts to extort unfair bargains.
Thanks. I know it’s that algorithm, I just want a more detailed and comprehensive description of it, so I can look at the whole thing and understand the problems with it that remain.
It’s really a class of algorithms, depending on how your opponent bargains, such that if the fair bargain (by your standard of fairness) gives X utility to you and Y utility to your partner, then you refuse to accept any other solution which gives your partner at least Y utility in expectation. So if they give you a take-it-or-leave-it offer which gives you positive utility and them Y' > Y utility, then you accept it with probability Y/Y' - ε, such that their expected value from giving you that offer is slightly less than Y. If they have a different standard of fairness which gives you X' utility and them Y' utility but also use Adabarian bargaining, then you should agree to a bargain which gives you X' - ε utility and them Y - ε utility (this is always possible via randomizing over their bargaining solution, your bargaining solution, and not trading, so long as all the bargaining solutions give positive utility to everyone).
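As a concrete illustration, here is a minimal sketch of the take-it-or-leave-it case (the function name, variable names, and the fixed ε are my own choices, not anything from the linked posts):

```python
# Responding to a take-it-or-leave-it offer under the scheme described above.
# Y   = what your partner would get under the bargain you consider fair
# Y2  = what your partner gets under the offer they actually made
# eps = small margin, so that making unfair offers is strictly worse for them
def acceptance_probability(Y: float, Y2: float, eps: float = 0.01) -> float:
    if Y2 <= Y:
        return 1.0  # the offer gives them no more than the fair bargain, so just accept
    return max(0.0, Y / Y2 - eps)  # accept just rarely enough that they expect slightly less than Y

# Example: the fair bargain gives them Y = 10, but their offer gives them Y2 = 15.
p = acceptance_probability(10.0, 15.0)
print(p, p * 15.0)  # ~0.657 and ~9.85: their expected payoff is below 10, so the unfair offer doesn't pay
```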
“Any Pareto bargaining method is vulnerable...” Interesting, thanks! I take it there is a proof somewhere of this? Where can I read about this? What is a pareto bargaining method?
Sorry, that should actually be Pareto bargaining solution, which is just a solution which ends up on the Pareto frontier. In The Pareto World Liars Prospers is a good explainer, and https://www.jstor.org/stable/1914235 shows a general result that every bargaining solution which is invulnerable to strategic dishonesty is equivalent to a lottery over dictatorships (where one person gets to choose their ideal solution) and tuple methods (where the possible outcomes are restricted to a set of two).
I feel like arguably “My bargaining protocol works great except that it incentivises people to try to fool each other about what their utility function is” …. is sorta like saying “my utopian legal system solves all social problems except for the one where people in power are incentivised to cheat/defect/abuse their authority.”
I agree with this, but also it would be pretty great to have a legal system which would work if people in power didn’t abuse their authority; I don’t think any current legal system even has that. Designing methods robust to strategic manipulation is an important part of the problem, but not the only part, and I don’t think it’s unreasonable to focus on other parts, especially since there are a lot of scenarios where approximating your partner’s utility function is possible. In particular, if monetary value can be assigned to everything being bargained over, then approximating utility as money is usually reasonable.
I believe it’s the algorithm from https://www.lesswrong.com/posts/z2YwmzuT7nWx62Kfh/cooperating-with-agents-with-different-ideas-of-fairness. Basically, if you’re offered an unfair deal (and the other trader isn’t willing to renegotiate), you should accept the trade with a probability just low enough that the other trader does worse in expectation than if they offered a fair trade. For example, if you think that a fair deal would provide $10 to both players over not trading and the other trader offers a deal where they get $15 and you get $4, then you should accept with probability just under 2/3 (since 2/3 of $15 is $10), so that in expectation they get less than if they offered a fair trade.
Any Pareto bargaining method is vulnerable to lying about utility functions, and so to have a chance at bargaining fairly, it’s necessary to have some idea of what your partner’s utility function is. I don’t think that using this method for dealing with unfair trades is especially vulnerable to deception, though possibly there’s some better way to deal with uncertainty over your partner’s utility function.
But because Eliezer would never precommit to probably turn down a rock with an un-Shapley offer painted on its front (because non-agents bearing fixed offers created ex nihilo cannot be deterred or made less likely through any precommitment) there’s always some state for Bot to stumble into in its path of reflection and self-modification where Bot comes out on top.
This is exactly why Eliezer (and I) would turn down a rock with an unfair offer. Sure, there’s some tiny chance that it was indeed created ex nihilo, but it’s far more likely that it was produced by some process that deliberately tried to hide that it produced the offer.

Moreover, if an as-of-yet ignorant Bot has some premonition that learning more about Eliezer will make Bot encounter truths he’d rather not encounter, Bot can self-modify on the basis of that premonition, before risking reading up on Eliezer on LessWrong.
This depends on the basis of that premonition. At every point, Bot is considering the effect of its commitment on some space of possible agents, which gets narrowed down whenever Bot learns more about Eliezer. If Bot knows everything about Eliezer when it makes the commitment, then of course Eliezer should not give in. If Bot knows some evidence, then actual-Eliezer is essentially in a cooperation-dilemma with the possible agents that Bot thinks Eliezer could be. Then, Eliezer should not give in if he thinks that he can logically cooperate with enough of the possible agents to make Bot’s commitment unwise.
This isn’t true all of the time; I expect that a North Korean version of Eliezer would give in to threats, since he would be unable to logically cooperate with enough of the population to make the government see threat-making as pointless. Still, I do expect that the situations which Eliezer (and I, for that matter) will encounter in the future won’t be like this.
Context windows could make the claim from the post correct. Since the simulator can only consider a bounded amount of evidence at once, its P[Waluigi] has a positive lower bound. Meanwhile, it takes much less evidence than fits in the context window to bring its P[Luigi] down to effectively 0.
Imagine that, in your example, once Waluigi outputs B it will always continue outputting B (if he’s already revealed to be Waluigi, there’s no point in acting like Luigi). If there’s a context window of 10, then the simulator’s probability of Waluigi never goes below 1/1025, while Luigi’s probability permanently goes to 0 once B is outputted, and so the simulator is guaranteed to eventually get stuck at Waluigi.
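Here’s a toy simulation of that dynamic (entirely my own construction: the uniform prior, the assumption that a hidden Waluigi outputs A or B with probability 1/2 each, and the window length of 10 are chosen to match the arithmetic above):

```python
import random

W = 10          # context window length
PRIOR_W = 0.5   # prior probability of Waluigi (Luigi gets the rest)

def likelihood_luigi(window):
    # Luigi always outputs "A".
    return 1.0 if all(t == "A" for t in window) else 0.0

def likelihood_waluigi(window):
    # A hidden Waluigi outputs "A" or "B" with probability 1/2 each;
    # once it outputs a "B" it is revealed and outputs "B" forever.
    first_b = next((i for i, t in enumerate(window) if t == "B"), None)
    if first_b is None:
        return 0.5 ** len(window)      # hidden for the whole window
    if "A" in window[first_b:]:
        return 0.0                     # an "A" after a "B" is impossible
    return 0.5 ** (first_b + 1)        # hidden up to and including the first "B"

def p_waluigi(window):
    lw = PRIOR_W * likelihood_waluigi(window)
    ll = (1.0 - PRIOR_W) * likelihood_luigi(window)
    return lw / (lw + ll)

def next_token(window, p_w):
    # Posterior-predictive sampling: Luigi always says "A"; a hidden Waluigi
    # says "B" with probability 1/2; a revealed Waluigi always says "B".
    p_b = p_w * (1.0 if "B" in window else 0.5)
    return "B" if random.random() < p_b else "A"

tokens = []
first_b_step = None
for step in range(10_000):
    window = tokens[-W:]
    tok = next_token(window, p_waluigi(window))
    tokens.append(tok)
    if tok == "B" and first_b_step is None:
        first_b_step = step

# While the window is all A's, P[Waluigi] never drops below 1/1025, so a "B"
# almost surely gets sampled eventually; once it does, P[Luigi] hits 0 and the
# output locks into "B" forever.
print("first B at step:", first_b_step)
print("last 20 tokens:", "".join(tokens[-20:]))
```

On many runs the first B shows up early (while the window is still short and P[Waluigi] is comparatively large); either way, once it appears the tail of the output is all B’s.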
I expect this is true for most imperfections that simulators can have; it’s harder to keep track of a bunch of small updates for X over Y than it is for one big update for Y over X.