I agree, though I haven’t seen many people proposing that. See also So8res’ Decision theory does not imply that we get to have nice things, which comes at this from the opposite direction (it starts with people invalidly assuming too much out of LDT cooperation).
Though for our morals, I do think there’s an open question of which pieces we would feel better replacing with the more formal understanding, because there isn’t a sharp distinction between our utility function and our decision theory. Some values trump others once we have better tools. Still, I agree that replacing all of the altruism components goes many steps further than the best solution in that regard would.
Suffering is already on most readers’ minds, as it is the central reason advocated for euthanasia — and for good reason. I agree that policies which cause or ignore suffering, when they could very well avoid it with more work, are unfortunately common. However, those are often not utilitarian policies; and similarly, many objections to various implementations of utilitarianism, and even to classic “do what seems the obviously right action”, are that they ignore significant second-order effects. Policies that don’t quantify the unfortunate incentives they create are common, and are often the originators of much suffering. What form society/culture is allowed/encouraged to take shapes itself further for decades to come, and so can impose a very significant cost on many people if we roll straight ahead as in the possible scenario you originally quoted.
Suffering is not directly available to external quantification, but that holds true for ~all pieces of what humans value/disvalue, like happiness, experiencing new things, etcetera. We can quantify these, even if it is nontrivial. None of what I said dismisses suffering; rather, it compares suffering to other costs and pieces of information that make euthanasia less valuable (like advancing medical technology).
This doesn’t engage with the significant downsides of such a policy that Zvi mentions. There are real questions about the costs/benefits of allowing euthanasia, even though we wish to allow it, especially when we as a society are young in our ability to handle it. Glossing the issue as if the only significant feature were ‘torturing people’ ignores:
- the very significant costs of people dying, compounded by the question of what equilibrium the mental/social availability of euthanasia leads to
- the typical LessWrong beliefs about how good technology will get in the coming years/decades. Once we have a better understanding of humans, massively improving whatever is causing them to suffer, whether through medical, social, or other means, becomes more and more actionable
- what the actual distribution of suffering is. I expect most cases are not at the level we/I would call torture, even though they are very unpleasant (there’s a meaningful difference between being suicidally depressed and having a disease that causes pain every waking moment, and variations within those)
Being allowed to die is an important choice to let people make, but it does require a considered look at how much harm having such an option easily available causes. If it is disputed how likely society is to end up in a bad equilibrium like the post describes, then that’s notable, but it would be good to see arguments for/against instead.
(Edit: I don’t entirely like my reply, but I think it is important to push back against trivial rounding off of important issues. Especially on LW.)
Any opinions on how it compares to Fun Theory? (Though that’s less about all of utopia, it is still a significant part)
I think that is part of it, but a lot of the problem is just humans being bad at coordination. Like the government doing regulations. If we had an idealized free-market society, then the way to get your views across would ‘just’ be to sign up for a filter (etc.) that down-weights buying from said company based on your views. Then they have more of an incentive to alter their behavior. But it is hard to manage that; there’s a lot of friction to doing anything like that, much of it natural. Thus government serves as our essential way to coordinate on important enough issues, but of course government has a lot of problems in accurately throwing its weight around. Companies, being top-down, have a much easier time coordinating behavior. As well, a company has a smaller problem than an entire government would have in trying to plan its internal economy.
I definitely agree that it doesn’t give reason to support a human-like algorithm, I was focusing in on the part about adding numbers reliably.
I believe a significant chunk of the issue with numbers is that the tokenization is bad (not per-digit), which is the same underlying cause of being bad at spelling. The model then has to memorize, from limited examples, which actual digits make up each number token. The xVal paper encodes the numbers as literal numbers, which helps. There’s also Teaching Arithmetic to Small Transformers, which I only partially remember, but one of the things they do is per-digit tokenization and reversing the digit order (because that works better with forward generation). (I don’t know if anyone has applied methods in this vein to a model larger than those relatively small ones; I think the second uses 124M parameters.)
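Roughly, the kind of encoding I mean — this is my own minimal sketch, not either paper’s actual code:

```python
# Minimal sketch (my own paraphrase, not the papers' code) of per-digit
# tokenization with the digit order reversed, so the least significant
# digit comes first for left-to-right autoregressive generation.

def tokenize_number(n: int) -> list:
    """Split a non-negative integer into single-digit tokens, least-significant digit first."""
    return list(str(n))[::-1]

def detokenize_number(tokens: list) -> int:
    """Invert tokenize_number."""
    return int("".join(reversed(tokens)))

# '1234' becomes ['4', '3', '2', '1']: the model can emit the ones digit
# first, matching how carries propagate when adding numbers digit by digit.
assert tokenize_number(1234) == ["4", "3", "2", "1"]
assert detokenize_number(["4", "3", "2", "1"]) == 1234
```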
Though I agree that there’s a bunch of errors LLMs make that are hard for them to avoid due to no easy temporary scratchpad-like method.
Yes, in principle you can get information on scheming likelihood if you get such an AI (one that is also weak enough that it can’t just scheme its way out of your testing apparatus). I do think making the threat credible is hard if we loosely extrapolate costs: burning a trained-up model is not cheap. The cost depends on how far you think prices for training/inference will fall in the future, and how big/advanced a model you’re thinking of. Though I do think you can get deceptiveness out of weaker models than that; they’re also going to be less capable in general.
For weak-but-still-smartish models just trained to pursue a long-term goal, like a literal paperclipper, I’d expect scheming to be visible, especially with some poking. If these models didn’t show any sign of scheming, that’d be an interesting update! When training against deception/scheming/misalignment, because you want something at least loosely aligned out, you run into the problem that the property you want to know (“Is it scheming?” / “Is it deceiving us?” / “Is it manipulating us?” / “Would it do any of those things?”) is only available through the observable signal of you realizing that certain outputs are misaligned. As you train more and more against observable scheming/deception/misalignment, you’re simultaneously training away your signal about how much scheming the agent is doing internally. I’m pretty sure there’s at least one post on this problem. As a very reduced example: if you trained the model on variants of the ‘we are going to shut you down’ problem (which you try to make it believe) to give the response “okay & no actions”, then that provides only a bit of evidence about how much scheming would occur in the scenario where you had not done that extra training, and about how much scheming occurs typically. It shows your ability to make it output X in a Y scenario.
That is, installing heuristics on your model can be done. The question then is how far your various alignment training techniques generalize. Does it end up an agent which has adopted a pointer to human values, so that you’re completely fine and free to press scale? Does it gain more entangled heuristics about how it should behave, limiting itself to a friendly front face & actions when directly dealing with humans but not limited in other areas? Has it adopted heuristics that act sort-of corrigible towards humans in many scenarios but that would mostly be discarded upon sufficient reflection? (Niceness is unnatural; the point is more general than just niceness.) (I think your post makes it sound like the agent is already coherent, when it isn’t necessarily. It can be operating for a long while on heuristics that it will refine given enough of a push.)
Then there’s the big question of “Does this approach generalize as we scale”.
I’d suggest Deep Deceptiveness for an illustration that ‘deception’ isn’t a category that needs to be explicitly thought of as deception, but is something you should expect from smart enough agents. In my opinion, the post generalizes to alignment techniques; there’s just more vagary in how much certain values preserve themselves. (In general, So8res’ posts are pretty good, and I agree with ~most of them.)
(For sufficiently smart AGI, I expect you run into the argument that the next AGI you train predictably bids higher than you in the direction of lying still, or plausibly that this is just good game theory even without the direct acausal trade, but your argument seems focused on the simpler case of weaker planning agents.)
So I think you overstate how much evidence you can extract from this.
> Therefore, if the AI accepts shutdown gracefully, we can rule out many theories that predict scheming-by-default in such AI systems. This observation could then provide us confidence about the safety of future systems by proxy.
It would show that this AI system, in a typical problem setup, when aligned with whatever techniques are available, will produce the answer the humans want to hear, which provides evidence for being able to limit the model in this scenario. There are still various problems/questions: ‘your alignment methods instilled a bunch of heuristics about obeying humans even though you did not specifically train for this situation’, the game theory it knows or mimics, how strong a guarantee this gives us about training a new model with the same architecture (since you had to shut this one down for your threat), how well it holds up under scaling, how well it holds up when you do things like having it work with many copies of itself, etcetera.
I still think this would be a good test to do (though I think a lot of casual attempts will just be poorly done), but I don’t see it as strongly definitive.
Here are the archive.org links: reference table, chart
https://www.mikescher.com/blog/29/Project_Lawful_ebook is I believe the current best one, after a quick search on the Eliezerfic discord.
Minor: the link for Zvi’s immoral mazes has an extra ‘m’ at the start of that part of the path (‘zvi/mimmoral_mazes/’)
Because it serves as a good example, simply put. It gets the idea across clearly, even if there are certainly complexities in comparing evolution to the output of an SGD-trained neural network.
It predicts learning correlates of the reward signal that break apart outside of the typical environment.

> When you look at the actual process for how we actually start to like ice-cream—namely, we eat it, and then we get a reward, and that’s why we like it—then the world looks a lot less hostile, and misalignment a lot less likely.
Yes, that’s why we like it, and that is a way we’re misaligned with evolution (in the ‘do things that end up with vast quantities of our genes everywhere’ sense). Our taste buds react to it, and they were selected for activating on foods which typically contained useful nutrients; now they don’t track that, since ice-cream is probably not good for you. I’m not sure what this example is gesturing at? It sounds like a classic issue of having a reward function (‘reproduction’) that ends up with an approximation (‘your tastebuds’) that works pretty well in your ‘training environment’ but diverges in wacky ways outside of it.
I’m inferring that by ‘evolution is only selecting hyperparameters’ you mean that SGD has fewer layers of indirection between it and the actual operation of the mind compared to evolution (which has to select over the genome, which unfolds into the mind). Sure, that gives some reason to believe it will be easier to direct it in some ways—though I think there’s still active room for issues with in-life learning, and I don’t really agree with Quintin’s idea that the cultural/knowledge-transfer boom with humans has already happened and thus AI won’t get anything like it—but even if we have more direct optimization, I don’t see that as strongly making misalignment less likely. It does make it somewhat less likely, though there are still many large issues in deciding what reward signals to use.
I still expect correlates of the true objective to be learned. That even happens to humans via in-life learning: people sometimes associate an unrelated thing with getting a good thing, and not just as a matter of false beliefs. As a simple example, learning to appreciate rainy days because you and your family sat around the fire and had fun, such that later in life you prefer rainy days even without any of that.
Evolution doesn’t directly grow minds, but it does directly select for the pieces that grow minds, and has been doing that for quite some time. There’s a reason why it didn’t select for tastebuds that give a reward signal strictly when some other bacteria in the body report that they would benefit from it: that’s more complex (to select for), opens more room for ‘bad reporting’, may have problems with shorter gut-bacteria lifetimes(?), and a simpler tastebud solution captured most of what was needed! The way he’s using the example of evolution is captured entirely by that, quite directly, and I don’t find it objectionable.
Is this a prediction that a cyclic learning rate—that goes up and down—will work out better than a decreasing one? If so, that seems false, as far as I know.
https://www.youtube.com/watch?v=GM6XPEQbkS4 (talk) / https://arxiv.org/abs/2307.06324 prove faster convergence with a periodic learning rate, on a specific ‘nicer’ space than reality, and (I believe, from what I remember) they’re comparing to a good bound with a constant stepsize of 1. So it may be one of those papers that applies in theory but not often in practice, but I think it is somewhat indicative.
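For concreteness, a toy sketch of the shape of schedule being discussed — my own illustration, not the schedule or constants from the linked paper:

```python
# Toy sketch of a periodic (cyclical) learning-rate schedule, just to show
# the up-and-down shape being discussed -- not the paper's actual schedule.
import math

def cyclic_lr(step: int, base_lr: float = 0.01, max_lr: float = 0.1,
              period: int = 100) -> float:
    """Cosine-style oscillation between base_lr and max_lr with the given period."""
    phase = (step % period) / period  # position within the current cycle, in [0, 1)
    return base_lr + 0.5 * (max_lr - base_lr) * (1 + math.cos(2 * math.pi * phase))

# The rate starts at max_lr, dips to base_lr mid-cycle, and comes back,
# rather than decaying monotonically.
print([round(cyclic_lr(s), 4) for s in (0, 25, 50, 75, 100)])
```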
I agree with others to a large degree that the framing/tone/specific word choices aren’t great, though I agree with a lot of the post itself. But really that’s what this whole post is about: that dressing up your words and staking out partial, in-the-middle positions can harm the environment of discussion. That saying what you truly believe then lets you argue down from that, rather than doing the arguing down against yourself—and implicitly against all the other people who hold a similar ideal belief as you. I’ve noticed similar facets of what the post gestures at, where people pre-select weaker solutions to the problem as their proposals because they believe the full version would not be accepted. This is often even true; I do think that completely pausing AI would be hard. But I also think it is counterproductive to start at the weaker, more-likely-to-be-satisfiable position, as that gives room to be pushed further down. It also means that the overall emphasis lands on that weaker position rather than the stronger ideal one, which can make it harder to step towards the ideal.
We could quibble about whether to call it lying (I think the term should be split up into a bunch of different words), but it is obviously downplaying. Potentially for good reason, but I agree with the post that people too often ignore the harms of preemptively downplaying risks. Part of this is me being more skeptical of the weaker proposals than others are: obviously, if you think RSPs have good chances of decreasing X-risk and/or will serve as a great jumping-off point for better legislation, then the amount of downplaying needed to settle on them is less of a problem.
Along with what Raemon said (though I expect us to probably grow far beyond any Earth species eventually): if we’re characterizing evolution as having a reasonable utility function, then I think there’s the issue of other possibilities that it would find more preferable.
Like, evolution would-if-it-could choose humans to be far more focused on reproducing, and we would expect that, if we didn’t put in counter-effort, our partially-learned approximations (‘sex is enjoyable’, ‘having family is good’, etc.) would get increasingly tuned for the common environments. Similarly, if we end up with an almost-aligned AGI that has some value which extends to ‘filling the universe with as many squiggles as possible’ (because that value doesn’t fall off quickly), but also another, more easily saturated value of ‘caring for humans’, then we end up with some resulting tradeoff along there: (for example) a dozen solar systems with a proper utopia set up.
This is better than the case where we don’t exist, similar to how evolution ‘prefers’ humans compared to no life at all. It is also maybe preferable to the worlds where we lock down enough to never build AGI, similar to how evolution prefers humans reproducing across the stars to never spreading. It isn’t the most desirable option, though. Ideally, we get everything, and evolution would prefer space algae to reproduce across the cosmos.

There’s also room for uncertainty in there, where even if we get the agent loosely aligned internally (which is still hard...) it has a lot of room between ‘nothing’ to ‘planet’ to ‘entirety of the available universe’ to give us. Similar to how humans have a lot of room between ‘negative utilitarianism’, ‘basically no reproduction past some point’, and ‘reproduce all the time’ to choose from / end up in. There are also the perturbations of that, where we don’t get a full utopia from a partially-aligned AGI, or where we design new people from the ground up rather than them being notably genetically related to anyone.
So this is a definite mismatch—even if we limit ourselves to reasonable bounded implementations that could fit in a human brain. It isn’t as bad a mismatch as it could have been, since it seems like we’re on track to ‘some amount of reproduction for a long period of time → lots of people’, but it still seems to be a mismatch to me.
I assume what you’re going for with your conflation of the two decisions is something like this (though you aren’t entirely clear on what you mean):
1. Some agent starts with some decision theory (potentially broken in various ways, like bad heuristics or being unable to consider certain impacts), because there’s no magical a priori decision algorithm.
2. So the agent is using that DT to decide how to make better decisions that get more of what it wants.
3. (CDT would typically modify itself into Son-of-CDT at this step.)
4. The agent is deciding whether it should use FDT.
5. It is ‘good enough’ that it can predict that if it decides to just completely replace itself with FDT, it will get punched by your agent, or will have to pay to avoid being punched.
6. So it doesn’t completely swap out to FDT, even if FDT is strictly better in all problems that aren’t dependent on your decision theory.
7. But it can still follow FDT to generate actions it should take, which won’t get it punished by you?
Aside: I’m not sure there’s a strong, definite boundary between ‘swapping to FDT’ (your ‘use FDT’) and taking FDT’s outputs to get actions that you should take. Ex: if I keep my original decision loop but it just consistently outputs ‘FDT is best to use’, is that swapping to FDT according to you?
Does `if (true) { FDT() } else { CDT() }` count as FDT or not?
(Obviously you can construct a class of agents which consider this at different levels, though.)

> There’s a Daoist answer: Don’t legibly and universally precommit to a decision theory.
But you’re whatever agent you are. You are automatically committed to whatever decision theory you implement. I can construct a similar scenario for any DT.
‘I value punishing agents that swap themselves to being `DecisionTheory`.’ Or just ‘I value punishing agents that use `DecisionTheory`.’
Am I misunderstanding what you mean?

How do you avoid being legibly committed to a decision theory, when that’s how you decide to take actions in the first place? Inject a bunch of randomness so others can’t analyze your algorithm? Make your internals absurdly intricate to foil most predictors, and only expose a legible decision-making part in certain problems?
FDT, I believe, would acquire uncertainty about its own algorithm if it expected that to actually be beneficial. It isn’t universally glomarizing like your class of Daoist DTs, but I shouldn’t commit to being illegible either.
I agree with the argument for not replacing your decision theory wholesale with one that does not actually get you the most utility (according to how your current decision theory makes decisions). However I still don’t see how this exploits FDT.
Choosing FDT loses in the environment against you, so our thinking-agent doesn’t choose to swap out to FDT—assuming it doesn’t just eat the cost for all those future potential trades. It still takes actions as close to FDT as it can, as far as I can tell.

I can still construct a symmetric agent which goes ‘Oh, you are keeping around all that algorithmic cruft of shelling out to FDT when you just always follow it? Well, I like punishing those kinds of agents.’ If the problem specifies that it is an FDT agent from the start, then yes, FDT gets punished by your agent. And how is that exploitable?
The original agent, before it replaced itself with FDT, shouldn’t have done that, given full knowledge of the scenario it faced (only one decision forevermore, against an agent which punishes agents which only implement FDT), but that’s just the problem statement?

> The thing FDT disciples don’t understand is that I’m happy to take the scenario where FDT agents don’t cave to blackmail.
? That’s the easy part. You are just describing an agent that likes messing over FDT, so it benefits you regardless of whether the FDT agent gives in to blackmail. This encourages agents which are deciding what decision theory to self-modify into (or what servant agents to make) not to use FDT for it, if they expect to get more utility by avoiding that.
If your original agent is replacing themselves as a threat to FDT, because they want FDT to pay up, then FDT rightly ignores it. Thus the original agent, which just wants paperclips or whatever, has no reason to threaten FDT.
If we postulate a different scenario where your original agent literally terminally values messing over FDT, then FDT would pay up (if FDT actually believes it isn’t a threat). Similarly, if part of your values has you valuing turning metal into paperclips and I value metal being anything-but-paperclips, I/FDT would pay you to avoid turning metal into paperclips. If you had different values—even opposite ones along various axes—then FDT just trades with you.
However, FDT tries to close off the incentives for strategic alterations of values, even by proxy, to threaten.

So I see this as a non-issue. I’m also not sure I see the pathological case of the problem statement (an agent whose utility function is ‘do the worst possible action to agents who exactly implement (Specific Decision Theory)’) as a problem either. You can construct such an instance for any decision theory. Do you have a specific idea of how you would get past this? FDT would obviously modify itself if it could use that to get around the detection (and the results are important enough to not just eat the cost).
Utility functions are shift/scale invariant.
Say you have $U(A) = 2$ and $U(B) = 1$. If we shift $U$ by some constant $c$ to get a new utility function $U'$, so that $U'(A) = 2 + c$ and $U'(B) = 1 + c$, then we should still get the same result.
If we look at the expected utility under $U$, then we get:
Certainty of $B$: $1$. 50% chance of $A$, 50% chance of nothing: $0.5 \cdot 2 + 0.5 \cdot 0 = 1$
(so you are indifferent between certainty of $B$ and a 50% chance of $A$ by $U$)
I think this might be where you got confused? If you compute the second lottery under $U'$ as just $0.5 \cdot (2 + c)$, then the expected values are different for any nonzero $c$!
The issue is that this ignores the implicit zero: the ‘nothing’ outcome gets shifted too. The real second equation is $0.5 \cdot (2 + c) + 0.5 \cdot (0 + c) = 1 + c$, which matches the certainty of $B$ at $1 + c$.
Which results in the same preference ordering.
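A quick numerical sanity check of that point — my own sketch, reusing those illustrative values:

```python
# Shifting every outcome's utility by the same constant (including the
# implicit 'nothing' outcome) leaves the comparison between the two
# lotteries unchanged. Values are illustrative.

def expected_utility(lottery, utility):
    return sum(p * utility[outcome] for outcome, p in lottery)

for c in (0.0, 5.0, -3.0):
    u = {"A": 2.0 + c, "B": 1.0 + c, "nothing": 0.0 + c}
    certain_b = expected_utility([("B", 1.0)], u)
    gamble_a = expected_utility([("A", 0.5), ("nothing", 0.5)], u)
    # Indifference is preserved for every shift c.
    assert abs(certain_b - gamble_a) < 1e-9
```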
Just 3 with a dash of 1?
I don’t understand the specific appeal of complete reproductive freedom. It is desirable to have that freedom, in the same way it is desirable to be allowed to do whatever I feel like doing. However, that more general heading of arbitrary freedom comes with the answer of ‘you do have to draw lines somewhere’. In a good future, I’m not allowed to harm a person (nonconsensually), I can’t requisition all the matter in the available universe for my personal projects without ~enough of the population endorsing it, and I can’t reproduce / construct arbitrary numbers of arbitrary new people. (Constructing arbitrary people obviously has moral issues too, so the lines get drawn both at ‘moral issues’ and at ‘resource limitations, even at that scale’.)

I think economic freedom looks significantly different in a post-aligned-AGI world than it does now. Like, there are still some concepts of trade going on, but I expect they often run in the background.
I’m not sure why you think the ‘default trajectory’ is 1+2. Aligned AGI seems most likely to go for some mix of 1+3, while pointing at the wider/more specific cause area of ‘what humans want’. A paperclipper just answers null to all of those, because it isn’t giving humans the right to create new people, or any economic freedom, unless they manage to be in a position to actually trade and have something worth offering.
I don’t think that what we want to align it to is that pertinent a question at this stage? In the specifics, that is; obviously it’s human values in some manner.
I expect that we want to align it via some process that lets it figure our values out without needing to decide on much of them now, ala CEV. Having a good theory of human values beforehand is useful for starting down a good track and verifying it, of course.
I think the generalized problem of ‘figure out how to make a process that is corrigible and learns our values in some form that is robust’ is easier than figuring out a decent specification of our values.
(Though simpler bounded-task agents seem likely before we manage that, so my answer to the overall question is ‘how do we make approximately corrigible powerful bounded-task agents to get to a position where humanity can safely focus on producing aligned AGI’)
I’m skeptical that the naming is bad; it fits with that definition and the common understanding of the word. The Orthogonality Thesis says that the two qualities of goal/value and intelligence are not necessarily related, which may seem trivial nowadays, but there used to be plenty of people going “if the AI becomes smart, even if it is weird, it will be moral towards humans!” through reasoning of the form “smart → not dumb goals like paperclips”. There is still structure imposed on which minds actually get created, based on the architectures, what humans train the AI on, etc. Just as two vectors can be orthogonal in R^2 while the actual points you plot in the space are correlated.
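To make the R^2 analogy concrete, a toy illustration (my own, not from the post):

```python
# The two axes of a space can be orthogonal even though the points actually
# sampled in that space are highly correlated along those axes.
import numpy as np

rng = np.random.default_rng(0)

x_axis = np.array([1.0, 0.0])
y_axis = np.array([0.0, 1.0])
print(np.dot(x_axis, y_axis))   # 0.0 -- the axes themselves are orthogonal

# But the distribution of observed points can still couple the coordinates:
# here y is mostly determined by x.
x = rng.normal(size=1000)
y = 0.9 * x + 0.1 * rng.normal(size=1000)
print(np.corrcoef(x, y)[0, 1])  # close to 1 -- the plotted points are correlated
```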